|
# Neural Machine Translation with Byte-Level Subwords |
|
|
|
https://arxiv.org/abs/1909.03341 |
|
|
|
We provide an implementation of byte-level byte-pair encoding (BBPE), taking IWSLT 2017 Fr-En translation as an example.
|
|
|
## Data |
|
Get data and generate fairseq binary dataset: |
|
```bash
bash ./get_data.sh
```
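
Roughly speaking, the script downloads the bitext, learns the (B)BPE vocabularies with sentencepiece, and binarizes each vocabulary setting into `data/bin_${VOCAB}`. For orientation only, the final binarization stage corresponds roughly to the sketch below; the encoded split names are hypothetical placeholders and `get_data.sh` remains the authoritative recipe:

```bash
# Rough sketch (not part of get_data.sh): binarizing one vocabulary setting with
# fairseq-preprocess. The split names data/{train,valid,test}.bbpe2048.{fr,en}
# are illustrative placeholders.
fairseq-preprocess --source-lang fr --target-lang en \
    --trainpref data/train.bbpe2048 --validpref data/valid.bbpe2048 --testpref data/test.bbpe2048 \
    --destdir data/bin_bbpe2048 --joined-dictionary --workers 4
```

`--joined-dictionary` keeps a single shared vocabulary for both languages, which is what `--share-all-embeddings` in the training command below expects.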
|
|
|
## Model Training |
|
Train a Transformer model with Bi-GRU embedding contextualization (implemented in `gru_transformer.py`):
|
```bash
# VOCAB=bytes
# VOCAB=chars
VOCAB=bbpe2048
# VOCAB=bpe2048
# VOCAB=bbpe4096
# VOCAB=bpe4096
# VOCAB=bpe16384
```
|
```bash
fairseq-train "data/bin_${VOCAB}" --task translation --user-dir examples/byte_level_bpe/gru_transformer \
    --arch gru_transformer --encoder-layers 2 --decoder-layers 2 --dropout 0.3 --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --log-format 'simple' --log-interval 100 --save-dir "checkpoints/${VOCAB}" \
    --batch-size 100 --max-update 100000 --update-freq 2
```
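
The generation commands below evaluate `checkpoint_last.pt`. If you would rather evaluate an averaged model, fairseq's `scripts/average_checkpoints.py` (run from the fairseq repository root) can combine the last few checkpoints; the output name `checkpoint_avg.pt` below is just an illustrative choice:

```bash
# Optional: average the last 5 epoch checkpoints into a single model file and
# point --path at it during generation instead of checkpoint_last.pt.
python scripts/average_checkpoints.py \
    --inputs "checkpoints/${VOCAB}" \
    --num-epoch-checkpoints 5 \
    --output "checkpoints/${VOCAB}/checkpoint_avg.pt"
```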
|
|
|
## Generation |
|
`fairseq-generate` requires the bytes (BBPE) decoder to convert the byte-level representation back to characters:
|
```bash
# BPE="--bpe bytes"
# BPE="--bpe characters"
BPE="--bpe byte_bpe --sentencepiece-model-path data/spm_bbpe2048.model"
# BPE="--bpe sentencepiece --sentencepiece-model data/spm_bpe2048.model"
# BPE="--bpe byte_bpe --sentencepiece-model-path data/spm_bbpe4096.model"
# BPE="--bpe sentencepiece --sentencepiece-model data/spm_bpe4096.model"
# BPE="--bpe sentencepiece --sentencepiece-model data/spm_bpe16384.model"
```
|
|
|
```bash
fairseq-generate "data/bin_${VOCAB}" --task translation --user-dir examples/byte_level_bpe/gru_transformer \
    --source-lang fr --gen-subset test --sacrebleu --path "checkpoints/${VOCAB}/checkpoint_last.pt" \
    --tokenizer moses --moses-target-lang en ${BPE}
```
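
`--sacrebleu` already prints a sacreBLEU score at the end of generation. If you also want the references and detokenized hypotheses as plain text, e.g. for inspection or external re-scoring, they can be pulled out of the generation log; `gen.out`, `gen.ref` and `gen.hyp` below are arbitrary file names:

```bash
# Save the generation log, then extract references (T-*) and detokenized hypotheses (D-*).
fairseq-generate "data/bin_${VOCAB}" --task translation --user-dir examples/byte_level_bpe/gru_transformer \
    --source-lang fr --gen-subset test --sacrebleu --path "checkpoints/${VOCAB}/checkpoint_last.pt" \
    --tokenizer moses --moses-target-lang en ${BPE} > gen.out
grep ^T- gen.out | sed 's/^T-//' | sort -n | cut -f2 > gen.ref   # references
grep ^D- gen.out | sed 's/^D-//' | sort -n | cut -f3 > gen.hyp   # detokenized hypotheses
sacrebleu gen.ref < gen.hyp                                      # stand-alone sacreBLEU check
```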
|
When using `fairseq-interactive`, the bytes (BBPE) encoder/decoder is required to tokenize the input data and detokenize model predictions:
|
```bash
fairseq-interactive "data/bin_${VOCAB}" --task translation --user-dir examples/byte_level_bpe/gru_transformer \
    --path "checkpoints/${VOCAB}/checkpoint_last.pt" --input data/test.fr --tokenizer moses --moses-source-lang fr \
    --moses-target-lang en ${BPE} --buffer-size 1000 --max-tokens 10000
```
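
Since `--input` defaults to standard input, single sentences can also be piped in directly; the French sentence below is just illustrative:

```bash
# Translate one raw sentence from stdin; the D-* output line holds the detokenized translation.
echo "Bonjour le monde." | fairseq-interactive "data/bin_${VOCAB}" --task translation \
    --user-dir examples/byte_level_bpe/gru_transformer \
    --path "checkpoints/${VOCAB}/checkpoint_last.pt" --tokenizer moses --moses-source-lang fr \
    --moses-target-lang en ${BPE} --buffer-size 1 --beam 5
```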
|
|
|
## Results |
|
| Vocabulary | Model | BLEU |
|:-------------:|:-------------:|:-------------:|
| Joint BPE 16k ([Kudo, 2018](https://arxiv.org/abs/1804.10959)) | 512d LSTM 2+2 | 33.81 |
| Joint BPE 16k | Transformer base 2+2 (w/ GRU) | 36.64 (36.72) |
| Joint BPE 4k | Transformer base 2+2 (w/ GRU) | 35.49 (36.10) |
| Joint BBPE 4k | Transformer base 2+2 (w/ GRU) | 35.61 (35.82) |
| Joint BPE 2k | Transformer base 2+2 (w/ GRU) | 34.87 (36.13) |
| Joint BBPE 2k | Transformer base 2+2 (w/ GRU) | 34.98 (35.43) |
| Characters | Transformer base 2+2 (w/ GRU) | 31.78 (33.30) |
| Bytes | Transformer base 2+2 (w/ GRU) | 31.57 (33.62) |

BLEU values in parentheses are for the "w/ GRU" variant with Bi-GRU embedding contextualization.
|
|
|
|
|
## Citation |
|
```
@misc{wang2019neural,
    title={Neural Machine Translation with Byte-Level Subwords},
    author={Changhan Wang and Kyunghyun Cho and Jiatao Gu},
    year={2019},
    eprint={1909.03341},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```
|
|
|
|
|
## Contact |
|
Changhan Wang ([[email protected]](mailto:[email protected])),
Kyunghyun Cho ([[email protected]](mailto:[email protected])),
Jiatao Gu ([[email protected]](mailto:[email protected]))
|
|