# Neural Machine Translation with Byte-Level Subwords

https://arxiv.org/abs/1909.03341

We provide an implementation of byte-level byte-pair encoding (BBPE), taking IWSLT 2017 Fr-En translation as an example.
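
To make the byte-level idea concrete, the sketch below prints the UTF-8 bytes of a string. Every character decomposes into one to four bytes, so a byte-level base vocabulary never needs more than 256 symbols. `utf8_bytes` is a hypothetical helper written for illustration; it is not part of fairseq or of this example's scripts.

```bash
# Hypothetical helper: print the UTF-8 bytes of a string as hex values.
utf8_bytes() {
  printf '%s' "$1" | od -An -v -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

# "a" is a single byte in UTF-8.
utf8_bytes 'a'                        # prints: 61
echo
# "é" (written with octal escapes to avoid encoding issues) is two bytes.
utf8_bytes "$(printf '\303\251')"     # prints: c3 a9
echo
```

BBPE then learns BPE merges over such byte sequences rather than over characters.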
## Data

Get data and generate fairseq binary dataset:

```bash
bash ./get_data.sh
```
## Model Training

Train a Transformer model with Bi-GRU embedding contextualization (implemented in `gru_transformer.py`):

```bash
# Select one vocabulary by leaving exactly one line uncommented.
# VOCAB=bytes
# VOCAB=chars
VOCAB=bbpe2048
# VOCAB=bpe2048
# VOCAB=bbpe4096
# VOCAB=bpe4096
# VOCAB=bpe16384
```
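
Before training, it can help to confirm that the binarized data for the selected vocabulary exists. This is a minimal sketch, assuming `get_data.sh` writes its output to `data/bin_${VOCAB}` as the path in the training command suggests; `check_vocab_data` is a hypothetical helper, not something shipped with the example.

```bash
# Hypothetical helper: check that the binarized dataset for a vocabulary exists.
# $1 = vocabulary name (e.g. bbpe2048), $2 = data root (assumed default: data)
check_vocab_data() {
  local vocab="$1" root="${2:-data}"
  if [ -d "${root}/bin_${vocab}" ]; then
    echo "found ${root}/bin_${vocab}"
  else
    echo "missing ${root}/bin_${vocab}; run get_data.sh first" >&2
    return 1
  fi
}
```

Calling `check_vocab_data "${VOCAB}"` before `fairseq-train` gives an early, readable failure if the dataset was never generated.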
```bash
fairseq-train "data/bin_${VOCAB}" --task translation --user-dir examples/byte_level_bpe/gru_transformer \
    --arch gru_transformer --encoder-layers 2 --decoder-layers 2 --dropout 0.3 --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --log-format 'simple' --log-interval 100 --save-dir "checkpoints/${VOCAB}" \
    --batch-size 100 --max-update 100000 --update-freq 2
```
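
The command combines `--batch-size 100` with `--update-freq 2`. As a rough sketch, assuming fairseq's usual semantics (`--batch-size` counts sentences per GPU and `--update-freq` accumulates gradients over that many batches), the number of sentences per optimizer step can be estimated as follows. `effective_batch` is a hypothetical helper, and the GPU count is an assumption you should adjust to your setup.

```bash
# Hypothetical helper: sentences contributing to one optimizer step.
# $1 = --batch-size, $2 = --update-freq, $3 = number of GPUs (assumed default: 1)
effective_batch() {
  local batch_size="$1" update_freq="$2" num_gpus="${3:-1}"
  echo $(( batch_size * update_freq * num_gpus ))
}

effective_batch 100 2      # prints: 200
effective_batch 100 2 4    # prints: 800
```

When changing the number of GPUs, adjusting `--update-freq` to keep this product constant is one way to stay close to the configuration shown above.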
## Generation

`fairseq-generate` requires a bytes (BBPE) decoder to convert the byte-level representation back to characters:

```bash
# Quote the value so that the whole flag string is assigned to BPE.
# BPE="--bpe bytes"
# BPE="--bpe characters"
BPE="--bpe byte_bpe --sentencepiece-model-path data/spm_bbpe2048.model"
# BPE="--bpe sentencepiece --sentencepiece-model data/spm_bpe2048.model"
# BPE="--bpe byte_bpe --sentencepiece-model-path data/spm_bbpe4096.model"
# BPE="--bpe sentencepiece --sentencepiece-model data/spm_bpe4096.model"
# BPE="--bpe sentencepiece --sentencepiece-model data/spm_bpe16384.model"
```
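
The `BPE` flags must match the vocabulary chosen at training time. One way to keep the two in sync is to derive the flags from `VOCAB`. The sketch below only encodes the pairings listed above; `bpe_flags_for` is a hypothetical helper, and the `data/spm_*.model` paths are taken from this README rather than verified against `get_data.sh`.

```bash
# Hypothetical helper: print the --bpe flags matching a VOCAB name.
# $1 = vocabulary name, $2 = directory with SentencePiece models (assumed default: data)
bpe_flags_for() {
  local vocab="$1" spm_dir="${2:-data}"
  case "$vocab" in
    bytes) echo "--bpe bytes" ;;
    chars) echo "--bpe characters" ;;
    bbpe*) echo "--bpe byte_bpe --sentencepiece-model-path ${spm_dir}/spm_${vocab}.model" ;;
    bpe*)  echo "--bpe sentencepiece --sentencepiece-model ${spm_dir}/spm_${vocab}.model" ;;
    *)     echo "unknown vocabulary: ${vocab}" >&2; return 1 ;;
  esac
}

BPE="$(bpe_flags_for bbpe2048)"
echo "${BPE}"   # prints: --bpe byte_bpe --sentencepiece-model-path data/spm_bbpe2048.model
```

Leave `${BPE}` unquoted when passing it to `fairseq-generate` or `fairseq-interactive` so that the shell splits it into separate arguments.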
```bash
fairseq-generate "data/bin_${VOCAB}" --task translation --user-dir examples/byte_level_bpe/gru_transformer \
    --source-lang fr --gen-subset test --sacrebleu --path "checkpoints/${VOCAB}/checkpoint_last.pt" \
    --tokenizer moses --moses-target-lang en ${BPE}
```
When using `fairseq-interactive`, a bytes (BBPE) encoder/decoder is required to tokenize the input data and detokenize the model predictions:

```bash
fairseq-interactive "data/bin_${VOCAB}" --task translation --user-dir examples/byte_level_bpe/gru_transformer \
    --path "checkpoints/${VOCAB}/checkpoint_last.pt" --input data/test.fr --tokenizer moses --moses-source-lang fr \
    --moses-target-lang en ${BPE} --buffer-size 1000 --max-tokens 10000
```
## Results

| Vocabulary | Model | BLEU |
|:-------------:|:-------------:|:-------------:|
| Joint BPE 16k ([Kudo, 2018](https://arxiv.org/abs/1804.10959)) | 512d LSTM 2+2 | 33.81 |
| Joint BPE 16k | Transformer base 2+2 (w/ GRU) | 36.64 (36.72) |
| Joint BPE 4k | Transformer base 2+2 (w/ GRU) | 35.49 (36.10) |
| Joint BBPE 4k | Transformer base 2+2 (w/ GRU) | 35.61 (35.82) |
| Joint BPE 2k | Transformer base 2+2 (w/ GRU) | 34.87 (36.13) |
| Joint BBPE 2k | Transformer base 2+2 (w/ GRU) | 34.98 (35.43) |
| Characters | Transformer base 2+2 (w/ GRU) | 31.78 (33.30) |
| Bytes | Transformer base 2+2 (w/ GRU) | 31.57 (33.62) |
## Citation

```
@misc{wang2019neural,
    title={Neural Machine Translation with Byte-Level Subwords},
    author={Changhan Wang and Kyunghyun Cho and Jiatao Gu},
    year={2019},
    eprint={1909.03341},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```
## Contact

Changhan Wang ([[email protected]](mailto:[email protected])),
Kyunghyun Cho ([[email protected]](mailto:[email protected])),
Jiatao Gu ([[email protected]](mailto:[email protected]))