|
# Understanding Back-Translation at Scale (Edunov et al., 2018) |
|
|
|
This page includes pre-trained models from the paper [Understanding Back-Translation at Scale (Edunov et al., 2018)](https://arxiv.org/abs/1808.09381). |
|
|
|
## Pre-trained models |
|
|
|
Model | Description | Dataset | Download
---|---|---|---
`transformer.wmt18.en-de` | Transformer <br> ([Edunov et al., 2018](https://arxiv.org/abs/1808.09381)) <br> WMT'18 winner | [WMT'18 English-German](http://www.statmt.org/wmt18/translation-task.html) | [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/wmt18.en-de.ensemble.tar.gz) <br> See NOTE in the archive
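
If you prefer to work outside of torch.hub, the archive linked above can also be downloaded and inspected directly. This is a minimal sketch using standard tools; the exact layout of the extracted files may vary, so consult the NOTE file shipped in the archive:

```bash
# Download the WMT'18 En-De ensemble from the link in the table above
wget https://dl.fbaipublicfiles.com/fairseq/models/wmt18.en-de.ensemble.tar.gz

# Peek at the archive contents (model checkpoints plus the NOTE file) before extracting
tar -tzf wmt18.en-de.ensemble.tar.gz

# Extract into the current directory
tar -xzf wmt18.en-de.ensemble.tar.gz
```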
|
|
|
## Example usage (torch.hub) |
|
|
|
We require a few additional Python dependencies for preprocessing: |
|
```bash
pip install subword_nmt sacremoses
```
|
|
|
Then to generate translations from the full model ensemble: |
|
```python
import torch

# List available models
torch.hub.list('pytorch/fairseq')  # [..., 'transformer.wmt18.en-de', ... ]

# Load the WMT'18 En-De ensemble
en2de_ensemble = torch.hub.load(
    'pytorch/fairseq', 'transformer.wmt18.en-de',
    checkpoint_file='wmt18.model1.pt:wmt18.model2.pt:wmt18.model3.pt:wmt18.model4.pt:wmt18.model5.pt',
    tokenizer='moses', bpe='subword_nmt')

# The ensemble contains 5 models
len(en2de_ensemble.models)
# 5

# Translate
en2de_ensemble.translate('Hello world!')
# 'Hallo Welt!'
```
|
|
|
## Training your own model (WMT'18 English-German) |
|
|
|
The following instructions can be adapted to reproduce the models from the paper. |
|
|
|
|
|
#### Step 1. Prepare parallel data and optionally train a baseline (English-German) model |
|
|
|
First download and preprocess the data: |
|
```bash
# Download and prepare the data
cd examples/backtranslation/
bash prepare-wmt18en2de.sh
cd ../..

# Binarize the data
TEXT=examples/backtranslation/wmt18_en_de
fairseq-preprocess \
    --joined-dictionary \
    --source-lang en --target-lang de \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir data-bin/wmt18_en_de --thresholdtgt 0 --thresholdsrc 0 \
    --workers 20

# Copy the BPE code into the data-bin directory for future use
cp examples/backtranslation/wmt18_en_de/code data-bin/wmt18_en_de/code
```
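
Because `--joined-dictionary` builds a single vocabulary shared between English and German (which is what `--share-all-embeddings` expects during training below), the two dictionary files written by `fairseq-preprocess` should be identical. A quick, optional sanity check, assuming the output paths above:

```bash
# The joined dictionary is written once per language; both copies should match
diff data-bin/wmt18_en_de/dict.en.txt data-bin/wmt18_en_de/dict.de.txt \
    && echo "joined dictionary OK"
```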
|
|
|
Optionally, train a baseline model (English-German) using just the parallel data:
|
```bash
CHECKPOINT_DIR=checkpoints_en_de_parallel
fairseq-train --fp16 \
    data-bin/wmt18_en_de \
    --source-lang en --target-lang de \
    --arch transformer_wmt_en_de_big --share-all-embeddings \
    --dropout 0.3 --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 0.001 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --max-tokens 3584 --update-freq 16 \
    --max-update 30000 \
    --save-dir $CHECKPOINT_DIR
# Note: the above command assumes 8 GPUs. Adjust `--update-freq` if you have a
# different number of GPUs.
```
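
The note about `--update-freq` comes down to keeping the effective batch size constant: fairseq accumulates gradients over `--update-freq` batches per GPU, so the tokens per update scale as (number of GPUs) × `--max-tokens` × `--update-freq`. A small sketch of the arithmetic (the 8-GPU / `--update-freq 16` baseline comes from the command above; the GPU count below is just an example):

```bash
# With 8 GPUs and --update-freq 16 there are 8 * 16 = 128 GPU-batches per update.
# To keep that constant on a different number of GPUs, scale --update-freq accordingly.
NUM_GPUS=4                            # example: a 4-GPU machine
UPDATE_FREQ=$(( 8 * 16 / NUM_GPUS ))  # = 32 here
echo "use --update-freq $UPDATE_FREQ"
```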
|
|
|
Average the last 10 checkpoints: |
|
```bash
python scripts/average_checkpoints.py \
    --inputs $CHECKPOINT_DIR \
    --num-epoch-checkpoints 10 \
    --output $CHECKPOINT_DIR/checkpoint.avg10.pt
```
|
|
|
Evaluate BLEU: |
|
```bash
# tokenized BLEU on newstest2017:
bash examples/backtranslation/tokenized_bleu.sh \
    wmt17 \
    en-de \
    data-bin/wmt18_en_de \
    data-bin/wmt18_en_de/code \
    $CHECKPOINT_DIR/checkpoint.avg10.pt
# BLEU4 = 29.57, 60.9/35.4/22.9/15.5 (BP=1.000, ratio=1.014, syslen=63049, reflen=62152)
# compare to 29.46 in Table 1, which is also for tokenized BLEU

# generally it's better to report (detokenized) sacrebleu though:
bash examples/backtranslation/sacrebleu.sh \
    wmt17 \
    en-de \
    data-bin/wmt18_en_de \
    data-bin/wmt18_en_de/code \
    $CHECKPOINT_DIR/checkpoint.avg10.pt
# BLEU+case.mixed+lang.en-de+numrefs.1+smooth.exp+test.wmt17+tok.13a+version.1.4.3 = 29.0 60.6/34.7/22.4/14.9 (BP = 1.000 ratio = 1.013 hyp_len = 62099 ref_len = 61287)
```
|
|
|
|
|
#### Step 2. Back-translate monolingual German data |
|
|
|
Train a reverse model (German-English) to do the back-translation: |
|
```bash
CHECKPOINT_DIR=checkpoints_de_en_parallel
fairseq-train --fp16 \
    data-bin/wmt18_en_de \
    --source-lang de --target-lang en \
    --arch transformer_wmt_en_de_big --share-all-embeddings \
    --dropout 0.3 --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 0.001 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --max-tokens 3584 --update-freq 16 \
    --max-update 30000 \
    --save-dir $CHECKPOINT_DIR
# Note: the above command assumes 8 GPUs. Adjust `--update-freq` if you have a
# different number of GPUs.
```
|
|
|
Let's evaluate the back-translation (BT) model to make sure it is well trained: |
|
```bash
bash examples/backtranslation/sacrebleu.sh \
    wmt17 \
    de-en \
    data-bin/wmt18_en_de \
    data-bin/wmt18_en_de/code \
    $CHECKPOINT_DIR/checkpoint_best.pt
# BLEU+case.mixed+lang.de-en+numrefs.1+smooth.exp+test.wmt17+tok.13a+version.1.4.3 = 34.9 66.9/41.8/28.5/19.9 (BP = 0.983 ratio = 0.984 hyp_len = 63342 ref_len = 64399)
# compare to the best system from WMT'17 which scored 35.1: http://matrix.statmt.org/matrix/systems_list/1868
```
|
|
|
Next prepare the monolingual data: |
|
```bash
# Download and prepare the monolingual data
# By default the script samples 25M monolingual sentences, which after
# deduplication should be just over 24M sentences. These are split into 25
# shards, each with 1M sentences (except for the last shard).
cd examples/backtranslation/
bash prepare-de-monolingual.sh
cd ../..

# Binarize each shard of the monolingual data
TEXT=examples/backtranslation/wmt18_de_mono
for SHARD in $(seq -f "%02g" 0 24); do \
    fairseq-preprocess \
        --only-source \
        --source-lang de --target-lang en \
        --joined-dictionary \
        --srcdict data-bin/wmt18_en_de/dict.de.txt \
        --testpref $TEXT/bpe.monolingual.dedup.${SHARD} \
        --destdir data-bin/wmt18_de_mono/shard${SHARD} \
        --workers 20; \
    cp data-bin/wmt18_en_de/dict.en.txt data-bin/wmt18_de_mono/shard${SHARD}/; \
done
```
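
After the loop finishes there should be one binarized directory per shard. A quick optional check, assuming the paths above:

```bash
# Expect 25 shard directories: shard00 through shard24
ls -d data-bin/wmt18_de_mono/shard* | wc -l
# 25
```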
|
|
|
Now we're ready to perform back-translation over the monolingual data. The
following command generates via sampling, but it's possible to use greedy
decoding (`--beam 1`), beam search (`--beam 5`),
top-k sampling (`--sampling --beam 1 --sampling-topk 10`), etc.:
|
```bash
mkdir backtranslation_output
for SHARD in $(seq -f "%02g" 0 24); do \
    fairseq-generate --fp16 \
        data-bin/wmt18_de_mono/shard${SHARD} \
        --path $CHECKPOINT_DIR/checkpoint_best.pt \
        --skip-invalid-size-inputs-valid-test \
        --max-tokens 4096 \
        --sampling --beam 1 \
    > backtranslation_output/sampling.shard${SHARD}.out; \
done
```
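
For comparison, here is the same generation step for a single shard using the top-k sampling option mentioned above (`--sampling --beam 1 --sampling-topk 10`); the remaining flags are unchanged from the command above, and the output file name is arbitrary:

```bash
# Back-translate one shard with top-k (k=10) sampling instead of unrestricted sampling
fairseq-generate --fp16 \
    data-bin/wmt18_de_mono/shard00 \
    --path $CHECKPOINT_DIR/checkpoint_best.pt \
    --skip-invalid-size-inputs-valid-test \
    --max-tokens 4096 \
    --sampling --beam 1 --sampling-topk 10 \
> backtranslation_output/topk10.shard00.out
```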
|
|
|
After BT, use the `extract_bt_data.py` script to re-combine the shards, extract
the back-translations and apply length ratio filters:
|
```bash
python examples/backtranslation/extract_bt_data.py \
    --minlen 1 --maxlen 250 --ratio 1.5 \
    --output backtranslation_output/bt_data --srclang en --tgtlang de \
    backtranslation_output/sampling.shard*.out

# Ensure lengths are the same:
# wc -l backtranslation_output/bt_data.{en,de}
#   21795614 backtranslation_output/bt_data.en
#   21795614 backtranslation_output/bt_data.de
#   43591228 total
```
|
|
|
Binarize the filtered BT data and combine it with the parallel data: |
|
```bash
TEXT=backtranslation_output
fairseq-preprocess \
    --source-lang en --target-lang de \
    --joined-dictionary \
    --srcdict data-bin/wmt18_en_de/dict.en.txt \
    --trainpref $TEXT/bt_data \
    --destdir data-bin/wmt18_en_de_bt \
    --workers 20

# We want to train on the combined data, so we'll symlink the parallel + BT data
# in the wmt18_en_de_para_plus_bt directory. We link the parallel data as "train"
# and the BT data as "train1", so that fairseq will combine them automatically
# and so that we can use the `--upsample-primary` option to upsample the
# parallel data (if desired).
PARA_DATA=$(readlink -f data-bin/wmt18_en_de)
BT_DATA=$(readlink -f data-bin/wmt18_en_de_bt)
COMB_DATA=data-bin/wmt18_en_de_para_plus_bt
mkdir -p $COMB_DATA
for LANG in en de; do \
    ln -s ${PARA_DATA}/dict.$LANG.txt ${COMB_DATA}/dict.$LANG.txt; \
    for EXT in bin idx; do \
        ln -s ${PARA_DATA}/train.en-de.$LANG.$EXT ${COMB_DATA}/train.en-de.$LANG.$EXT; \
        ln -s ${BT_DATA}/train.en-de.$LANG.$EXT ${COMB_DATA}/train1.en-de.$LANG.$EXT; \
        ln -s ${PARA_DATA}/valid.en-de.$LANG.$EXT ${COMB_DATA}/valid.en-de.$LANG.$EXT; \
        ln -s ${PARA_DATA}/test.en-de.$LANG.$EXT ${COMB_DATA}/test.en-de.$LANG.$EXT; \
    done; \
done
```
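
If the symlinking worked, the combined directory should now expose the parallel data as `train`, the BT data as `train1`, and the original valid/test sets and dictionaries alongside them. An optional sanity check:

```bash
# List the combined directory; expect dict.{en,de}.txt plus
# train/train1/valid/test .en-de.{en,de}.{bin,idx} symlinks
ls -l data-bin/wmt18_en_de_para_plus_bt
```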
|
|
|
|
|
#### Step 3. Train an English-German model over the combined parallel + BT data
|
|
|
Finally we can train a model over the parallel + BT data: |
|
```bash
CHECKPOINT_DIR=checkpoints_en_de_parallel_plus_bt
fairseq-train --fp16 \
    data-bin/wmt18_en_de_para_plus_bt \
    --upsample-primary 16 \
    --source-lang en --target-lang de \
    --arch transformer_wmt_en_de_big --share-all-embeddings \
    --dropout 0.3 --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 0.0007 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --max-tokens 3584 --update-freq 16 \
    --max-update 100000 \
    --save-dir $CHECKPOINT_DIR
# Note: the above command assumes 8 GPUs. Adjust `--update-freq` if you have a
# different number of GPUs.
```
|
|
|
Average the last 10 checkpoints: |
|
```bash
python scripts/average_checkpoints.py \
    --inputs $CHECKPOINT_DIR \
    --num-epoch-checkpoints 10 \
    --output $CHECKPOINT_DIR/checkpoint.avg10.pt
```
|
|
|
Evaluate BLEU: |
|
```bash
# tokenized BLEU on newstest2017:
bash examples/backtranslation/tokenized_bleu.sh \
    wmt17 \
    en-de \
    data-bin/wmt18_en_de \
    data-bin/wmt18_en_de/code \
    $CHECKPOINT_DIR/checkpoint.avg10.pt
# BLEU4 = 32.35, 64.4/38.9/26.2/18.3 (BP=0.977, ratio=0.977, syslen=60729, reflen=62152)
# compare to 32.35 in Table 1, which is also for tokenized BLEU

# generally it's better to report (detokenized) sacrebleu:
bash examples/backtranslation/sacrebleu.sh \
    wmt17 \
    en-de \
    data-bin/wmt18_en_de \
    data-bin/wmt18_en_de/code \
    $CHECKPOINT_DIR/checkpoint.avg10.pt
# BLEU+case.mixed+lang.en-de+numrefs.1+smooth.exp+test.wmt17+tok.13a+version.1.4.3 = 31.5 64.3/38.2/25.6/17.6 (BP = 0.971 ratio = 0.971 hyp_len = 59515 ref_len = 61287)
```
|
|
|
|
|
## Citation |
|
```bibtex
@inproceedings{edunov2018backtranslation,
  title = {Understanding Back-Translation at Scale},
  author = {Edunov, Sergey and Ott, Myle and Auli, Michael and Grangier, David},
  booktitle = {Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year = 2018,
}
```
|
|