|
# Examples of training scripts for non-autoregressive machine translation models
|
|
|
### Non-autoregressive Transformer (NAT, Gu et al., 2017) |
|
Note that NAT needs an additional module for length prediction, whose loss is weighted by `--length-loss-factor`, so that the target length can be predicted before the whole sequence is generated; a decoding sketch follows the training command below.
|
```bash
fairseq-train \
    data-bin/wmt14_en_de_distill \
    --save-dir checkpoints \
    --ddp-backend=legacy_ddp \
    --task translation_lev \
    --criterion nat_loss \
    --arch nonautoregressive_transformer \
    --noise full_mask \
    --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9,0.98)' \
    --lr 0.0005 --lr-scheduler inverse_sqrt \
    --stop-min-lr '1e-09' --warmup-updates 10000 \
    --warmup-init-lr '1e-07' --label-smoothing 0.1 \
    --dropout 0.3 --weight-decay 0.01 \
    --decoder-learned-pos \
    --encoder-learned-pos \
    --pred-length-offset \
    --length-loss-factor 0.1 \
    --apply-bert-init \
    --log-format 'simple' --log-interval 100 \
    --fixed-validation-seed 7 \
    --max-tokens 8000 \
    --save-interval-updates 10000 \
    --max-update 300000
```
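
At inference time the predicted length is used to size the decoder input. A minimal single-pass decoding sketch, assuming the standard `fairseq-generate` iterative-refinement flags; the checkpoint path and `--batch-size` are illustrative, and `--iter-decode-max-iter 0` disables refinement so the model decodes in one shot:

```bash
# decode the test set with a single forward pass (no refinement)
fairseq-generate \
    data-bin/wmt14_en_de_distill \
    --gen-subset test \
    --task translation_lev \
    --path checkpoints/checkpoint_best.pt \
    --iter-decode-max-iter 0 \
    --beam 1 --remove-bpe \
    --print-step \
    --batch-size 400
```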
|
|
|
### Fast Structured Decoding for Sequence Models (NAT-CRF, Sun et al., 2019) |
|
Note that we implement a low-rank approximated CRF model by setting `--crf-lowrank-approx=32` and `--crf-beam-approx=64`, as described in the original paper. All other settings are the same as for the vanilla NAT model, and decoding uses the same single-pass command shown for NAT above, since the CRF decoding with the beam approximation runs inside the model.
|
```bash
fairseq-train \
    data-bin/wmt14_en_de_distill \
    --save-dir checkpoints \
    --ddp-backend=legacy_ddp \
    --task translation_lev \
    --criterion nat_loss \
    --arch nacrf_transformer \
    --noise full_mask \
    --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9,0.98)' \
    --lr 0.0005 --lr-scheduler inverse_sqrt \
    --stop-min-lr '1e-09' --warmup-updates 10000 \
    --warmup-init-lr '1e-07' --label-smoothing 0.1 \
    --dropout 0.3 --weight-decay 0.01 \
    --decoder-learned-pos \
    --encoder-learned-pos \
    --pred-length-offset \
    --length-loss-factor 0.1 \
    --word-ins-loss-factor 0.5 \
    --crf-lowrank-approx 32 \
    --crf-beam-approx 64 \
    --apply-bert-init \
    --log-format 'simple' --log-interval 100 \
    --fixed-validation-seed 7 \
    --max-tokens 8000 \
    --save-interval-updates 10000 \
    --max-update 300000
```
|
|
|
### Non-autoregressive Transformer with Iterative Refinement (iNAT, Lee et al., 2018) |
|
Note that `--train-step` sets the number of refinement iterations used during training, and `--dae-ratio` controls the ratio of denoising-autoencoder training described in the original paper; a decoding sketch follows the training command below.
|
```bash
fairseq-train \
    data-bin/wmt14_en_de_distill \
    --save-dir checkpoints \
    --ddp-backend=legacy_ddp \
    --task translation_lev \
    --criterion nat_loss \
    --arch iterative_nonautoregressive_transformer \
    --noise full_mask \
    --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9,0.98)' \
    --lr 0.0005 --lr-scheduler inverse_sqrt \
    --stop-min-lr '1e-09' --warmup-updates 10000 \
    --warmup-init-lr '1e-07' --label-smoothing 0.1 \
    --dropout 0.3 --weight-decay 0.01 \
    --decoder-learned-pos \
    --encoder-learned-pos \
    --pred-length-offset \
    --length-loss-factor 0.1 \
    --train-step 4 \
    --dae-ratio 0.5 \
    --stochastic-approx \
    --apply-bert-init \
    --log-format 'simple' --log-interval 100 \
    --fixed-validation-seed 7 \
    --max-tokens 8000 \
    --save-interval-updates 10000 \
    --max-update 300000
```
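
At inference time the refinement budget does not have to match `--train-step`. A minimal decoding sketch, assuming the standard `fairseq-generate` iterative-refinement flags; the iteration cap, checkpoint path, and batch size here are illustrative:

```bash
# decode with up to 9 additional refinement passes after the initial generation
fairseq-generate \
    data-bin/wmt14_en_de_distill \
    --gen-subset test \
    --task translation_lev \
    --path checkpoints/checkpoint_best.pt \
    --iter-decode-max-iter 9 \
    --beam 1 --remove-bpe \
    --print-step \
    --batch-size 400
```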
|
|
|
### Insertion Transformer (InsT, Stern et al., 2019) |
|
Note that we need to specify the slot loss (uniform or balanced binary tree) described in the original paper; here `--label-tau` controls its temperature. A decoding sketch follows the training command below.
|
|
|
```bash
fairseq-train \
    data-bin/wmt14_en_de_distill \
    --save-dir checkpoints \
    --ddp-backend=legacy_ddp \
    --task translation_lev \
    --criterion nat_loss \
    --arch insertion_transformer \
    --noise random_delete \
    --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9,0.98)' \
    --lr 0.0005 --lr-scheduler inverse_sqrt \
    --stop-min-lr '1e-09' --warmup-updates 10000 \
    --warmup-init-lr '1e-07' --label-smoothing 0.1 \
    --dropout 0.3 --weight-decay 0.01 \
    --decoder-learned-pos \
    --encoder-learned-pos \
    --apply-bert-init \
    --log-format 'simple' --log-interval 100 \
    --fixed-validation-seed 7 \
    --max-tokens 8000 \
    --save-interval-updates 10000 \
    --max-update 300000
```
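
At inference time the model keeps inserting tokens in parallel until no slot receives a new token. A minimal decoding sketch, assuming the standard `fairseq-generate` iterative-refinement flags; `--iter-decode-max-iter` caps the number of insertion rounds, and all values here are illustrative:

```bash
# iterative parallel insertion; stops early once no new tokens are inserted
fairseq-generate \
    data-bin/wmt14_en_de_distill \
    --gen-subset test \
    --task translation_lev \
    --path checkpoints/checkpoint_best.pt \
    --iter-decode-max-iter 10 \
    --iter-decode-eos-penalty 0 \
    --beam 1 --remove-bpe \
    --print-step \
    --batch-size 400
```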
|
|
|
### Mask Predict (CMLM, Ghazvininejad et al., 2019) |
|
```bash
fairseq-train \
    data-bin/wmt14_en_de_distill \
    --save-dir checkpoints \
    --ddp-backend=legacy_ddp \
    --task translation_lev \
    --criterion nat_loss \
    --arch cmlm_transformer \
    --noise random_mask \
    --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9,0.98)' \
    --lr 0.0005 --lr-scheduler inverse_sqrt \
    --stop-min-lr '1e-09' --warmup-updates 10000 \
    --warmup-init-lr '1e-07' --label-smoothing 0.1 \
    --dropout 0.3 --weight-decay 0.01 \
    --decoder-learned-pos \
    --encoder-learned-pos \
    --apply-bert-init \
    --log-format 'simple' --log-interval 100 \
    --fixed-validation-seed 7 \
    --max-tokens 8000 \
    --save-interval-updates 10000 \
    --max-update 300000
```
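
Decoding follows the mask-predict procedure: at each iteration the lowest-confidence tokens are re-masked and re-predicted. A minimal sketch, assuming the standard `fairseq-generate` iterative-refinement flags; the paper decodes with roughly 10 passes and several length candidates, which map here (as an assumption) to `--iter-decode-max-iter 9` plus `--iter-decode-with-beam 5`:

```bash
# mask-predict decoding with multiple length candidates
fairseq-generate \
    data-bin/wmt14_en_de_distill \
    --gen-subset test \
    --task translation_lev \
    --path checkpoints/checkpoint_best.pt \
    --iter-decode-max-iter 9 \
    --iter-decode-with-beam 5 \
    --beam 1 --remove-bpe \
    --print-step \
    --batch-size 400
```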
|
|
|
### Levenshtein Transformer (LevT, Gu et al., 2019) |
|
```bash
fairseq-train \
    data-bin/wmt14_en_de_distill \
    --save-dir checkpoints \
    --ddp-backend=legacy_ddp \
    --task translation_lev \
    --criterion nat_loss \
    --arch levenshtein_transformer \
    --noise random_delete \
    --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9,0.98)' \
    --lr 0.0005 --lr-scheduler inverse_sqrt \
    --stop-min-lr '1e-09' --warmup-updates 10000 \
    --warmup-init-lr '1e-07' --label-smoothing 0.1 \
    --dropout 0.3 --weight-decay 0.01 \
    --decoder-learned-pos \
    --encoder-learned-pos \
    --apply-bert-init \
    --log-format 'simple' --log-interval 100 \
    --fixed-validation-seed 7 \
    --max-tokens 8000 \
    --save-interval-updates 10000 \
    --max-update 300000
```
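
LevT decodes adaptively, alternating deletion and insertion until the sequence stops changing or the iteration cap is hit. A minimal sketch, assuming the standard `fairseq-generate` iterative-refinement flags; `--iter-decode-eos-penalty` discourages overly short outputs, and the value 3 (like the other settings here) is illustrative:

```bash
# adaptive decoding with deletion + insertion; eos penalty avoids too-short outputs
fairseq-generate \
    data-bin/wmt14_en_de_distill \
    --gen-subset test \
    --task translation_lev \
    --path checkpoints/checkpoint_best.pt \
    --iter-decode-max-iter 9 \
    --iter-decode-eos-penalty 3 \
    --beam 1 --remove-bpe \
    --print-step \
    --batch-size 400
```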
|
|