Translation
This example trains a model on the WMT'14 English-to-German news translation task. It uses on-the-fly tokenization with sentencepiece, and sacrebleu for evaluation.
Step 0: Download the data and prepare the subwords model
Preliminary steps are defined in examples/scripts/prepare_wmt_data.sh. The following commands download the necessary datasets and prepare a sentencepiece model:
chmod u+x prepare_wmt_data.sh
./prepare_wmt_data.sh
Note: the sentencepiece binaries must be installed before running this script.
Step 1: Build the vocabulary
We need to set up the desired configuration with (1) the data and (2) the tokenization options:
# wmt14_en_de.yaml
save_data: data/wmt/run/example
## Where the vocab(s) will be written
src_vocab: data/wmt/run/example.vocab.src
tgt_vocab: data/wmt/run/example.vocab.tgt
# Corpus opts:
data:
    commoncrawl:
        path_src: data/wmt/commoncrawl.de-en.en
        path_tgt: data/wmt/commoncrawl.de-en.de
        transforms: [sentencepiece, filtertoolong]
        weight: 23
    europarl:
        path_src: data/wmt/europarl-v7.de-en.en
        path_tgt: data/wmt/europarl-v7.de-en.de
        transforms: [sentencepiece, filtertoolong]
        weight: 19
    news_commentary:
        path_src: data/wmt/news-commentary-v11.de-en.en
        path_tgt: data/wmt/news-commentary-v11.de-en.de
        transforms: [sentencepiece, filtertoolong]
        weight: 3
    valid:
        path_src: data/wmt/valid.en
        path_tgt: data/wmt/valid.de
        transforms: [sentencepiece]
### Transform related opts:
#### Subword
src_subword_model: data/wmt/wmtende.model
tgt_subword_model: data/wmt/wmtende.model
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0
#### Filter
src_seq_length: 150
tgt_seq_length: 150
# silently ignore empty lines in the data
skip_empty_level: silent
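The weight values in the configuration above control how the three training corpora are mixed. As a rough illustration, the sketch below assumes that weights act as relative sampling proportions; the helper name mixing_proportions is hypothetical and is not part of OpenNMT-py.

```python
def mixing_proportions(weights):
    """Return each corpus's share of the training stream.

    Assumption: integer weights are treated as relative proportions.
    """
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Weights copied from the YAML configuration above.
weights = {"commoncrawl": 23, "europarl": 19, "news_commentary": 3}
shares = mixing_proportions(weights)

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Under this assumption, roughly half of the training examples come from commoncrawl, and news_commentary contributes only a small fraction.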
Then we can execute the vocabulary building script. Let's set -n_sample to -1 to compute the vocabulary over the whole corpora:
onmt_build_vocab -config wmt14_en_de.yaml -n_sample -1
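To make the role of -n_sample more concrete, here is a minimal sketch of counting token frequencies over either a sample or a whole corpus. It is only an illustration: the function count_tokens is hypothetical, it assumes the input lines are already split into subword pieces, and it assumes a sample is simply the first n lines. It is not the implementation used by onmt_build_vocab.

```python
from collections import Counter
from itertools import islice


def count_tokens(lines, n_sample=-1):
    """Count whitespace-separated tokens over the first n_sample lines.

    n_sample=-1 means "use every line" (assumed sampling behaviour).
    """
    if n_sample >= 0:
        lines = islice(lines, n_sample)
    counter = Counter()
    for line in lines:
        counter.update(line.split())
    return counter


# Toy corpus of already-tokenized lines (illustrative data only).
corpus = ["▁the ▁cat ▁sat", "▁the ▁dog ▁sat", "▁a ▁bird"]

full = count_tokens(corpus, n_sample=-1)    # whole corpus
partial = count_tokens(corpus, n_sample=1)  # first line only

print(full.most_common(2))
print(len(partial))
```

A vocabulary built from a small sample can miss pieces that only appear later in the data, which is why the example above counts over everything.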
Step 2: Train the model
We need to add the following parameters to the YAML configuration:
...
# General opts
save_model: data/wmt/run/model
keep_checkpoint: 50
save_checkpoint_steps: 5000
average_decay: 0.0005
seed: 1234
report_every: 100
train_steps: 100000
valid_steps: 5000
# Batching
queue_size: 10000
bucket_size: 32768
world_size: 2
gpu_ranks: [0, 1]
batch_type: "tokens"
batch_size: 4096
valid_batch_size: 16
batch_size_multiple: 1
max_generator_batches: 0
accum_count: [3]
accum_steps: [0]
# Optimization
model_dtype: "fp32"
optim: "adam"
learning_rate: 2
warmup_steps: 8000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"
# Model
encoder_type: transformer
decoder_type: transformer
enc_layers: 6
dec_layers: 6
heads: 8
rnn_size: 512
word_vec_size: 512
transformer_ff: 2048
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]
share_decoder_embeddings: true
share_embeddings: true
Training can then be launched with:
onmt_train -config wmt14_en_de.yaml
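A few quantities implied by the training options above can be worked out by hand. The sketch below is illustrative only: the helper names are hypothetical, the effective batch is an upper bound because token-based batches are not always full, and the schedule is the standard noam formula (learning rate scaled by the inverse square root of the model size, with linear warmup followed by inverse square root decay).

```python
def tokens_per_update(batch_size, accum_count, world_size):
    """Upper bound on the number of tokens contributing to one optimizer step."""
    return batch_size * accum_count * world_size


def noam_lr(step, learning_rate, model_size, warmup_steps):
    """Standard noam schedule: linear warmup, then inverse square root decay."""
    scale = min(step ** -0.5, step * warmup_steps ** -1.5)
    return learning_rate * model_size ** -0.5 * scale


# Values copied from the YAML configuration above.
effective_batch = tokens_per_update(batch_size=4096, accum_count=3, world_size=2)
peak_lr = noam_lr(step=8000, learning_rate=2.0, model_size=512, warmup_steps=8000)
n_checkpoints = 100000 // 5000  # train_steps // save_checkpoint_steps

print(effective_batch)    # 24576
print(f"{peak_lr:.2e}")   # about 9.88e-04, reached at the end of warmup
print(n_checkpoints)      # 20, below the keep_checkpoint limit of 50
```

This shows why learning_rate: 2 is not as large as it looks: under the noam schedule it is a multiplier, and the actual step size stays below 1e-3.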
Step 3: Translate and evaluate
We need to tokenize the test set with the same sentencepiece model used in training:
spm_encode --model=data/wmt/wmtende.model \
    < data/wmt/test.en \
    > data/wmt/test.en.sp
spm_encode --model=data/wmt/wmtende.model \
    < data/wmt/test.de \
    > data/wmt/test.de.sp
We can translate the test set with each saved checkpoint using the following command:
for checkpoint in data/wmt/run/model_step*.pt; do
    echo "# Translating with checkpoint $checkpoint"
    base=$(basename "$checkpoint")
    onmt_translate \
        -gpu 0 \
        -batch_size 16384 -batch_type tokens \
        -beam_size 5 \
        -model "$checkpoint" \
        -src data/wmt/test.en.sp \
        -tgt data/wmt/test.de.sp \
        -output "data/wmt/test.de.hyp_${base%.*}.sp"
done
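The loop above derives one output file per checkpoint by stripping the checkpoint's extension with ${base%.*}. The following sketch mirrors that naming rule so the resulting file names are easy to predict. The function hypothesis_path and the checkpoint name model_step_5000.pt are illustrative assumptions.

```python
import os


def hypothesis_path(checkpoint, out_dir="data/wmt", suffix=".sp"):
    """Mirror the shell naming rule: basename, drop the last extension, add a prefix."""
    base = os.path.basename(checkpoint)   # like $(basename "$checkpoint")
    stem, _ext = os.path.splitext(base)   # like ${base%.*}
    return f"{out_dir}/test.de.hyp_{stem}{suffix}"


print(hypothesis_path("data/wmt/run/model_step_5000.pt"))
# data/wmt/test.de.hyp_model_step_5000.sp
```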
Prior to evaluation, we need to detokenize the hypotheses:
for checkpoint in data/wmt/run/model_step*.pt; do
    base=$(basename "$checkpoint")
    spm_decode \
        --model=data/wmt/wmtende.model \
        --input_format=piece \
        < "data/wmt/test.de.hyp_${base%.*}.sp" \
        > "data/wmt/test.de.hyp_${base%.*}"
done
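To see what detokenization does to a hypothesis line, here is a simplified sketch of decoding space-separated sentencepiece pieces. It assumes the usual convention that the marker character U+2581 (▁) denotes a word boundary, and it ignores control symbols and byte fallback, which spm_decode handles properly. The function decode_pieces is hypothetical.

```python
WORD_BOUNDARY = "\u2581"  # the "▁" marker that sentencepiece puts at word starts


def decode_pieces(line):
    """Join space-separated pieces and turn boundary markers back into spaces."""
    pieces = line.split()
    return "".join(pieces).replace(WORD_BOUNDARY, " ").strip()


print(decode_pieces("▁Das ▁ist ▁ein ▁Bei spiel ."))
# Das ist ein Beispiel.
```

BLEU is then computed on this plain text rather than on subword pieces, which keeps the score comparable across different tokenization choices.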
Finally, we can compute detokenized BLEU with sacrebleu:
for checkpoint in data/wmt/run/model_step*.pt; do
    echo "$checkpoint"
    base=$(basename "$checkpoint")
    sacrebleu data/wmt/test.de < "data/wmt/test.de.hyp_${base%.*}"
done