|
|
|
# Translation |
|
|
|
This example trains a model for the [WMT'14 English to German news translation task](https://www.statmt.org/wmt14/translation-task.html). It uses on-the-fly tokenization with [sentencepiece](https://github.com/google/sentencepiece) and [sacrebleu](https://github.com/mjpost/sacrebleu) for evaluation. |
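
If these tools are not already installed, both OpenNMT-py and sacrebleu are available from PyPI (a minimal setup sketch, assuming a working Python environment; the sentencepiece command-line binaries are handled separately, see the note in Step 0):

```bash
pip install OpenNMT-py sacrebleu
```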
|
|
|
|
|
## Step 0: Download the data and prepare the subword model |
|
|
|
Preliminary steps are handled by the [`examples/scripts/prepare_wmt_data.sh`](https://github.com/OpenNMT/OpenNMT-py/tree/master/examples/scripts/prepare_wmt_data.sh) script. The following commands download the necessary datasets and prepare a sentencepiece model: |
|
```bash
chmod u+x prepare_wmt_data.sh
./prepare_wmt_data.sh
```
|
|
|
Note: the [sentencepiece](https://github.com/google/sentencepiece) binaries must be installed before running this script. |
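
A quick way to verify that the binaries are on your `PATH` before launching the script (a small sketch using standard shell tooling; `spm_train` is assumed to be what the preparation script uses to build the subword model, while `spm_encode` and `spm_decode` are used explicitly in Step 3):

```bash
# Report any missing sentencepiece command-line tool
for tool in spm_train spm_encode spm_decode; do
    command -v "$tool" > /dev/null || echo "missing: $tool"
done
```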
|
|
|
## Step 1: Build the vocabulary |
|
|
|
We need to set up the desired configuration with 1. the data and 2. the tokenization options: |
|
|
|
```yaml
# wmt14_en_de.yaml
save_data: data/wmt/run/example
## Where the vocab(s) will be written
src_vocab: data/wmt/run/example.vocab.src
tgt_vocab: data/wmt/run/example.vocab.tgt

# Corpus opts:
data:
    commoncrawl:
        path_src: data/wmt/commoncrawl.de-en.en
        path_tgt: data/wmt/commoncrawl.de-en.de
        transforms: [sentencepiece, filtertoolong]
        weight: 23
    europarl:
        path_src: data/wmt/europarl-v7.de-en.en
        path_tgt: data/wmt/europarl-v7.de-en.de
        transforms: [sentencepiece, filtertoolong]
        weight: 19
    news_commentary:
        path_src: data/wmt/news-commentary-v11.de-en.en
        path_tgt: data/wmt/news-commentary-v11.de-en.de
        transforms: [sentencepiece, filtertoolong]
        weight: 3
    valid:
        path_src: data/wmt/valid.en
        path_tgt: data/wmt/valid.de
        transforms: [sentencepiece]

### Transform related opts:
#### Subword
src_subword_model: data/wmt/wmtende.model
tgt_subword_model: data/wmt/wmtende.model
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0
#### Filter
src_seq_length: 150
tgt_seq_length: 150

# Silently ignore empty lines in the data
skip_empty_level: silent
```
|
|
|
Then we can run the vocabulary building step. Setting `-n_sample` to `-1` computes the vocabulary over the full corpora: |
|
|
|
```bash
onmt_build_vocab -config wmt14_en_de.yaml -n_sample -1
```
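
Once this finishes, the files declared under `src_vocab` and `tgt_vocab` should have been written. A quick sanity check (illustrative only; the exact layout of the vocab files may vary across OpenNMT-py versions):

```bash
# Count entries and peek at the most frequent tokens
wc -l data/wmt/run/example.vocab.src data/wmt/run/example.vocab.tgt
head -n 5 data/wmt/run/example.vocab.src
```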
|
|
|
## Step 2: Train the model |
|
|
|
We need to add the following parameters to the YAML configuration: |
|
|
|
```yaml
...

# General opts
save_model: data/wmt/run/model
keep_checkpoint: 50
save_checkpoint_steps: 5000
average_decay: 0.0005
seed: 1234
report_every: 100
train_steps: 100000
valid_steps: 5000

# Batching
queue_size: 10000
bucket_size: 32768
world_size: 2
gpu_ranks: [0, 1]
batch_type: "tokens"
batch_size: 4096
valid_batch_size: 16
batch_size_multiple: 1
max_generator_batches: 0
accum_count: [3]
accum_steps: [0]

# Optimization
model_dtype: "fp32"
optim: "adam"
learning_rate: 2
warmup_steps: 8000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model
encoder_type: transformer
decoder_type: transformer
enc_layers: 6
dec_layers: 6
heads: 8
rnn_size: 512
word_vec_size: 512
transformer_ff: 2048
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]
share_decoder_embeddings: true
share_embeddings: true
```
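
With the data, transform and model options gathered in a single file, training is launched from that same configuration (this sketch assumes two visible GPUs, matching `world_size: 2` and `gpu_ranks: [0, 1]` above; adjust them to your hardware):

```bash
onmt_train -config wmt14_en_de.yaml
```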
|
|
|
## Step 3: Translate and evaluate |
|
|
|
We need to tokenize the test set with the same sentencepiece model used in training: |
|
|
|
```bash
spm_encode --model=data/wmt/wmtende.model \
    < data/wmt/test.en \
    > data/wmt/test.en.sp
spm_encode --model=data/wmt/wmtende.model \
    < data/wmt/test.de \
    > data/wmt/test.de.sp
```
|
|
|
We can then translate the test set with the following command: |
|
|
|
```bash
for checkpoint in data/wmt/run/model_step*.pt; do
    echo "# Translating with checkpoint $checkpoint"
    base=$(basename $checkpoint)
    onmt_translate \
        -gpu 0 \
        -batch_size 16384 -batch_type tokens \
        -beam_size 5 \
        -model $checkpoint \
        -src data/wmt/test.en.sp \
        -tgt data/wmt/test.de.sp \
        -output data/wmt/test.de.hyp_${base%.*}.sp
done
```
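
If you only want a quick check rather than a sweep over every saved model, the same options can be pointed at a single checkpoint (a sketch reusing the flags above; the `hyp_latest` output name is just illustrative):

```bash
# Translate with the most recently written checkpoint only
latest=$(ls -t data/wmt/run/model_step*.pt | head -n 1)
onmt_translate -gpu 0 -beam_size 5 -batch_size 16384 -batch_type tokens \
    -model "$latest" \
    -src data/wmt/test.en.sp \
    -output data/wmt/test.de.hyp_latest.sp
```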
|
|
|
Prior to evaluation, we need to detokenize the hypotheses: |
|
|
|
```bash
for checkpoint in data/wmt/run/model_step*.pt; do
    base=$(basename $checkpoint)
    spm_decode \
        --model=data/wmt/wmtende.model \
        --input_format=piece \
        < data/wmt/test.de.hyp_${base%.*}.sp \
        > data/wmt/test.de.hyp_${base%.*}
done
```
|
|
|
|
|
Finally, we can compute detokenized BLEU with `sacrebleu`: |
|
|
|
```bash
for checkpoint in data/wmt/run/model_step*.pt; do
    echo "$checkpoint"
    base=$(basename $checkpoint)
    sacrebleu data/wmt/test.de < data/wmt/test.de.hyp_${base%.*}
done
```
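
To compare checkpoints at a glance, the scores can also be collected into a single sorted report (a sketch; `--score-only` prints only the BLEU value in recent sacrebleu versions, and `bleu_report.tsv` is an arbitrary output name):

```bash
for checkpoint in data/wmt/run/model_step*.pt; do
    base=$(basename $checkpoint)
    score=$(sacrebleu --score-only data/wmt/test.de < data/wmt/test.de.hyp_${base%.*})
    printf '%s\t%s\n' "$base" "$score"
done | sort -t$'\t' -k2 -nr | tee data/wmt/bleu_report.tsv
```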
|
|