
An Example of an English-to-Japanese Simultaneous Translation System

This is an example of training and evaluating a transformer wait-k English to Japanese simultaneous text-to-text translation model.

Data Preparation

This section describes data preparation for training and evaluation. If you only want to evaluate the model, skip to the Inference & Evaluation section.

For illustration, we use only the following subsets of the data available from the WMT20 news translation task, which amount to 7,815,391 sentence pairs:

  • News Commentary v16
  • Wiki Titles v3
  • WikiMatrix V1
  • Japanese-English Subtitle Corpus
  • The Kyoto Free Translation Task Corpus

We use the WMT20 development data as the development set. Training a transformer_vaswani_wmt_en_de_big model on this amount of data yields 17.3 BLEU with greedy search and 19.7 BLEU with beam search (beam size 10). Better performance can be achieved with the full WMT training data.

We use the sentencepiece toolkit to tokenize the data with a vocabulary size of 32000. Additionally, we filter out sentences longer than 200 tokens after tokenization. Assuming the tokenized text data is saved at ${DATA_DIR}, we prepare the binarized data with the following command.

fairseq-preprocess \
    --source-lang en --target-lang ja \
    --trainpref ${DATA_DIR}/train \
    --validpref ${DATA_DIR}/dev \
    --testpref ${DATA_DIR}/test \
    --destdir ${WMT20_ENJA_DATA_BIN} \
    --nwordstgt 32000 --nwordssrc 32000 \
    --workers 20
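
The length filter mentioned above is applied to the tokenized text before binarization. The following is a minimal Python sketch of such a filter, not the script we actually used. It assumes one whitespace-separated tokenized sentence per line and that a pair is dropped when either side exceeds the limit; the function name filter_long_pairs is hypothetical and is not part of fairseq or sentencepiece.

```python
from typing import Iterable, Iterator, Tuple

MAX_LEN = 200  # maximum number of tokens allowed per sentence


def filter_long_pairs(
    pairs: Iterable[Tuple[str, str]], max_len: int = MAX_LEN
) -> Iterator[Tuple[str, str]]:
    """Yield only the (source, target) pairs whose two sides both have
    at most `max_len` whitespace-separated tokens.

    Assumption: a pair is discarded as soon as one side is too long.
    """
    for src, tgt in pairs:
        if len(src.split()) <= max_len and len(tgt.split()) <= max_len:
            yield src, tgt


if __name__ == "__main__":
    # Toy data: the second pair has a 201-token source and is dropped.
    toy = [
        ("▁hello ▁world", "▁こんにちは ▁世界"),
        (" ".join(["▁a"] * 201), "▁あ"),
    ]
    for src, tgt in filter_long_pairs(toy):
        print(src, "|||", tgt)
```

In practice the pairs would be read from the tokenized English and Japanese files line by line and the surviving pairs written back to ${DATA_DIR}.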

Simultaneous Translation Model Training

The following command trains a wait-k model with k=10:

fairseq-train ${WMT20_ENJA_DATA_BIN}  \
    --save-dir ${SAVE_DIR}  \
    --simul-type waitk  \
    --waitk-lagging 10  \
    --max-epoch 70  \
    --arch transformer_monotonic_vaswani_wmt_en_de_big \
    --optimizer adam  \
    --adam-betas '(0.9, 0.98)'  \
    --lr-scheduler inverse_sqrt  \
    --warmup-init-lr 1e-07  \
    --warmup-updates 4000  \
    --lr 0.0005  \
    --stop-min-lr 1e-09  \
    --clip-norm 10.0  \
    --dropout 0.3  \
    --weight-decay 0.0  \
    --criterion label_smoothed_cross_entropy  \
    --label-smoothing 0.1  \
    --max-tokens 3584

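This command is for training on 8 GPUs. Equivalently, the model can be trained on one GPU by adding --update-freq 8, which accumulates gradients over 8 batches before each parameter update so that the effective batch size stays the same.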
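
For readers unfamiliar with wait-k, the read/write schedule that --waitk-lagging controls can be sketched in a few lines of Python. This is an idealized illustration, not the fairseq implementation: it assumes the target length is known in advance and counts abstract tokens, whereas the real model decides when to stop and operates on subword units. The function names waitk_schedule and waitk_delays are hypothetical.

```python
from typing import List


def waitk_schedule(src_len: int, tgt_len: int, k: int) -> List[str]:
    """Return the READ/WRITE action sequence of an idealized wait-k policy.

    The policy first reads k source tokens, then alternates between
    writing one target token and reading one source token. Once the
    source is exhausted, it writes the remaining target tokens.
    """
    actions: List[str] = []
    read = written = 0
    while written < tgt_len:
        # Source tokens required before the next target token may be written.
        needed = min(k + written, src_len)
        if read < needed:
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions


def waitk_delays(src_len: int, tgt_len: int, k: int) -> List[int]:
    """Number of source tokens read when each target token is written."""
    return [min(k + t, src_len) for t in range(tgt_len)]


if __name__ == "__main__":
    print(waitk_schedule(5, 4, 2))
    print(waitk_delays(5, 4, 2))  # [2, 3, 4, 5]
```

A larger k lets the model see more source context before each prediction, which typically trades higher latency for better quality.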

Inference & Evaluation

First of all, install SimulEval for evaluation.

git clone https://github.com/facebookresearch/SimulEval.git
cd SimulEval
pip install -e .

The following command runs the evaluation. It assumes that the source and reference files are ${SRC_FILE} and ${TGT_FILE}, and that the sentencepiece model file for English is saved at ${SRC_SPM_PATH}.

simuleval \
    --source ${SRC_FILE} \
    --target ${TGT_FILE} \
    --data-bin ${WMT20_ENJA_DATA_BIN} \
    --sacrebleu-tokenizer ja-mecab \
    --eval-latency-unit char \
    --no-space \
    --src-splitter-type sentencepiecemodel \
    --src-splitter-path ${SRC_SPM_PATH} \
    --agent ${FAIRSEQ}/examples/simultaneous_translation/agents/simul_trans_text_agent_enja.py \
    --model-path ${SAVE_DIR}/${CHECKPOINT_FILENAME} \
    --output ${OUTPUT} \
    --scores

The --data-bin directory should be the same as in the previous sections if you prepare the data from scratch. If you only want to run the evaluation, a prepared data directory can be found here and a pretrained checkpoint (wait-k=10 model) can be downloaded from here.

The output should look like this:

{
    "Quality": {
        "BLEU": 11.442253287568398
    },
    "Latency": {
        "AL": 8.6587861866951,
        "AP": 0.7863304776251316,
        "DAL": 9.477850951194764
    }
}

Latency is measured in characters on the target side (--eval-latency-unit char). Translation quality is evaluated with sacrebleu using the MeCab tokenizer (--sacrebleu-tokenizer ja-mecab). --no-space indicates that no space is added when merging the predicted words.
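
To give an intuition for the AL and AP numbers in the output, here is a minimal Python sketch based on the commonly used definitions of Average Lagging and Average Proportion. It is not SimulEval's code, and SimulEval may differ in details such as which target length it uses for normalization. The input is assumed to be a list with one delay per target unit (a character in the setup above), where each delay is the number of source units read when that target unit was emitted. The function names are hypothetical, and DAL is not covered.

```python
from typing import Sequence


def average_proportion(delays: Sequence[int], src_len: int) -> float:
    """AP: mean delay per target unit, normalized by the source length."""
    if not delays or src_len <= 0:
        raise ValueError("need at least one delay and a positive source length")
    return sum(delays) / (src_len * len(delays))


def average_lagging(delays: Sequence[int], src_len: int) -> float:
    """AL: average number of source units the system lags behind an ideal
    translator, computed up to the first target unit that is emitted
    after the whole source has been read."""
    if not delays or src_len <= 0:
        raise ValueError("need at least one delay and a positive source length")
    gamma = len(delays) / src_len  # target-to-source length ratio
    total = 0.0
    for t, g in enumerate(delays):  # t is 0-based, so (t - 1) becomes t
        total += g - t / gamma
        if g >= src_len:
            # Cut-off step: the full source has been read.
            return total / (t + 1)
    # Fallback if the source was never fully read.
    return total / len(delays)


if __name__ == "__main__":
    # Delays of an idealized wait-3 policy with 6 source and 6 target units.
    delays = [3, 4, 5, 6, 6, 6]
    print(average_lagging(delays, 6))     # 3.0
    print(average_proportion(delays, 6))  # 30 / 36, about 0.833
```

With equal source and target lengths, the AL of an idealized wait-k policy equals k, which is why AL is a convenient way to read latency in units of "tokens behind".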

If the --output ${OUTPUT} option is used, the detailed log and scores are stored under the ${OUTPUT} directory.