|
# An Example of an English-to-Japanese Simultaneous Translation System
|
|
|
This is an example of training and evaluating a Transformer *wait-k* English-to-Japanese simultaneous text-to-text translation model.
|
|
|
## Data Preparation |
|
This section describes the data preparation for training and evaluation.
|
If you only want to evaluate the model, please skip ahead to [Inference & Evaluation](#inference--evaluation).
|
|
|
For illustration, we only use the following subsets of the available data from the [WMT20 news translation task](http://www.statmt.org/wmt20/translation-task.html), which results in 7,815,391 sentence pairs (a sketch for concatenating the corpora follows the list):
|
- News Commentary v16 |
|
- Wiki Titles v3 |
|
- WikiMatrix v1
|
- Japanese-English Subtitle Corpus |
|
- The Kyoto Free Translation Task Corpus |
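
A hypothetical sketch of assembling the training bitext, assuming each corpus has already been downloaded and extracted to plain-text `${RAW}/<name>.{en,ja}` files (all file names below are placeholders, not names used by WMT's distributions):

```bash
# Placeholder corpus file names; adjust to match your extraction.
# (The dev/test sides, e.g. the WMT20 dev data, would be prepared the same way.)
for lang in en ja; do
    cat ${RAW}/news-commentary-v16.${lang} \
        ${RAW}/wikititles-v3.${lang} \
        ${RAW}/wikimatrix-v1.${lang} \
        ${RAW}/jesc.${lang} \
        ${RAW}/kftt.${lang} \
        > ${DATA_DIR}/train.raw.${lang}
done
```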
|
|
|
We use the WMT20 development data as the development set. Training a `transformer_vaswani_wmt_en_de_big` model on this amount of data yields 17.3 BLEU with greedy search and 19.7 with beam search (beam size 10). Note that better performance can be achieved with the full WMT training data.
|
|
|
We use the [SentencePiece](https://github.com/google/sentencepiece) toolkit to tokenize the data with a vocabulary size of 32000.
|
Additionally, we filter out sentence pairs in which either side is longer than 200 tokens after tokenization; both steps are sketched below.
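
A minimal sketch of the tokenization and filtering steps using SentencePiece's `spm_train`/`spm_encode` command-line tools; the raw-file layout and the shared (joint English-Japanese) vocabulary are assumptions made for illustration:

```bash
# Train a single shared 32k SentencePiece model on the raw training bitext
# (assumed files: ${DATA_DIR}/train.raw.en and ${DATA_DIR}/train.raw.ja)
spm_train --input=${DATA_DIR}/train.raw.en,${DATA_DIR}/train.raw.ja \
    --model_prefix=${DATA_DIR}/spm_enja --vocab_size=32000

# Tokenize both sides of every split
for split in train dev test; do
    for lang in en ja; do
        spm_encode --model=${DATA_DIR}/spm_enja.model \
            < ${DATA_DIR}/${split}.raw.${lang} \
            > ${DATA_DIR}/${split}.${lang}
    done
done

# Drop training pairs in which either side exceeds 200 tokens
paste ${DATA_DIR}/train.en ${DATA_DIR}/train.ja \
    | awk -F'\t' 'split($1, s, " ") <= 200 && split($2, t, " ") <= 200' \
    > ${DATA_DIR}/train.tsv
cut -f1 ${DATA_DIR}/train.tsv > ${DATA_DIR}/train.en
cut -f2 ${DATA_DIR}/train.tsv > ${DATA_DIR}/train.ja
rm ${DATA_DIR}/train.tsv
```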
|
Assuming the tokenized text data is saved at `${DATA_DIR}`, we binarize it with the following command.
|
|
|
```bash
fairseq-preprocess \
    --source-lang en --target-lang ja \
    --trainpref ${DATA_DIR}/train \
    --validpref ${DATA_DIR}/dev \
    --testpref ${DATA_DIR}/test \
    --destdir ${WMT20_ENJA_DATA_BIN} \
    --nwordstgt 32000 --nwordssrc 32000 \
    --workers 20
```
|
|
|
## Simultaneous Translation Model Training |
|
To train a wait-k model with `k=10` (the decoder first reads 10 source tokens, then alternates between writing one target token and reading one more source token):
|
```bash
fairseq-train ${WMT20_ENJA_DATA_BIN} \
    --save-dir ${SAVEDIR} \
    --simul-type waitk \
    --waitk-lagging 10 \
    --max-epoch 70 \
    --arch transformer_monotonic_vaswani_wmt_en_de_big \
    --optimizer adam \
    --adam-betas '(0.9, 0.98)' \
    --lr-scheduler inverse_sqrt \
    --warmup-init-lr 1e-07 \
    --warmup-updates 4000 \
    --lr 0.0005 \
    --stop-min-lr 1e-09 \
    --clip-norm 10.0 \
    --dropout 0.3 \
    --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --max-tokens 3584
```
|
This command is for training on 8 GPUs. Equivalently, the model can be trained on a single GPU with `--update-freq 8`, which accumulates gradients over 8 batches to keep the effective batch size unchanged.
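
For concreteness, a sketch of the single-GPU variant (the same flags as above with gradient accumulation added; pinning the device via `CUDA_VISIBLE_DEVICES` is just one way to do it):

```bash
# Single-GPU training with gradient accumulation over 8 batches
CUDA_VISIBLE_DEVICES=0 fairseq-train ${WMT20_ENJA_DATA_BIN} \
    --save-dir ${SAVEDIR} \
    --simul-type waitk \
    --waitk-lagging 10 \
    --max-epoch 70 \
    --arch transformer_monotonic_vaswani_wmt_en_de_big \
    --optimizer adam \
    --adam-betas '(0.9, 0.98)' \
    --lr-scheduler inverse_sqrt \
    --warmup-init-lr 1e-07 \
    --warmup-updates 4000 \
    --lr 0.0005 \
    --stop-min-lr 1e-09 \
    --clip-norm 10.0 \
    --dropout 0.3 \
    --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --max-tokens 3584 \
    --update-freq 8
```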
|
|
|
## Inference & Evaluation |
|
First, install [SimulEval](https://github.com/facebookresearch/SimulEval) for evaluation.
|
|
|
```bash
git clone https://github.com/facebookresearch/SimulEval.git
cd SimulEval
pip install -e .
```
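
If the installation succeeded, the `simuleval` entry point should now be on your `PATH`; printing its usage is a quick, side-effect-free sanity check:

```bash
# Should print SimulEval's usage message and exit
simuleval --help
```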
|
|
|
The following command runs the evaluation, assuming the source and reference files are `${SRC_FILE}` and `${TGT_FILE}` and the SentencePiece model file for English is saved at `${SRC_SPM_PATH}`:
|
|
|
|
|
```bash
simuleval \
    --source ${SRC_FILE} \
    --target ${TGT_FILE} \
    --data-bin ${WMT20_ENJA_DATA_BIN} \
    --sacrebleu-tokenizer ja-mecab \
    --eval-latency-unit char \
    --no-space \
    --src-splitter-type sentencepiecemodel \
    --src-splitter-path ${SRC_SPM_PATH} \
    --agent ${FAIRSEQ}/examples/simultaneous_translation/agents/simul_trans_text_agent_enja.py \
    --model-path ${SAVE_DIR}/${CHECKPOINT_FILENAME} \
    --output ${OUTPUT} \
    --scores
```
|
|
|
The `--data-bin` directory should be the same as in the previous sections if you prepared the data from scratch.
|
If you only want to run the evaluation, a prepared data directory can be found [here](https://dl.fbaipublicfiles.com/simultaneous_translation/wmt20_enja_medium_databin.tgz) and a pretrained wait-k (k=10) checkpoint can be downloaded from [here](https://dl.fbaipublicfiles.com/simultaneous_translation/wmt20_enja_medium_wait10_ckpt.pt); a download sketch follows.
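
A minimal sketch for fetching both artifacts; the extracted directory name and the environment-variable assignments are assumptions, so adjust them to your setup:

```bash
# Fetch and unpack the prepared data binary
wget https://dl.fbaipublicfiles.com/simultaneous_translation/wmt20_enja_medium_databin.tgz
tar -xzf wmt20_enja_medium_databin.tgz
# Assumption: the archive extracts to ./wmt20_enja_medium_databin
export WMT20_ENJA_DATA_BIN=$(pwd)/wmt20_enja_medium_databin

# Fetch the pretrained wait-10 checkpoint
wget https://dl.fbaipublicfiles.com/simultaneous_translation/wmt20_enja_medium_wait10_ckpt.pt
export SAVE_DIR=$(pwd)
export CHECKPOINT_FILENAME=wmt20_enja_medium_wait10_ckpt.pt
```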
|
|
|
The output should look like this: |
|
```bash
{
    "Quality": {
        "BLEU": 11.442253287568398
    },
    "Latency": {
        "AL": 8.6587861866951,
        "AP": 0.7863304776251316,
        "DAL": 9.477850951194764
    }
}
```
|
Latency is measured in characters on the target side (`--eval-latency-unit char`). Translation quality is measured by BLEU, computed with `sacrebleu` using the `MeCab` Japanese tokenizer (`--sacrebleu-tokenizer ja-mecab`). `--no-space` indicates that no space should be inserted when merging the predicted subwords into output text.
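
For reference, the two simpler latency metrics in the output above have standard definitions in the literature (AP from Cho and Esipova, 2016; AL from the wait-k paper of Ma et al., 2019; DAL is Cherry and Foster's differentiable variant of AL). The following is a summary of those definitions, not a transcription of SimulEval's code. Writing $g(t)$ for the number of source units read when the $t$-th target unit is emitted:

```latex
% Average Proportion: mean fraction of the source read per target unit
\mathrm{AP} = \frac{1}{|\mathbf{x}|\,|\mathbf{y}|} \sum_{t=1}^{|\mathbf{y}|} g(t)

% Average Lagging: mean lag behind an ideal simultaneous translator that
% emits at rate \gamma = |\mathbf{y}|/|\mathbf{x}|, averaged up to the first
% step \tau at which the full source has been read
\mathrm{AL} = \frac{1}{\tau} \sum_{t=1}^{\tau} \left( g(t) - \frac{t-1}{\gamma} \right),
\qquad \tau = \min \{\, t : g(t) = |\mathbf{x}| \,\}
```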
|
|
|
If the `--output ${OUTPUT}` option is used, detailed logs and scores will be stored under the `${OUTPUT}` directory.
|
|