# Cross-lingual Retrieval for Iterative Self-Supervised Training

https://arxiv.org/pdf/2006.09526.pdf

## Introduction

CRISS is a multilingual sequence-to-sequence pretraining method in which mining and training are applied iteratively, improving cross-lingual alignment and translation ability at the same time.

## Requirements:

* faiss: https://github.com/facebookresearch/faiss
* mosesdecoder: https://github.com/moses-smt/mosesdecoder
* flores: https://github.com/facebookresearch/flores
* LASER: https://github.com/facebookresearch/LASER

## Unsupervised Machine Translation
##### 1. Download and decompress CRISS checkpoints
```
cd examples/criss
wget https://dl.fbaipublicfiles.com/criss/criss_3rd_checkpoints.tar.gz
tar -xf criss_3rd_checkpoints.tar.gz
```
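
The file name passed to `tar` has to match the file that `wget` saves. As a minimal sketch, the name can be derived from the URL so the two cannot drift apart. The names `url`, `archive`, and `fetch_criss` are illustrative and not part of the repository:

```shell
# Illustrative helper; these names are assumptions, not part of the repository.
url="https://dl.fbaipublicfiles.com/criss/criss_3rd_checkpoints.tar.gz"

# Strip everything up to and including the last slash to get the file name.
archive="${url##*/}"

fetch_criss() {
  # Download the archive, then unpack it into the current directory.
  # tar only runs if wget succeeds.
  wget "$url" && tar -xf "$archive"
}

echo "$archive"   # prints criss_3rd_checkpoints.tar.gz
```

Calling `fetch_criss` from `examples/criss` should then be equivalent to the `wget` and `tar` commands above.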

##### 2. Download and preprocess Flores test dataset
Make sure to run all scripts from the `examples/criss` directory.
```
bash download_and_preprocess_flores_test.sh
```
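
Because the scripts are meant to be run from `examples/criss`, a small guard can catch a wrong working directory before anything is downloaded. The function below is a hypothetical sketch, not something shipped with the repository. It only inspects the path string, and it assumes an absolute path with no trailing slash:

```shell
# Hypothetical guard, not part of the repository.
# Succeeds when the given path (default: the current directory) ends in
# "/examples/criss"; it checks the path string only, not the directory contents.
in_criss_dir() {
  dir="${1:-$PWD}"
  case "$dir" in
    */examples/criss) return 0 ;;
    *) return 1 ;;
  esac
}
```

One possible use is `in_criss_dir || echo "run this from examples/criss" >&2` at the top of a wrapper script.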

##### 3. Run Evaluation on Sinhala-English
```
bash unsupervised_mt/eval.sh
```

## Sentence Retrieval
##### 1. Download and preprocess Tatoeba dataset
```
bash download_and_preprocess_tatoeba.sh
```

##### 2. Run Sentence Retrieval on Tatoeba Kazakh-English
```
bash sentence_retrieval/sentence_retrieval_tatoeba.sh
```

## Mining
##### 1. Install faiss
Follow the instructions at https://github.com/facebookresearch/faiss/blob/master/INSTALL.md
##### 2. Mine pseudo-parallel data between Kazakh and English
```
bash mining/mine_example.sh
```

## Citation
```bibtex
@article{tran2020cross,
  title={Cross-lingual retrieval for iterative self-supervised training},
  author={Tran, Chau and Tang, Yuqing and Li, Xian and Gu, Jiatao},
  journal={arXiv preprint arXiv:2006.09526},
  year={2020}
}
```