	# Cross-lingual Retrieval for Iterative Self-Supervised Training

	https://arxiv.org/pdf/2006.09526.pdf

	## Introduction

	CRISS is a multilingual sequence-to-sequence pretraining method in which mining and training are applied iteratively, improving cross-lingual alignment and translation ability at the same time.

	## Requirements

	* faiss: https://github.com/facebookresearch/faiss
	* mosesdecoder: https://github.com/moses-smt/mosesdecoder
	* flores: https://github.com/facebookresearch/flores
	* LASER: https://github.com/facebookresearch/LASER

	## Unsupervised Machine Translation
	##### 1. Download and decompress CRISS checkpoints
	```
	cd examples/criss
	wget https://dl.fbaipublicfiles.com/criss/criss_3rd_checkpoints.tar.gz
	tar -xf criss_3rd_checkpoints.tar.gz
	```
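The archive name passed to `tar` has to match the file that `wget` actually saved. A minimal sketch of one way to keep the two in sync is to derive the file name from the URL. The helpers `archive_name_from_url` and `fetch_criss_checkpoints` are hypothetical and not part of the fairseq scripts:

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of fairseq): keep the download name and
# the extraction name in sync by deriving both from the URL.
CRISS_URL="https://dl.fbaipublicfiles.com/criss/criss_3rd_checkpoints.tar.gz"

# Print the last path component of a URL, i.e. the file name wget saves.
archive_name_from_url() {
  printf '%s\n' "${1##*/}"
}

# Download the archive unless it is already present, then extract it.
fetch_criss_checkpoints() {
  local archive
  archive="$(archive_name_from_url "$CRISS_URL")"
  [ -f "$archive" ] || wget "$CRISS_URL" || return 1
  tar -xf "$archive"
}
```

Calling `fetch_criss_checkpoints` from examples/criss would perform the download and extraction; a repeated call skips the download if the archive is already present.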
	##### 2. Download and preprocess Flores test dataset
	Make sure to run all scripts from the examples/criss directory.
	```
	bash download_and_preprocess_flores_test.sh
	```
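Since the scripts are expected to be launched from examples/criss, a small guard can catch a wrong working directory before anything runs. This is only a sketch; `in_criss_dir` and `run_from_criss_dir` are hypothetical names, not part of fairseq:

```shell
#!/usr/bin/env bash
# Hypothetical guard (not part of fairseq): succeed only when the given
# directory (default: the current one) is an examples/criss directory.
in_criss_dir() {
  case "${1:-$PWD}" in
    */examples/criss) return 0 ;;
    *) return 1 ;;
  esac
}

# Run a script only from the expected directory; otherwise explain and fail.
run_from_criss_dir() {
  if ! in_criss_dir; then
    echo "error: run this from examples/criss (current: $PWD)" >&2
    return 1
  fi
  bash "$@"
}
```

For example, `run_from_criss_dir download_and_preprocess_flores_test.sh` would run the script only when the working directory ends in examples/criss.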

	##### 3. Run Evaluation on Sinhala-English
	```
	bash unsupervised_mt/eval.sh
	```

	## Sentence Retrieval
	##### 1. Download and preprocess Tatoeba dataset
	```
	bash download_and_preprocess_tatoeba.sh
	```

	##### 2. Run Sentence Retrieval on Tatoeba Kazakh-English
	```
	bash sentence_retrieval/sentence_retrieval_tatoeba.sh
	```

	## Mining
	##### 1. Install faiss
	Follow the installation instructions at https://github.com/facebookresearch/faiss/blob/master/INSTALL.md
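After installing, it can help to confirm that the Python bindings are importable before starting a mining run. The sketch below assumes the interpreter is `python3` (adjust if your environment uses a different one); `has_python_module` is a hypothetical helper, not part of faiss or fairseq:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of fairseq or faiss): return success if the
# named Python module can be imported by the given interpreter.
has_python_module() {
  local module="$1" python="${2:-python3}"
  "$python" -c "import $module" >/dev/null 2>&1
}

# Assumption: the interpreter used for mining is python3.
if has_python_module faiss; then
  echo "faiss is importable"
else
  echo "faiss not found; see the faiss installation instructions" >&2
fi
```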
	##### 2. Mine pseudo-parallel data between Kazakh and English
	```
	bash mining/mine_example.sh
	```

	## Citation
	```bibtex
	@article{tran2020cross,
	title={Cross-lingual retrieval for iterative self-supervised training},
	author={Tran, Chau and Tang, Yuqing and Li, Xian and Gu, Jiatao},
	journal={arXiv preprint arXiv:2006.09526},
	year={2020}
	}
	```