|
# LASER: Language-Agnostic SEntence Representations
|
|
|
LASER is a library to calculate and use multilingual sentence embeddings. |
|
|
|
You can find more information about LASER and how to use it on the official [LASER repository](https://github.com/facebookresearch/LASER). |
|
|
|
This folder contains source code for training LASER embeddings. |
|
|
|
|
|
## Prepare data and configuration file |
|
|
|
Binarize your data with fairseq, as described [here](https://fairseq.readthedocs.io/en/latest/getting_started.html#data-pre-processing). |
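Once the data is binarized, it can help to sanity-check the output before training. Below is a minimal sketch using fairseq's `Dictionary` and `data_utils.load_indexed_dataset` helpers; all paths and language names are placeholders matching the hypothetical pairs used in the config below:

```
# Sanity-check a binarized fairseq dataset before training (hypothetical paths).
from fairseq.data import Dictionary, data_utils

# Dictionary written by fairseq-preprocess for the source language.
src_dict = Dictionary.load("/path/to/bin/dict.srclang1.txt")

# The path is the prefix shared by the .bin/.idx file pair.
dataset = data_utils.load_indexed_dataset(
    "/path/to/bin/train.srclang1-tgtlang0.srclang1", src_dict
)
assert dataset is not None, "no .bin/.idx files found at that prefix"

print(f"{len(dataset)} sentences")
print(src_dict.string(dataset[0]))  # decode the first sentence back into pieces
```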
|
|
|
Create a JSON config file in the following format:
|
``` |
|
{ |
|
"src_vocab": "/path/to/spm.src.cvocab", |
|
"tgt_vocab": "/path/to/spm.tgt.cvocab", |
|
"train": [ |
|
{ |
|
"type": "translation", |
|
"id": 0, |
|
"src": "/path/to/srclang1-tgtlang0/train.srclang1", |
|
"tgt": "/path/to/srclang1-tgtlang0/train.tgtlang0" |
|
}, |
|
{ |
|
"type": "translation", |
|
"id": 1, |
|
"src": "/path/to/srclang1-tgtlang1/train.srclang1", |
|
"tgt": "/path/to/srclang1-tgtlang1/train.tgtlang1" |
|
}, |
|
{ |
|
"type": "translation", |
|
"id": 0, |
|
"src": "/path/to/srclang2-tgtlang0/train.srclang2", |
|
"tgt": "/path/to/srclang2-tgtlang0/train.tgtlang0" |
|
}, |
|
{ |
|
"type": "translation", |
|
"id": 1, |
|
"src": "/path/to/srclang2-tgtlang1/train.srclang2", |
|
"tgt": "/path/to/srclang2-tgtlang1/train.tgtlang1" |
|
}, |
|
... |
|
], |
|
"valid": [ |
|
{ |
|
"type": "translation", |
|
"id": 0, |
|
"src": "/unused", |
|
"tgt": "/unused" |
|
} |
|
] |
|
} |
|
``` |
|
where each path points to a binarized, indexed fairseq dataset (the prefix shared by the `.bin`/`.idx` file pair).

`id` represents the target language id; it selects the target language embedding used by the decoder (cf. `--decoder-lang-embed-dim` in the training command below).
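With many language pairs, writing this file by hand is tedious and error-prone. Here is a minimal sketch that generates it, assuming the pair layout shown above; the language names, paths, and output file name are placeholders:

```
import json

# Hypothetical setup: two source languages, each paired with two target languages.
src_langs = ["srclang1", "srclang2"]
tgt_langs = ["tgtlang0", "tgtlang1"]  # position in this list is the target language id
data_root = "/path/to"

config = {
    "src_vocab": f"{data_root}/spm.src.cvocab",
    "tgt_vocab": f"{data_root}/spm.tgt.cvocab",
    "train": [
        {
            "type": "translation",
            "id": tgt_id,
            "src": f"{data_root}/{src}-{tgt}/train.{src}",
            "tgt": f"{data_root}/{src}-{tgt}/train.{tgt}",
        }
        for src in src_langs
        for tgt_id, tgt in enumerate(tgt_langs)
    ],
    # The valid entries are required by the format but unused with --disable-validation.
    "valid": [{"type": "translation", "id": 0, "src": "/unused", "tgt": "/unused"}],
}

with open("laser_config.json", "w") as f:
    json.dump(config, f, indent=2)
```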
|
|
|
|
|
## Training Command Line Example |
|
|
|
``` |
|
fairseq-train \ |
|
/path/to/configfile_described_above.json \ |
|
--user-dir examples/laser/laser_src \ |
|
--log-interval 100 --log-format simple \ |
|
--task laser --arch laser_lstm \ |
|
--save-dir . \ |
|
--optimizer adam \ |
|
--lr 0.001 \ |
|
--lr-scheduler inverse_sqrt \ |
|
--clip-norm 5 \ |
|
--warmup-updates 90000 \ |
|
--update-freq 2 \ |
|
--dropout 0.0 \ |
|
--encoder-dropout-out 0.1 \ |
|
--max-tokens 2000 \ |
|
--max-epoch 50 \ |
|
--encoder-bidirectional \ |
|
--encoder-layers 5 \ |
|
--encoder-hidden-size 512 \ |
|
--decoder-layers 1 \ |
|
--decoder-hidden-size 2048 \ |
|
--encoder-embed-dim 320 \ |
|
--decoder-embed-dim 320 \ |
|
--decoder-lang-embed-dim 32 \ |
|
--warmup-init-lr 0.001 \ |
|
--disable-validation |
|
``` |
|
|
|
|
|
## Applications |
|
|
|
We showcase several applications of multilingual sentence embeddings |
|
with code to reproduce our results (in the `tasks` directory of the LASER repository).
|
|
|
* [**Cross-lingual document classification**](https://github.com/facebookresearch/LASER/tree/master/tasks/mldoc) using the |
|
[*MLDoc*](https://github.com/facebookresearch/MLDoc) corpus [2,6] |
|
* [**WikiMatrix**](https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix) |
|
Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia [7] |
|
* [**Bitext mining**](https://github.com/facebookresearch/LASER/tree/master/tasks/bucc) using the |
|
[*BUCC*](https://comparable.limsi.fr/bucc2018/bucc2018-task.html) corpus [3,5] |
|
* [**Cross-lingual NLI**](https://github.com/facebookresearch/LASER/tree/master/tasks/xnli) |
|
using the [*XNLI*](https://www.nyu.edu/projects/bowman/xnli/) corpus [4,5,6] |
|
* [**Multilingual similarity search**](https://github.com/facebookresearch/LASER/tree/master/tasks/similarity) [1,6]

* [**Sentence embedding of text files**](https://github.com/facebookresearch/LASER/tree/master/tasks/embed)

  an example of how to calculate sentence embeddings for arbitrary text files in any of the supported languages; a minimal similarity sketch follows this list.
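To make the similarity-search idea concrete, here is a minimal sketch over precomputed embedding files; it assumes embeddings stored as raw float32 matrices of dimension 1024 (which matches the bidirectional 512-unit encoder above), and the file names are placeholders:

```
import numpy as np

DIM = 1024  # assumed embedding dimension (2 x 512 bidirectional encoder states)

def load_embeddings(path):
    # Embeddings are assumed stored as a flat float32 buffer; reshape into a matrix.
    return np.fromfile(path, dtype=np.float32).reshape(-1, DIM)

def cosine_similarity(a, b):
    # Normalize rows, then a matrix product gives all pairwise cosines.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Hypothetical embedding files for the same sentences in two languages.
en = load_embeddings("sentences.en.emb")
fr = load_embeddings("sentences.fr.emb")

# For each English sentence, the index of the most similar French sentence.
print(cosine_similarity(en, fr).argmax(axis=1))
```

The actual mining tasks use FAISS indexes and margin-based scoring [5] rather than brute-force cosine similarity.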
|
|
|
**For all tasks, we use exactly the same multilingual encoder, without any task specific optimization or fine-tuning.** |
|
|
|
|
|
|
|
## References |
|
|
|
[1] Holger Schwenk and Matthijs Douze, |
|
[*Learning Joint Multilingual Sentence Representations with Neural Machine Translation*](https://aclanthology.info/papers/W17-2619/w17-2619), |
|
ACL workshop on Representation Learning for NLP, 2017.
|
|
|
[2] Holger Schwenk and Xian Li, |
|
[*A Corpus for Multilingual Document Classification in Eight Languages*](http://www.lrec-conf.org/proceedings/lrec2018/pdf/658.pdf), |
|
LREC, pages 3548-3551, 2018. |
|
|
|
[3] Holger Schwenk, |
|
[*Filtering and Mining Parallel Data in a Joint Multilingual Space*](http://aclweb.org/anthology/P18-2037),

ACL, July 2018.
|
|
|
[4] Alexis Conneau, Guillaume Lample, Ruty Rinott, Adina Williams, Samuel R. Bowman, Holger Schwenk and Veselin Stoyanov, |
|
[*XNLI: Cross-lingual Sentence Understanding through Inference*](https://aclweb.org/anthology/D18-1269), |
|
EMNLP, 2018. |
|
|
|
[5] Mikel Artetxe and Holger Schwenk, |
|
[*Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings*](https://arxiv.org/abs/1811.01136),
|
arXiv, Nov 3 2018. |
|
|
|
[6] Mikel Artetxe and Holger Schwenk, |
|
[*Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond*](https://arxiv.org/abs/1812.10464),
|
arXiv, Dec 26 2018. |
|
|
|
[7] Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong and Paco Guzman, |
|
[*WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia*](https://arxiv.org/abs/1907.05791),
|
arXiv, July 11 2019. |
|
|
|
[8] Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave and Armand Joulin,

[*CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB*](https://arxiv.org/abs/1911.04944),

arXiv, Nov 2019.
|
|