File size: 5,305 Bytes
26fd00c |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 |
# LASER Language-Agnostic SEntence Representations
LASER is a library to calculate and use multilingual sentence embeddings.
You can find more information about LASER and how to use it on the official [LASER repository](https://github.com/facebookresearch/LASER).
This folder contains source code for training LASER embeddings.
## Prepare data and configuration file
Binarize your data with fairseq, as described [here](https://fairseq.readthedocs.io/en/latest/getting_started.html#data-pre-processing).
Create a json config file with this format:
```
{
"src_vocab": "/path/to/spm.src.cvocab",
"tgt_vocab": "/path/to/spm.tgt.cvocab",
"train": [
{
"type": "translation",
"id": 0,
"src": "/path/to/srclang1-tgtlang0/train.srclang1",
"tgt": "/path/to/srclang1-tgtlang0/train.tgtlang0"
},
{
"type": "translation",
"id": 1,
"src": "/path/to/srclang1-tgtlang1/train.srclang1",
"tgt": "/path/to/srclang1-tgtlang1/train.tgtlang1"
},
{
"type": "translation",
"id": 0,
"src": "/path/to/srclang2-tgtlang0/train.srclang2",
"tgt": "/path/to/srclang2-tgtlang0/train.tgtlang0"
},
{
"type": "translation",
"id": 1,
"src": "/path/to/srclang2-tgtlang1/train.srclang2",
"tgt": "/path/to/srclang2-tgtlang1/train.tgtlang1"
},
...
],
"valid": [
{
"type": "translation",
"id": 0,
"src": "/unused",
"tgt": "/unused"
}
]
}
```
where paths are paths to binarized indexed fairseq dataset files.
`id` represents the target language id.
## Training Command Line Example
```
fairseq-train \
/path/to/configfile_described_above.json \
--user-dir examples/laser/laser_src \
--log-interval 100 --log-format simple \
--task laser --arch laser_lstm \
--save-dir . \
--optimizer adam \
--lr 0.001 \
--lr-scheduler inverse_sqrt \
--clip-norm 5 \
--warmup-updates 90000 \
--update-freq 2 \
--dropout 0.0 \
--encoder-dropout-out 0.1 \
--max-tokens 2000 \
--max-epoch 50 \
--encoder-bidirectional \
--encoder-layers 5 \
--encoder-hidden-size 512 \
--decoder-layers 1 \
--decoder-hidden-size 2048 \
--encoder-embed-dim 320 \
--decoder-embed-dim 320 \
--decoder-lang-embed-dim 32 \
--warmup-init-lr 0.001 \
--disable-validation
```
## Applications
We showcase several applications of multilingual sentence embeddings
with code to reproduce our results (in the directory "tasks").
* [**Cross-lingual document classification**](https://github.com/facebookresearch/LASER/tree/master/tasks/mldoc) using the
[*MLDoc*](https://github.com/facebookresearch/MLDoc) corpus [2,6]
* [**WikiMatrix**](https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix)
Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia [7]
* [**Bitext mining**](https://github.com/facebookresearch/LASER/tree/master/tasks/bucc) using the
[*BUCC*](https://comparable.limsi.fr/bucc2018/bucc2018-task.html) corpus [3,5]
* [**Cross-lingual NLI**](https://github.com/facebookresearch/LASER/tree/master/tasks/xnli)
using the [*XNLI*](https://www.nyu.edu/projects/bowman/xnli/) corpus [4,5,6]
* [**Multilingual similarity search**](https://github.com/facebookresearch/LASER/tree/master/tasks/similarity) [1,6]
* [**Sentence embedding of text files**](https://github.com/facebookresearch/LASER/tree/master/tasks/embed)
example how to calculate sentence embeddings for arbitrary text files in any of the supported language.
**For all tasks, we use exactly the same multilingual encoder, without any task specific optimization or fine-tuning.**
## References
[1] Holger Schwenk and Matthijs Douze,
[*Learning Joint Multilingual Sentence Representations with Neural Machine Translation*](https://aclanthology.info/papers/W17-2619/w17-2619),
ACL workshop on Representation Learning for NLP, 2017
[2] Holger Schwenk and Xian Li,
[*A Corpus for Multilingual Document Classification in Eight Languages*](http://www.lrec-conf.org/proceedings/lrec2018/pdf/658.pdf),
LREC, pages 3548-3551, 2018.
[3] Holger Schwenk,
[*Filtering and Mining Parallel Data in a Joint Multilingual Space*](http://aclweb.org/anthology/P18-2037)
ACL, July 2018
[4] Alexis Conneau, Guillaume Lample, Ruty Rinott, Adina Williams, Samuel R. Bowman, Holger Schwenk and Veselin Stoyanov,
[*XNLI: Cross-lingual Sentence Understanding through Inference*](https://aclweb.org/anthology/D18-1269),
EMNLP, 2018.
[5] Mikel Artetxe and Holger Schwenk,
[*Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings*](https://arxiv.org/abs/1811.01136)
arXiv, Nov 3 2018.
[6] Mikel Artetxe and Holger Schwenk,
[*Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond*](https://arxiv.org/abs/1812.10464)
arXiv, Dec 26 2018.
[7] Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong and Paco Guzman,
[*WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia*](https://arxiv.org/abs/1907.05791)
arXiv, July 11 2019.
[8] Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave and Armand Joulin
[*CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB*](https://arxiv.org/abs/1911.04944)
|