<p align="center">
<img src="flores_logo.png" width="500">
</p>

# Flores101: Large-Scale Multilingual Machine Translation

## Introduction

Baseline pretrained models for the small and large tracks of the WMT 21 Large-Scale Multilingual Machine Translation competition.

Flores Task at WMT 21: http://www.statmt.org/wmt21/large-scale-multilingual-translation-task.html

Flores announcement blog post: https://ai.facebook.com/blog/flores-researchers-kick-off-multilingual-translation-challenge-at-wmt-and-call-for-compute-grants/

## Pretrained models

Model | Num layers | Embed dimension | FFN dimension | Vocab size | #params | Download
---|---|---|---|---|---|---
`flores101_mm100_615M` | 12 | 1024 | 4096 | 256,000 | 615M | https://dl.fbaipublicfiles.com/flores101/pretrained_models/flores101_mm100_615M.tar.gz
`flores101_mm100_175M` | 6 | 512 | 2048 | 256,000 | 175M | https://dl.fbaipublicfiles.com/flores101/pretrained_models/flores101_mm100_175M.tar.gz

These models are trained similarly to [M2M-100](https://arxiv.org/abs/2010.11125), with additional support for the languages that are part of the WMT Large-Scale Multilingual Machine Translation track. The full list of supported languages is at the bottom of this page.
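
As a rough sanity check on the table, the parameter counts can be approximated from the listed dimensions. The sketch below rests on assumptions that the table does not state: a standard Transformer encoder-decoder in which "Num layers" applies to both the encoder and the decoder, and a single embedding matrix shared across encoder input, decoder input, and decoder output. Biases and layer norms are ignored, so the result is only approximate. `approx_params` is a hypothetical helper name.

```bash
# Approximate parameter count for an encoder-decoder Transformer.
# Assumptions (not stated in the table): the same number of layers in the
# encoder and the decoder, one shared embedding matrix, and biases and
# layer norms ignored.
approx_params() {
    local layers=$1 embed=$2 ffn=$3 vocab=$4
    local attn=$(( 4 * embed * embed ))        # Q, K, V and output projections
    local ffn_params=$(( 2 * embed * ffn ))    # two feed-forward matrices
    local enc_layer=$(( attn + ffn_params ))       # self-attention + FFN
    local dec_layer=$(( 2 * attn + ffn_params ))   # self- and cross-attention + FFN
    echo $(( vocab * embed + layers * (enc_layer + dec_layer) ))
}

approx_params 12 1024 4096 256000   # flores101_mm100_615M: prints 614465536
approx_params 6 512 2048 256000     # flores101_mm100_175M: prints 175112192
```

Under these assumptions the estimates land within about 1M of the `#params` column, with the shared embedding matrix accounting for a large share of each model.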

## Example generation code

### Download the model and SentencePiece vocabulary

```bash
fairseq=/path/to/fairseq
cd $fairseq

# Download the 615M-parameter model.
wget https://dl.fbaipublicfiles.com/flores101/pretrained_models/flores101_mm100_615M.tar.gz

# Extract the archive.
tar -xvzf flores101_mm100_615M.tar.gz
```
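
The later steps read four files from the extracted directory: `model.pt`, `dict.txt`, `sentencepiece.bpe.model`, and `language_pairs.txt`. A minimal sketch of a check that all four are present before continuing; `check_model_dir` is a hypothetical helper, not part of fairseq.

```bash
# Report every file that the later steps expect but that is missing.
# Returns non-zero if at least one file is absent.
check_model_dir() {
    local dir=$1 missing=0 f
    for f in model.pt dict.txt sentencepiece.bpe.model language_pairs.txt; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage: check_model_dir flores101_mm100_615M && echo "model files OK"
```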

### Encode using our SentencePiece model

Note: Install SentencePiece from [here](https://github.com/google/sentencepiece).

```bash
fairseq=/path/to/fairseq
cd $fairseq

# Download an example dataset, German to French
sacrebleu --echo src -l de-fr -t wmt19 | head -n 20 > raw_input.de-fr.de
sacrebleu --echo ref -l de-fr -t wmt19 | head -n 20 > raw_input.de-fr.fr

# Encode both sides with the model's SentencePiece vocabulary
for lang in de fr ; do
    python scripts/spm_encode.py \
        --model flores101_mm100_615M/sentencepiece.bpe.model \
        --output_format=piece \
        --inputs=raw_input.de-fr.${lang} \
        --outputs=spm.de-fr.${lang}
done
```
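
Source and target sentences are paired line by line, so it can help to confirm that the two encoded files still have the same number of lines before binarizing them. A minimal sketch; `same_line_count` is a hypothetical helper.

```bash
# Succeeds only if both files contain the same number of lines.
# Returns 2 if either file cannot be read.
same_line_count() {
    local a b
    a=$(wc -l < "$1") || return 2
    b=$(wc -l < "$2") || return 2
    [ "$a" -eq "$b" ]
}

# Usage: same_line_count spm.de-fr.de spm.de-fr.fr || echo "line counts differ" >&2
```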

### Binarization

```bash
fairseq-preprocess \
    --source-lang de --target-lang fr \
    --testpref spm.de-fr \
    --thresholdsrc 0 --thresholdtgt 0 \
    --destdir data_bin \
    --srcdict flores101_mm100_615M/dict.txt --tgtdict flores101_mm100_615M/dict.txt
```

### Generation

```bash
fairseq-generate \
    data_bin \
    --batch-size 1 \
    --path flores101_mm100_615M/model.pt \
    --fixed-dictionary flores101_mm100_615M/dict.txt \
    -s de -t fr \
    --remove-bpe 'sentencepiece' \
    --beam 5 \
    --task translation_multi_simple_epoch \
    --lang-pairs flores101_mm100_615M/language_pairs.txt \
    --decoder-langtok --encoder-langtok src \
    --gen-subset test \
    --fp16 \
    --dataset-impl mmap \
    --distributed-world-size 1 --distributed-no-spawn
```
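
`fairseq-generate` writes sources, references, and hypotheses interleaved to standard output, with hypothesis lines of the form `H-<sentence id><TAB><score><TAB><text>`. The sketch below pulls the hypotheses out in original sentence order so they can be scored. It assumes the output of the command above was redirected to a file; the file name `gen_out` and the helper name `extract_hypotheses` are illustrative, not part of fairseq.

```bash
# Print the hypotheses from a fairseq-generate log in original sentence order.
# Hypothesis lines have the form: H-<sentence id><TAB><score><TAB><text>
extract_hypotheses() {
    grep '^H-' "$1" \
        | sed 's/^H-//' \
        | sort -t $'\t' -k1,1n \
        | cut -f 3-
}

# Usage (assumes the generation output was saved to gen_out):
#   extract_hypotheses gen_out > hyp.fr
#   sacrebleu raw_input.de-fr.fr < hyp.fr
```

Sorting numerically on the sentence id matters because the hypotheses are not guaranteed to appear in input order, while the reference file is.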

### Supported languages and language codes

Language | lang code
---|---
Afrikaans | af
Amharic | am
Arabic | ar
Assamese | as
Asturian | ast
Aymara | ay
Azerbaijani | az
Bashkir | ba
Belarusian | be
Bulgarian | bg
Bengali | bn
Breton | br
Bosnian | bs
Catalan | ca
Cebuano | ceb
Chokwe | cjk
Czech | cs
Welsh | cy
Danish | da
German | de
Dyula | dyu
Greek | el
English | en
Spanish | es
Estonian | et
Persian | fa
Fulah | ff
Finnish | fi
French | fr
Western Frisian | fy
Irish | ga
Scottish Gaelic | gd
Galician | gl
Gujarati | gu
Hausa | ha
Hebrew | he
Hindi | hi
Croatian | hr
Haitian Creole | ht
Hungarian | hu
Armenian | hy
Indonesian | id
Igbo | ig
Iloko | ilo
Icelandic | is
Italian | it
Japanese | ja
Javanese | jv
Georgian | ka
Kachin | kac
Kamba | kam
Kabuverdianu | kea
Kongo | kg
Kazakh | kk
Central Khmer | km
Kimbundu | kmb
Northern Kurdish | kmr
Kannada | kn
Korean | ko
Kurdish | ku
Kyrgyz | ky
Luxembourgish | lb
Ganda | lg
Lingala | ln
Lao | lo
Lithuanian | lt
Luo | luo
Latvian | lv
Malagasy | mg
Maori | mi
Macedonian | mk
Malayalam | ml
Mongolian | mn
Marathi | mr
Malay | ms
Maltese | mt
Burmese | my
Nepali | ne
Dutch | nl
Norwegian | no
Northern Sotho | ns
Nyanja | ny
Occitan | oc
Oromo | om
Oriya | or
Punjabi | pa
Polish | pl
Pashto | ps
Portuguese | pt
Quechua | qu
Romanian | ro
Russian | ru
Sindhi | sd
Shan | shn
Sinhala | si
Slovak | sk
Slovenian | sl
Shona | sn
Somali | so
Albanian | sq
Serbian | sr
Swati | ss
Sundanese | su
Swedish | sv
Swahili | sw
Tamil | ta
Telugu | te
Tajik | tg
Thai | th
Tigrinya | ti
Tagalog | tl
Tswana | tn
Turkish | tr
Ukrainian | uk
Umbundu | umb
Urdu | ur
Uzbek | uz
Vietnamese | vi
Wolof | wo
Xhosa | xh
Yiddish | yi
Yoruba | yo
Chinese | zh
Zulu | zu