	<p align="center">
	<img src="flores_logo.png" width="500">
	</p>

	# Flores101: Large-Scale Multilingual Machine Translation

	## Introduction

	Baseline pretrained models for the small and large tracks of the WMT 21 Large-Scale Multilingual Machine Translation competition.

	Flores Task at WMT 21: http://www.statmt.org/wmt21/large-scale-multilingual-translation-task.html

	Flores announcement blog post: https://ai.facebook.com/blog/flores-researchers-kick-off-multilingual-translation-challenge-at-wmt-and-call-for-compute-grants/



	## Pretrained models

	Model | Num layers | Embed dimension | FFN dimension | Vocab size | #params | Download
	---|---|---|---|---|---|---
	`flores101_mm100_615M` | 12 | 1024 | 4096 | 256,000 | 615M | https://dl.fbaipublicfiles.com/flores101/pretrained_models/flores101_mm100_615M.tar.gz
	`flores101_mm100_175M` | 6 | 512 | 2048 | 256,000 | 175M | https://dl.fbaipublicfiles.com/flores101/pretrained_models/flores101_mm100_175M.tar.gz


	These models are trained similarly to [M2M-100](https://arxiv.org/abs/2010.11125), with additional support for the languages that are part of the WMT Large-Scale Multilingual Machine Translation track. The full list of supported languages is at the bottom of this page.


	## Example Generation code

	### Download model, sentencepiece vocab

	```bash
	fairseq=/path/to/fairseq
	cd $fairseq

	# Download 615M param model.
	wget https://dl.fbaipublicfiles.com/flores101/pretrained_models/flores101_mm100_615M.tar.gz

	# Extract
	tar -xvzf flores101_mm100_615M.tar.gz
	```
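Before moving on, it can help to confirm that the extracted directory contains the files the later commands refer to (`model.pt`, `dict.txt`, `sentencepiece.bpe.model`, `language_pairs.txt`). The following is a minimal sketch; the function name `check_model_dir` is hypothetical, and the file list is taken from the commands in this README rather than from an official manifest of the archive.

```bash
# Hypothetical helper: verify that the files referenced by the later
# commands in this README exist in the extracted model directory.
# $1 = model directory (e.g. flores101_mm100_615M)
check_model_dir() {
    status=0
    for f in model.pt dict.txt sentencepiece.bpe.model language_pairs.txt; do
        if [ ! -f "$1/$f" ]; then
            echo "missing: $1/$f" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage (after extracting the archive):
#   check_model_dir flores101_mm100_615M && echo "model directory looks complete"
```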

	### Encode using our SentencePiece Model
	Note: Install SentencePiece from [here](https://github.com/google/sentencepiece)


	```bash
	fairseq=/path/to/fairseq
	cd $fairseq

	# Download an example dataset: the first 20 German-French sentence pairs of WMT19
	sacrebleu --echo src -l de-fr -t wmt19 | head -n 20 > raw_input.de-fr.de
	sacrebleu --echo ref -l de-fr -t wmt19 | head -n 20 > raw_input.de-fr.fr

	# Encode both sides with the model's SentencePiece vocabulary
	for lang in de fr ; do
	    python scripts/spm_encode.py \
	        --model flores101_mm100_615M/sentencepiece.bpe.model \
	        --output_format=piece \
	        --inputs=raw_input.de-fr.${lang} \
	        --outputs=spm.de-fr.${lang}
	done
	```
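Encoding should keep one sentence per line, so the source and target files are expected to have the same number of lines before binarization. A minimal sanity check, assuming the file naming used above (`spm.de-fr.de`, `spm.de-fr.fr`), might look like this; `check_parallel` is a hypothetical helper name, not part of fairseq.

```bash
# Hypothetical helper: compare line counts of the two sides of a parallel corpus.
# $1 = file prefix (e.g. spm.de-fr), $2 = source language, $3 = target language
check_parallel() {
    # The arithmetic expansion strips the padding some wc implementations add.
    src_lines=$(($(wc -l < "$1.$2")))
    tgt_lines=$(($(wc -l < "$1.$3")))
    if [ "$src_lines" -ne "$tgt_lines" ]; then
        echo "line count mismatch: $1.$2 has $src_lines, $1.$3 has $tgt_lines" >&2
        return 1
    fi
    echo "$src_lines sentence pairs"
}

# Usage:
#   check_parallel spm.de-fr de fr
```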

	### Binarization

	```bash
	fairseq-preprocess \
	--source-lang de --target-lang fr \
	--testpref spm.de-fr \
	--thresholdsrc 0 --thresholdtgt 0 \
	--destdir data_bin \
	--srcdict flores101_mm100_615M/dict.txt --tgtdict flores101_mm100_615M/dict.txt
	```
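To confirm that binarization produced output, one option is to look for the files `fairseq-preprocess` writes into `data_bin`. The sketch below assumes the `<split>.<src>-<tgt>.<lang>.bin` / `.idx` and `dict.<lang>.txt` naming; that naming is an assumption about the fairseq version in use, and both function names are hypothetical.

```bash
# Hypothetical helper: print the file names fairseq-preprocess is assumed
# to write for one split (naming is an assumption, see the text above).
# $1 = split (e.g. test), $2 = source language, $3 = target language
expected_bin_files() {
    for lang in "$2" "$3"; do
        echo "$1.$2-$3.$lang.bin"
        echo "$1.$2-$3.$lang.idx"
        echo "dict.$lang.txt"
    done
}

# Hypothetical helper: report expected files that are absent.
# $1 = destination directory, remaining arguments as for expected_bin_files
check_data_bin() {
    dir="$1"; shift
    status=0
    for f in $(expected_bin_files "$@"); do
        [ -f "$dir/$f" ] || { echo "missing: $dir/$f" >&2; status=1; }
    done
    return "$status"
}

# Usage:
#   check_data_bin data_bin test de fr
```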

	### Generation


	```bash
	fairseq-generate \
	data_bin \
	--batch-size 1 \
	--path flores101_mm100_615M/model.pt \
	--fixed-dictionary flores101_mm100_615M/dict.txt \
	-s de -t fr \
	--remove-bpe 'sentencepiece' \
	--beam 5 \
	--task translation_multi_simple_epoch \
	--lang-pairs flores101_mm100_615M/language_pairs.txt \
	--decoder-langtok --encoder-langtok src \
	--gen-subset test \
	--fp16 \
	--dataset-impl mmap \
	--distributed-world-size 1 --distributed-no-spawn
	```
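`fairseq-generate` prints sources, references, and hypotheses interleaved, and not necessarily in input order. A minimal sketch for recovering the hypotheses in corpus order follows. It assumes the output was redirected to a file and that hypothesis lines have the form `H-<id><TAB><score><TAB><text>`; the helper name `extract_hyps` and the file names in the usage comment are hypothetical.

```bash
# Hypothetical helper: pull hypotheses out of saved fairseq-generate output.
# Assumes hypothesis lines look like: H-<id><TAB><score><TAB><text>
# $1 = file containing the saved output
extract_hyps() {
    grep '^H-' "$1" |                          # keep hypothesis lines only
        sed 's/^H-//' |                        # drop the prefix, leaving the numeric id
        sort -t "$(printf '\t')" -n -k1,1 |    # restore input order by sentence id
        cut -f3-                               # keep the text column
}

# Usage (gen.out and hyp.fr are placeholder file names):
#   fairseq-generate ... > gen.out
#   extract_hyps gen.out > hyp.fr
#   sacrebleu raw_input.de-fr.fr < hyp.fr
```

Sorting by sentence id matters because the hypotheses must line up with `raw_input.de-fr.fr` line by line before scoring.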

	### Supported Languages and lang code

	Language | lang code
	---|---
	Afrikaans | af
	Amharic | am
	Arabic | ar
	Assamese | as
	Asturian | ast
	Aymara | ay
	Azerbaijani | az
	Bashkir | ba
	Belarusian | be
	Bulgarian | bg
	Bengali | bn
	Breton | br
	Bosnian | bs
	Catalan | ca
	Cebuano | ceb
	Chokwe | cjk
	Czech | cs
	Welsh | cy
	Danish | da
	German | de
	Dyula | dyu
	Greek | el
	English | en
	Spanish | es
	Estonian | et
	Persian | fa
	Fulah | ff
	Finnish | fi
	French | fr
	Western Frisian | fy
	Irish | ga
	Scottish Gaelic | gd
	Galician | gl
	Gujarati | gu
	Hausa | ha
	Hebrew | he
	Hindi | hi
	Croatian | hr
	Haitian Creole | ht
	Hungarian | hu
	Armenian | hy
	Indonesian | id
	Igbo | ig
	Iloko | ilo
	Icelandic | is
	Italian | it
	Japanese | ja
	Javanese | jv
	Georgian | ka
	Kachin | kac
	Kamba | kam
	Kabuverdianu | kea
	Kongo | kg
	Kazakh | kk
	Central Khmer | km
	Kimbundu | kmb
	Northern Kurdish | kmr
	Kannada | kn
	Korean | ko
	Kurdish | ku
	Kyrgyz | ky
	Luxembourgish | lb
	Ganda | lg
	Lingala | ln
	Lao | lo
	Lithuanian | lt
	Luo | luo
	Latvian | lv
	Malagasy | mg
	Maori | mi
	Macedonian | mk
	Malayalam | ml
	Mongolian | mn
	Marathi | mr
	Malay | ms
	Maltese | mt
	Burmese | my
	Nepali | ne
	Dutch | nl
	Norwegian | no
	Northern Sotho | ns
	Nyanja | ny
	Occitan | oc
	Oromo | om
	Oriya | or
	Punjabi | pa
	Polish | pl
	Pashto | ps
	Portuguese | pt
	Quechua | qu
	Romanian | ro
	Russian | ru
	Sindhi | sd
	Shan | shn
	Sinhala | si
	Slovak | sk
	Slovenian | sl
	Shona | sn
	Somali | so
	Albanian | sq
	Serbian | sr
	Swati | ss
	Sundanese | su
	Swedish | sv
	Swahili | sw
	Tamil | ta
	Telugu | te
	Tajik | tg
	Thai | th
	Tigrinya | ti
	Tagalog | tl
	Tswana | tn
	Turkish | tr
	Ukrainian | uk
	Umbundu | umb
	Urdu | ur
	Uzbek | uz
	Vietnamese | vi
	Wolof | wo
	Xhosa | xh
	Yiddish | yi
	Yoruba | yo
	Chinese | zh
	Zulu | zu