---
pipeline_tag: translation
license: "cc-by-nc-4.0"
inference: false
language:
- multilingual
- af
- am
- ar
- as
- ast
- ay
- az
- ba
- be
- bg
- bn
- br
- bs
- ca
- ceb
- cjk
- cs
- cy
- da
- de
- dyu
- el
- en
- es
- et
- fa
- ff
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- ht
- hu
- hy
- id
- ig
- ilo
- is
- it
- ja
- jv
- ka
- kac
- kam
- kea
- kg
- kk
- km
- kmb
- kmr
- kn
- ko
- ku
- ky
- lb
- lg
- ln
- lo
- lt
- luo
- lv
- mg
- mi
- mk
- ml
- mn
- mr
- ms
- mt
- my
- ne
- nl
- 'no'
- ns
- ny
- oc
- om
- or
- pa
- pl
- ps
- pt
- qu
- ro
- ru
- sd
- shn
- si
- sk
- sl
- sn
- so
- sq
- sr
- ss
- su
- sv
- sw
- ta
- te
- tg
- th
- ti
- tl
- tn
- tr
- uk
- umb
- ur
- uz
- vi
- wo
- xh
- yi
- yo
- zh
- zu
---


# Flores101: Large-Scale Multilingual Machine Translation

`flores101_mm100_175M` is a multilingual encoder-decoder (seq-to-seq) model trained for many-to-many multilingual translation. It was released in the [fairseq Flores101 repository](https://github.com/facebookresearch/fairseq/tree/main/examples/flores101).

The model architecture and config are the same as in the [M2M100](https://huggingface.co/facebook/m2m100_418M) implementation, but the **tokenizer should be modified** to use this model's language codes.

**Demo**: https://huggingface.co/spaces/seyoungsong/flores101_mm100_175M

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

hi_text = "जीवन एक चॉकलेट बॉक्स की तरह है।"
chinese_text = "生活就像一盒巧克力。"

model = M2M100ForConditionalGeneration.from_pretrained("seyoungsong/flores101_mm100_175M")
tokenizer: M2M100Tokenizer = M2M100Tokenizer.from_pretrained("seyoungsong/flores101_mm100_175M")

# Fix the tokenizer: rebuild the language-code lookup tables from the special tokens.
# Only special tokens with an ID above 5 are kept as language tokens.
tokenizer.lang_token_to_id = {t: i for t, i in zip(tokenizer.all_special_tokens, tokenizer.all_special_ids) if i > 5}
tokenizer.lang_code_to_token = {s.strip("_"): s for s in tokenizer.lang_token_to_id}
tokenizer.lang_code_to_id = {s.strip("_"): i for s, i in tokenizer.lang_token_to_id.items()}
tokenizer.id_to_lang_token = {i: s for s, i in tokenizer.lang_token_to_id.items()}

# Translate Hindi to French
tokenizer.src_lang = "hi"
encoded_hi = tokenizer(hi_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_hi, forced_bos_token_id=tokenizer.get_lang_id("fr"))
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "La vie est comme une boîte de chocolat."

# Translate Chinese to English
tokenizer.src_lang = "zh"
encoded_zh = tokenizer(chinese_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_zh, forced_bos_token_id=tokenizer.get_lang_id("en"))
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "Life is like a chocolate box."
```
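
The tokenizer fix and the translate steps from the example above can be wrapped in two small helpers. This is only a sketch: `fix_lang_codes` and `translate` are hypothetical names introduced here, not part of `transformers`, and they assume a tokenizer that behaves like the one loaded above (language tokens among the special tokens with IDs above 5, plus `src_lang`, `get_lang_id`, and `batch_decode`).

```python
def fix_lang_codes(tokenizer):
    """Rebuild the language lookup tables, mirroring the fix in the example above."""
    # Keep only special tokens with an ID above 5 (the same filter as above).
    lang_token_to_id = {
        tok: idx
        for tok, idx in zip(tokenizer.all_special_tokens, tokenizer.all_special_ids)
        if idx > 5
    }
    tokenizer.lang_token_to_id = lang_token_to_id
    # Strip the surrounding underscores to get the bare language code.
    tokenizer.lang_code_to_token = {tok.strip("_"): tok for tok in lang_token_to_id}
    tokenizer.lang_code_to_id = {tok.strip("_"): idx for tok, idx in lang_token_to_id.items()}
    tokenizer.id_to_lang_token = {idx: tok for tok, idx in lang_token_to_id.items()}
    return tokenizer


def translate(model, tokenizer, text, src_lang, tgt_lang):
    """Translate one string; expects fix_lang_codes(tokenizer) to have been called."""
    for code in (src_lang, tgt_lang):
        if code not in tokenizer.lang_code_to_id:
            raise ValueError(f"unknown language code: {code!r}")
    tokenizer.src_lang = src_lang
    encoded = tokenizer(text, return_tensors="pt")
    # Force the decoder to start with the target-language token.
    generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id(tgt_lang))
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]


# Usage with the objects loaded in the example above:
# fix_lang_codes(tokenizer)
# print(translate(model, tokenizer, hi_text, "hi", "fr"))
```

Passing `model` and `tokenizer` in as arguments keeps the helpers free of global state, so the same pair can be reused across many language directions.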

## Languages covered

| Language         | Lang code |
| ---------------- | --------- |
| Afrikaans        | af        |
| Amharic          | am        |
| Arabic           | ar        |
| Assamese         | as        |
| Asturian         | ast       |
| Aymara           | ay        |
| Azerbaijani      | az        |
| Bashkir          | ba        |
| Belarusian       | be        |
| Bulgarian        | bg        |
| Bengali          | bn        |
| Breton           | br        |
| Bosnian          | bs        |
| Catalan          | ca        |
| Cebuano          | ceb       |
| Chokwe           | cjk       |
| Czech            | cs        |
| Welsh            | cy        |
| Danish           | da        |
| German           | de        |
| Dyula            | dyu       |
| Greek            | el        |
| English          | en        |
| Spanish          | es        |
| Estonian         | et        |
| Persian          | fa        |
| Fulah            | ff        |
| Finnish          | fi        |
| French           | fr        |
| Western Frisian  | fy        |
| Irish            | ga        |
| Scottish Gaelic  | gd        |
| Galician         | gl        |
| Gujarati         | gu        |
| Hausa            | ha        |
| Hebrew           | he        |
| Hindi            | hi        |
| Croatian         | hr        |
| Haitian Creole   | ht        |
| Hungarian        | hu        |
| Armenian         | hy        |
| Indonesian       | id        |
| Igbo             | ig        |
| Iloko            | ilo       |
| Icelandic        | is        |
| Italian          | it        |
| Japanese         | ja        |
| Javanese         | jv        |
| Georgian         | ka        |
| Kachin           | kac       |
| Kamba            | kam       |
| Kabuverdianu     | kea       |
| Kongo            | kg        |
| Kazakh           | kk        |
| Central Khmer    | km        |
| Kimbundu         | kmb       |
| Northern Kurdish | kmr       |
| Kannada          | kn        |
| Korean           | ko        |
| Kurdish          | ku        |
| Kyrgyz           | ky        |
| Luxembourgish    | lb        |
| Ganda            | lg        |
| Lingala          | ln        |
| Lao              | lo        |
| Lithuanian       | lt        |
| Luo              | luo       |
| Latvian          | lv        |
| Malagasy         | mg        |
| Maori            | mi        |
| Macedonian       | mk        |
| Malayalam        | ml        |
| Mongolian        | mn        |
| Marathi          | mr        |
| Malay            | ms        |
| Maltese          | mt        |
| Burmese          | my        |
| Nepali           | ne        |
| Dutch            | nl        |
| Norwegian        | no        |
| Northern Sotho   | ns        |
| Nyanja           | ny        |
| Occitan          | oc        |
| Oromo            | om        |
| Oriya            | or        |
| Punjabi          | pa        |
| Polish           | pl        |
| Pashto           | ps        |
| Portuguese       | pt        |
| Quechua          | qu        |
| Romanian         | ro        |
| Russian          | ru        |
| Sindhi           | sd        |
| Shan             | shn       |
| Sinhala          | si        |
| Slovak           | sk        |
| Slovenian        | sl        |
| Shona            | sn        |
| Somali           | so        |
| Albanian         | sq        |
| Serbian          | sr        |
| Swati            | ss        |
| Sundanese        | su        |
| Swedish          | sv        |
| Swahili          | sw        |
| Tamil            | ta        |
| Telugu           | te        |
| Tajik            | tg        |
| Thai             | th        |
| Tigrinya         | ti        |
| Tagalog          | tl        |
| Tswana           | tn        |
| Turkish          | tr        |
| Ukrainian        | uk        |
| Umbundu          | umb       |
| Urdu             | ur        |
| Uzbek            | uz        |
| Vietnamese       | vi        |
| Wolof            | wo        |
| Xhosa            | xh        |
| Yiddish          | yi        |
| Yoruba           | yo        |
| Chinese          | zh        |
| Zulu             | zu        |
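
When language names come from user input, it can help to resolve them to the codes in the table above before setting `tokenizer.src_lang` or calling `tokenizer.get_lang_id`. The sketch below is an assumption-laden illustration: `LANG_CODES` is a hand-copied excerpt of the table, and `lang_code` is a hypothetical helper, not something the model or `transformers` provides.

```python
# Hand-copied excerpt of the table above; extend it with the remaining rows as needed.
LANG_CODES = {
    "Chinese": "zh",
    "English": "en",
    "French": "fr",
    "Hindi": "hi",
    "Norwegian": "no",
}


def lang_code(name_or_code: str) -> str:
    """Return the language code for a table name or an already valid code."""
    key = name_or_code.strip().lower()
    # Accept a code that is already in the (excerpted) table.
    if key in LANG_CODES.values():
        return key
    # Otherwise look the value up as a language name, ignoring case.
    by_name = {name.lower(): code for name, code in LANG_CODES.items()}
    if key in by_name:
        return by_name[key]
    raise ValueError(f"unsupported language: {name_or_code!r}")
```

For example, `lang_code("Hindi")` and `lang_code("hi")` both resolve to `"hi"`, while a name missing from the excerpt raises `ValueError` instead of failing later inside generation.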