pipeline_tag: translation
license: mit
language:
- multilingual
- af
- am
- ar
- as
- ast
- ay
- az
- ba
- be
- bg
- bn
- br
- bs
- ca
- ceb
- cjk
- cs
- cy
- da
- de
- dyu
- el
- en
- es
- et
- fa
- ff
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- ht
- hu
- hy
- id
- ig
- ilo
- is
- it
- ja
- jv
- ka
- kac
- kam
- kea
- kg
- kk
- km
- kmb
- kmr
- kn
- ko
- ku
- ky
- lb
- lg
- ln
- lo
- lt
- luo
- lv
- mg
- mi
- mk
- ml
- mn
- mr
- ms
- mt
- my
- ne
- nl
- 'no'
- ns
- ny
- oc
- om
- or
- pa
- pl
- ps
- pt
- qu
- ro
- ru
- sd
- shn
- si
- sk
- sl
- sn
- so
- sq
- sr
- ss
- su
- sv
- sw
- ta
- te
- tg
- th
- ti
- tl
- tn
- tr
- uk
- umb
- ur
- uz
- vi
- wo
- xh
- yi
- yo
- zh
- zu
Flores101: Large-Scale Multilingual Machine Translation
flores101_mm100_175M is a multilingual encoder-decoder (seq-to-seq) model trained for Many-to-Many multilingual translation. It was introduced in this paper and released in this repository.
The model architecture and config are the same as in the M2M100 implementation, but the tokenizer must be modified after loading so that its language codes match this model.
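As a rough illustration of what adjusting the language codes involves, the sketch below rebuilds four lookup tables from a list of special tokens. The token list and IDs are made up for illustration (the real ones come from the downloaded tokenizer), and `build_lang_maps` is a hypothetical helper, not part of `transformers`. It assumes that language tokens are written like `__xx__` and have IDs greater than 5.

```python
def build_lang_maps(special_tokens, special_ids, min_id=5):
    """Rebuild language lookup tables from special tokens (hypothetical helper)."""
    # Keep only special tokens whose ID is above min_id; the rest are assumed
    # to be control tokens such as <s> or <pad>.
    lang_token_to_id = {
        t: i for t, i in zip(special_tokens, special_ids) if i > min_id
    }
    return {
        "lang_token_to_id": lang_token_to_id,
        # "__hi__".strip("_") gives the bare code "hi".
        "lang_code_to_token": {t.strip("_"): t for t in lang_token_to_id},
        "lang_code_to_id": {t.strip("_"): i for t, i in lang_token_to_id.items()},
        "id_to_lang_token": {i: t for t, i in lang_token_to_id.items()},
    }


# Made-up stand-ins for tokenizer.all_special_tokens / tokenizer.all_special_ids.
tokens = ["<s>", "<pad>", "</s>", "<unk>", "__fr__", "__hi__", "__zh__"]
ids = [0, 1, 2, 3, 1000, 1001, 1002]

maps = build_lang_maps(tokens, ids)
print(maps["lang_code_to_token"]["hi"])  # __hi__
print(maps["lang_code_to_id"]["fr"])     # 1000
```

With a real tokenizer, the four dictionaries would be assigned to the attributes of the same names.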
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
hi_text = "जीवन एक चॉकलेट बॉक्स की तरह है।"
chinese_text = "生活就像一盒巧克力。"
model = M2M100ForConditionalGeneration.from_pretrained("seyoungsong/flores101_mm100_175M")
tokenizer: M2M100Tokenizer = M2M100Tokenizer.from_pretrained("seyoungsong/flores101_mm100_175M")
# Fix the tokenizer: rebuild its language-code lookup tables from the special tokens (IDs above 5).
tokenizer.lang_token_to_id = {t: i for t, i in zip(tokenizer.all_special_tokens, tokenizer.all_special_ids) if i > 5}
tokenizer.lang_code_to_token = {s.strip("_"): s for s in tokenizer.lang_token_to_id}
tokenizer.lang_code_to_id = {s.strip("_"): i for s, i in tokenizer.lang_token_to_id.items()}
tokenizer.id_to_lang_token = {i: s for s, i in tokenizer.lang_token_to_id.items()}
# translate Hindi to French
tokenizer.src_lang = "hi"
encoded_hi = tokenizer(hi_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_hi, forced_bos_token_id=tokenizer.get_lang_id("fr"))
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "La vie est comme une boîte de chocolat."
# translate Chinese to English
tokenizer.src_lang = "zh"
encoded_zh = tokenizer(chinese_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_zh, forced_bos_token_id=tokenizer.get_lang_id("en"))
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "Life is like a chocolate box."
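The two translations above repeat the same steps, so they can be folded into one function. The following is a minimal sketch: `translate` is a hypothetical helper name, and it expects a `model` and `tokenizer` that were already loaded and patched as in the example above. It uses only calls that appear in that example.

```python
def translate(model, tokenizer, text, src_lang, tgt_lang):
    """Translate one string (hypothetical helper built from the calls above)."""
    # Tell the tokenizer which language the input is written in.
    tokenizer.src_lang = src_lang
    encoded = tokenizer(text, return_tensors="pt")
    # Force the decoder to start with the target-language token.
    generated = model.generate(
        **encoded, forced_bos_token_id=tokenizer.get_lang_id(tgt_lang)
    )
    # batch_decode returns a list with one string per input; take the first.
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
```

For instance, `translate(model, tokenizer, hi_text, "hi", "fr")` would run the Hindi-to-French case from the example.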
Languages covered
Language | lang code |
---|---|
Afrikaans | af |
Amharic | am |
Arabic | ar |
Assamese | as |
Asturian | ast |
Aymara | ay |
Azerbaijani | az |
Bashkir | ba |
Belarusian | be |
Bulgarian | bg |
Bengali | bn |
Breton | br |
Bosnian | bs |
Catalan | ca |
Cebuano | ceb |
Chokwe | cjk |
Czech | cs |
Welsh | cy |
Danish | da |
German | de |
Dyula | dyu |
Greek | el |
English | en |
Spanish | es |
Estonian | et |
Persian | fa |
Fulah | ff |
Finnish | fi |
French | fr |
Western Frisian | fy |
Irish | ga |
Scottish Gaelic | gd |
Galician | gl |
Gujarati | gu |
Hausa | ha |
Hebrew | he |
Hindi | hi |
Croatian | hr |
Haitian Creole | ht |
Hungarian | hu |
Armenian | hy |
Indonesian | id |
Igbo | ig |
Iloko | ilo |
Icelandic | is |
Italian | it |
Japanese | ja |
Javanese | jv |
Georgian | ka |
Kachin | kac |
Kamba | kam |
Kabuverdianu | kea |
Kongo | kg |
Kazakh | kk |
Central Khmer | km |
Kimbundu | kmb |
Northern Kurdish | kmr |
Kannada | kn |
Korean | ko |
Kurdish | ku |
Kyrgyz | ky |
Luxembourgish | lb |
Ganda | lg |
Lingala | ln |
Lao | lo |
Lithuanian | lt |
Luo | luo |
Latvian | lv |
Malagasy | mg |
Maori | mi |
Macedonian | mk |
Malayalam | ml |
Mongolian | mn |
Marathi | mr |
Malay | ms |
Maltese | mt |
Burmese | my |
Nepali | ne |
Dutch | nl |
Norwegian | no |
Northern Sotho | ns |
Nyanja | ny |
Occitan | oc |
Oromo | om |
Oriya | or |
Punjabi | pa |
Polish | pl |
Pashto | ps |
Portuguese | pt |
Quechua | qu |
Romanian | ro |
Russian | ru |
Sindhi | sd |
Shan | shn |
Sinhala | si |
Slovak | sk |
Slovenian | sl |
Shona | sn |
Somali | so |
Albanian | sq |
Serbian | sr |
Swati | ss |
Sundanese | su |
Swedish | sv |
Swahili | sw |
Tamil | ta |
Telugu | te |
Tajik | tg |
Thai | th |
Tigrinya | ti |
Tagalog | tl |
Tswana | tn |
Turkish | tr |
Ukrainian | uk |
Umbundu | umb |
Urdu | ur |
Uzbek | uz |
Vietnamese | vi |
Wolof | wo |
Xhosa | xh |
Yiddish | yi |
Yoruba | yo |
Chinese | zh |
Zulu | zu |
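
Since `tokenizer.src_lang` and `tokenizer.get_lang_id` take the codes from this table, it can help to check a code before using it. The sketch below copies the codes from the table; `SUPPORTED_LANGS` and `check_lang` are hypothetical names, not part of the model or of `transformers`.

```python
# Language codes copied from the "Languages covered" table.
SUPPORTED_LANGS = frozenset("""
af am ar as ast ay az ba be bg bn br bs ca ceb cjk cs cy da de dyu el en es
et fa ff fi fr fy ga gd gl gu ha he hi hr ht hu hy id ig ilo is it ja jv ka
kac kam kea kg kk km kmb kmr kn ko ku ky lb lg ln lo lt luo lv mg mi mk ml
mn mr ms mt my ne nl no ns ny oc om or pa pl ps pt qu ro ru sd shn si sk sl
sn so sq sr ss su sv sw ta te tg th ti tl tn tr uk umb ur uz vi wo xh yi yo
zh zu
""".split())


def check_lang(code):
    """Return the code unchanged if the table lists it, else raise ValueError."""
    if code not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported language code: {code!r}")
    return code


print(len(SUPPORTED_LANGS))  # 124
print(check_lang("hi"))      # hi
```

A call such as `tokenizer.src_lang = check_lang("hi")` would then fail early with a readable message for a code the table does not list.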