metadata
pipeline_tag: translation
license: mit
language:
  - multilingual
  - af
  - am
  - ar
  - as
  - ast
  - ay
  - az
  - ba
  - be
  - bg
  - bn
  - br
  - bs
  - ca
  - ceb
  - cjk
  - cs
  - cy
  - da
  - de
  - dyu
  - el
  - en
  - es
  - et
  - fa
  - ff
  - fi
  - fr
  - fy
  - ga
  - gd
  - gl
  - gu
  - ha
  - he
  - hi
  - hr
  - ht
  - hu
  - hy
  - id
  - ig
  - ilo
  - is
  - it
  - ja
  - jv
  - ka
  - kac
  - kam
  - kea
  - kg
  - kk
  - km
  - kmb
  - kmr
  - kn
  - ko
  - ku
  - ky
  - lb
  - lg
  - ln
  - lo
  - lt
  - luo
  - lv
  - mg
  - mi
  - mk
  - ml
  - mn
  - mr
  - ms
  - mt
  - my
  - ne
  - nl
  - 'no'
  - ns
  - ny
  - oc
  - om
  - or
  - pa
  - pl
  - ps
  - pt
  - qu
  - ro
  - ru
  - sd
  - shn
  - si
  - sk
  - sl
  - sn
  - so
  - sq
  - sr
  - ss
  - su
  - sv
  - sw
  - ta
  - te
  - tg
  - th
  - ti
  - tl
  - tn
  - tr
  - uk
  - umb
  - ur
  - uz
  - vi
  - wo
  - xh
  - yi
  - yo
  - zh
  - zu

Flores101: Large-Scale Multilingual Machine Translation

flores101_mm100_175M is a multilingual encoder-decoder (sequence-to-sequence) model trained for many-to-many multilingual translation. It was introduced in this paper and released in this repository.

The model architecture and config are the same as in the M2M100 implementation, but the tokenizer's language-code mappings must be rebuilt after loading so that they match this model's language tokens.
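
To make the tokenizer adjustment concrete, the sketch below reproduces the same dictionary rebuild on a toy token list, so it runs without downloading the model. The token strings, the ids, and the helper name `build_lang_maps` are illustrative assumptions; the only parts taken from this card are the `__xx__` token shape implied by `strip("_")` and the `i > 5` filter used in the usage snippet.

```python
# Hypothetical stand-ins for tokenizer.all_special_tokens / tokenizer.all_special_ids.
special_tokens = ["<s>", "<pad>", "</s>", "<unk>", "__af__", "__fr__", "__zh__"]
special_ids = [0, 1, 2, 3, 6, 7, 8]


def build_lang_maps(tokens, ids, max_non_lang_id=5):
    """Rebuild the four language lookup tables from special tokens.

    Every special token whose id is greater than `max_non_lang_id`
    is treated as a language token such as "__fr__".
    """
    lang_token_to_id = {t: i for t, i in zip(tokens, ids) if i > max_non_lang_id}
    return {
        "lang_token_to_id": lang_token_to_id,
        # "__fr__" -> "fr": strip the surrounding underscores to get the code.
        "lang_code_to_token": {t.strip("_"): t for t in lang_token_to_id},
        "lang_code_to_id": {t.strip("_"): i for t, i in lang_token_to_id.items()},
        "id_to_lang_token": {i: t for t, i in lang_token_to_id.items()},
    }


maps = build_lang_maps(special_tokens, special_ids)
print(maps["lang_code_to_id"])
```

With the real tokenizer, the same four dictionaries are assigned to the attributes of the same names, which is what lets `tokenizer.src_lang = "hi"` and `tokenizer.get_lang_id("fr")` resolve.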

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

hi_text = "जीवन एक चॉकलेट बॉक्स की तरह है।"
chinese_text = "生活就像一盒巧克力。"

model = M2M100ForConditionalGeneration.from_pretrained("seyoungsong/flores101_mm100_175M")
tokenizer: M2M100Tokenizer = M2M100Tokenizer.from_pretrained("seyoungsong/flores101_mm100_175M")

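# FIX TOKENIZER! Rebuild the language-code mappings from the special tokens
# (every special token with id > 5 is treated as a language token).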
tokenizer.lang_token_to_id = {t: i for t, i in zip(tokenizer.all_special_tokens, tokenizer.all_special_ids) if i > 5}
tokenizer.lang_code_to_token = {s.strip("_"): s for s in tokenizer.lang_token_to_id}
tokenizer.lang_code_to_id = {s.strip("_"): i for s, i in tokenizer.lang_token_to_id.items()}
tokenizer.id_to_lang_token = {i: s for s, i in tokenizer.lang_token_to_id.items()}

# translate Hindi to French
tokenizer.src_lang = "hi"
encoded_hi = tokenizer(hi_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_hi, forced_bos_token_id=tokenizer.get_lang_id("fr"))
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "La vie est comme une boîte de chocolat."

# translate Chinese to English
tokenizer.src_lang = "zh"
encoded_zh = tokenizer(chinese_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_zh, forced_bos_token_id=tokenizer.get_lang_id("en"))
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "Life is like a chocolate box."
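
The two translations above follow the same four steps, so they can be folded into one function. This is a minimal sketch, not part of the original release: the name `translate` is hypothetical, and it assumes `model` and `tokenizer` have already been loaded and patched as shown above.

```python
def translate(model, tokenizer, text, src_lang, tgt_lang):
    """Translate one string from src_lang to tgt_lang (hypothetical helper)."""
    # Tell the tokenizer which source-language token to attach to the input.
    tokenizer.src_lang = src_lang
    encoded = tokenizer(text, return_tensors="pt")
    # Force the decoder to start with the target-language token.
    generated = model.generate(
        **encoded,
        forced_bos_token_id=tokenizer.get_lang_id(tgt_lang),
    )
    # batch_decode returns a list; a single input yields a single string.
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]


# Usage, given the objects defined above:
# translate(model, tokenizer, hi_text, "hi", "fr")
# translate(model, tokenizer, chinese_text, "zh", "en")
```

Setting `src_lang` inside the function keeps the source language from leaking between calls, which is easy to miss when switching language pairs by hand.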

Languages covered

Language lang code
Afrikaans af
Amharic am
Arabic ar
Assamese as
Asturian ast
Aymara ay
Azerbaijani az
Bashkir ba
Belarusian be
Bulgarian bg
Bengali bn
Breton br
Bosnian bs
Catalan ca
Cebuano ceb
Chokwe cjk
Czech cs
Welsh cy
Danish da
German de
Dyula dyu
Greek el
English en
Spanish es
Estonian et
Persian fa
Fulah ff
Finnish fi
French fr
Western Frisian fy
Irish ga
Scottish Gaelic gd
Galician gl
Gujarati gu
Hausa ha
Hebrew he
Hindi hi
Croatian hr
Haitian Creole ht
Hungarian hu
Armenian hy
Indonesian id
Igbo ig
Iloko ilo
Icelandic is
Italian it
Japanese ja
Javanese jv
Georgian ka
Kachin kac
Kamba kam
Kabuverdianu kea
Kongo kg
Kazakh kk
Central Khmer km
Kimbundu kmb
Northern Kurdish kmr
Kannada kn
Korean ko
Kurdish ku
Kyrgyz ky
Luxembourgish lb
Ganda lg
Lingala ln
Lao lo
Lithuanian lt
Luo luo
Latvian lv
Malagasy mg
Maori mi
Macedonian mk
Malayalam ml
Mongolian mn
Marathi mr
Malay ms
Maltese mt
Burmese my
Nepali ne
Dutch nl
Norwegian no
Northern Sotho ns
Nyanja ny
Occitan oc
Oromo om
Oriya or
Punjabi pa
Polish pl
Pashto ps
Portuguese pt
Quechua qu
Romanian ro
Russian ru
Sindhi sd
Shan shn
Sinhala si
Slovak sk
Slovenian sl
Shona sn
Somali so
Albanian sq
Serbian sr
Swati ss
Sundanese su
Swedish sv
Swahili sw
Tamil ta
Telugu te
Tajik tg
Thai th
Tigrinya ti
Tagalog tl
Tswana tn
Turkish tr
Ukrainian uk
Umbundu umb
Urdu ur
Uzbek uz
Vietnamese vi
Wolof wo
Xhosa xh
Yiddish yi
Yoruba yo
Chinese zh
Zulu zu
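
Since `src_lang` and `get_lang_id` only make sense for the codes in the table above, it can help to validate a code before calling the model. The following is a sketch under that assumption; `SUPPORTED_LANG_CODES` and `check_lang_code` are hypothetical names, and the code list is copied from the table rather than read from the tokenizer.

```python
# Language codes copied from the "Languages covered" table (124 entries).
SUPPORTED_LANG_CODES = frozenset("""
af am ar as ast ay az ba be bg bn br bs ca ceb cjk cs cy da de
dyu el en es et fa ff fi fr fy ga gd gl gu ha he hi hr ht hu
hy id ig ilo is it ja jv ka kac kam kea kg kk km kmb kmr kn ko ku
ky lb lg ln lo lt luo lv mg mi mk ml mn mr ms mt my ne nl no
ns ny oc om or pa pl ps pt qu ro ru sd shn si sk sl sn so sq
sr ss su sv sw ta te tg th ti tl tn tr uk umb ur uz vi wo xh
yi yo zh zu
""".split())


def check_lang_code(code: str) -> str:
    """Return the normalized code, or raise ValueError if it is not listed."""
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_LANG_CODES:
        raise ValueError(f"Unsupported language code: {code!r}")
    return normalized


print(check_lang_code(" FR "))  # prints "fr"
```

Once the tokenizer has been patched, its `lang_code_to_id` keys could serve as the source of truth instead of this hard-coded set.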