seyoungsong committed: add FIX TOKENIZER! instructions
README.md CHANGED
@@ -107,8 +107,6 @@ language:
 
 # `flores101_mm100_175M`
 
-https://www.statmt.org/wmt21/large-scale-multilingual-translation-task.html
-
 `flores101_mm100_175M` is a multilingual encoder-decoder (seq-to-seq) model trained for Many-to-Many multilingual translation. It was first released in [this](https://github.com/facebookresearch/fairseq/tree/main/examples/flores101) repository.
 
 ```python
@@ -117,8 +115,14 @@ from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
 
 hi_text = "जीवन एक चॉकलेट बॉक्स की तरह है।"
 chinese_text = "生活就像一盒巧克力。"
 
-model = M2M100ForConditionalGeneration.from_pretrained("
-tokenizer = M2M100Tokenizer.from_pretrained("
+model = M2M100ForConditionalGeneration.from_pretrained("seyoungsong/flores101_mm100_175M")
+tokenizer: M2M100Tokenizer = M2M100Tokenizer.from_pretrained("seyoungsong/flores101_mm100_175M")
+
+# FIX TOKENIZER!
+tokenizer.lang_token_to_id = {t: i for t, i in zip(tokenizer.all_special_tokens, tokenizer.all_special_ids) if i > 5}
+tokenizer.lang_code_to_token = {s.strip("_"): s for s in tokenizer.lang_token_to_id}
+tokenizer.lang_code_to_id = {s.strip("_"): i for s, i in tokenizer.lang_token_to_id.items()}
+tokenizer.id_to_lang_token = {i: s for s, i in tokenizer.lang_token_to_id.items()}
 
 # translate Hindi to French
 tokenizer.src_lang = "hi"
@@ -132,7 +136,7 @@ tokenizer.src_lang = "zh"
 encoded_zh = tokenizer(chinese_text, return_tensors="pt")
 generated_tokens = model.generate(**encoded_zh, forced_bos_token_id=tokenizer.get_lang_id("en"))
 tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
-# => "Life is like a box
+# => "Life is like a chocolate box."
 ```
 
 ## Languages covered
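
The "FIX TOKENIZER!" block added above rebuilds the language lookup tables that `M2M100Tokenizer.get_lang_id` and `src_lang` depend on: every special token with an id above 5 is treated as a language token (e.g. `__en__`), and `lang_token_to_id`, `lang_code_to_token`, `lang_code_to_id`, and `id_to_lang_token` are recomputed from that list. Judging by the commit message, the stock maps shipped with this converted checkpoint don't line up with its vocabulary. Below is a minimal sketch that wraps the same fix in a reusable helper and sanity-checks it; `rebuild_lang_maps` is a name invented here (not a transformers API), and the `id > 5` cutoff mirrors the diff's own assumption about where the reserved tokens end.

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer


def rebuild_lang_maps(tokenizer: M2M100Tokenizer) -> M2M100Tokenizer:
    """Recompute the language lookup tables from the tokenizer's special tokens.

    Assumption (taken from the diff above): ids <= 5 belong to reserved tokens
    such as <s>, <pad>, </s>, <unk>; every special id above 5 is a language
    token like "__en__".
    """
    tokenizer.lang_token_to_id = {
        t: i
        for t, i in zip(tokenizer.all_special_tokens, tokenizer.all_special_ids)
        if i > 5
    }
    tokenizer.lang_code_to_token = {t.strip("_"): t for t in tokenizer.lang_token_to_id}
    tokenizer.lang_code_to_id = {t.strip("_"): i for t, i in tokenizer.lang_token_to_id.items()}
    tokenizer.id_to_lang_token = {i: t for t, i in tokenizer.lang_token_to_id.items()}
    return tokenizer


model = M2M100ForConditionalGeneration.from_pretrained("seyoungsong/flores101_mm100_175M")
tokenizer = rebuild_lang_maps(
    M2M100Tokenizer.from_pretrained("seyoungsong/flores101_mm100_175M")
)

# Sanity check: a language code should round-trip through the rebuilt maps.
assert tokenizer.id_to_lang_token[tokenizer.get_lang_id("en")] == "__en__"

# Hindi -> French, as in the README example.
tokenizer.src_lang = "hi"
encoded_hi = tokenizer("जीवन एक चॉकलेट बॉक्स की तरह है।", return_tensors="pt")
generated = model.generate(**encoded_hi, forced_bos_token_id=tokenizer.get_lang_id("fr"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```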