seyoungsong committed on
Commit 8cbba67 · verified · 1 Parent(s): 7ff7a54

add FIX TOKENIZER! instructions
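
The tokenizer fix that this commit adds to the README rebuilds four language lookup tables from the tokenizer's special tokens. The commit does not say why the rebuild is needed. The sketch below reproduces the same four dictionary comprehensions on toy data, so the resulting mappings can be inspected without downloading the model. The helper `rebuild_lang_maps`, the `LangMaps` container, and the toy token list are hypothetical names introduced only for illustration. The `i > 5` filter is copied from the diff; that ids 0 to 5 belong to non-language special tokens is an assumption.

```python
from typing import Dict, List, NamedTuple


class LangMaps(NamedTuple):
    """Hypothetical container for the four tables the README snippet rebuilds."""
    lang_token_to_id: Dict[str, int]
    lang_code_to_token: Dict[str, str]
    lang_code_to_id: Dict[str, int]
    id_to_lang_token: Dict[int, str]


def rebuild_lang_maps(special_tokens: List[str], special_ids: List[int]) -> LangMaps:
    # Keep only tokens whose id is above 5, as the README snippet does.
    # Assumption: ids 0..5 are reserved for non-language special tokens.
    lang_token_to_id = {t: i for t, i in zip(special_tokens, special_ids) if i > 5}
    # "__fr__" -> "fr": strip the surrounding underscores to get the language code.
    lang_code_to_token = {t.strip("_"): t for t in lang_token_to_id}
    lang_code_to_id = {t.strip("_"): i for t, i in lang_token_to_id.items()}
    id_to_lang_token = {i: t for t, i in lang_token_to_id.items()}
    return LangMaps(lang_token_to_id, lang_code_to_token, lang_code_to_id, id_to_lang_token)


# Toy stand-ins for tokenizer.all_special_tokens and tokenizer.all_special_ids.
toy_tokens = ["<s>", "<pad>", "</s>", "<unk>", "__en__", "__fr__", "__hi__"]
toy_ids = [0, 1, 2, 3, 6, 7, 8]

maps = rebuild_lang_maps(toy_tokens, toy_ids)
print(maps.lang_code_to_id)
# => {'en': 6, 'fr': 7, 'hi': 8}
```

With a real tokenizer, the README assigns these four dictionaries directly to the attributes of the same names. After that, `tokenizer.get_lang_id("en")` is used in the README to obtain the forced BOS token id.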

  1. README.md +9 -5
README.md CHANGED
@@ -107,8 +107,6 @@ language:
 
 # `flores101_mm100_175M`
 
-https://www.statmt.org/wmt21/large-scale-multilingual-translation-task.html
-
 `flores101_mm100_175M` is a multilingual encoder-decoder (seq-to-seq) model trained for Many-to-Many multilingual translation. It was first released in [this](https://github.com/facebookresearch/fairseq/tree/main/examples/flores101) repository.
 
 ```python
@@ -117,8 +115,14 @@ from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
 hi_text = "जीवन एक चॉकलेट बॉक्स की तरह है।"
 chinese_text = "生活就像一盒巧克力。"
 
-model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
-tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
+model = M2M100ForConditionalGeneration.from_pretrained("seyoungsong/flores101_mm100_175M")
+tokenizer: M2M100Tokenizer = M2M100Tokenizer.from_pretrained("seyoungsong/flores101_mm100_175M")
+
+# FIX TOKENIZER!
+tokenizer.lang_token_to_id = {t: i for t, i in zip(tokenizer.all_special_tokens, tokenizer.all_special_ids) if i > 5}
+tokenizer.lang_code_to_token = {s.strip("_"): s for s in tokenizer.lang_token_to_id}
+tokenizer.lang_code_to_id = {s.strip("_"): i for s, i in tokenizer.lang_token_to_id.items()}
+tokenizer.id_to_lang_token = {i: s for s, i in tokenizer.lang_token_to_id.items()}
 
 # translate Hindi to French
 tokenizer.src_lang = "hi"
@@ -132,7 +136,7 @@ tokenizer.src_lang = "zh"
 encoded_zh = tokenizer(chinese_text, return_tensors="pt")
 generated_tokens = model.generate(**encoded_zh, forced_bos_token_id=tokenizer.get_lang_id("en"))
 tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
-# => "Life is like a box of chocolate."
+# => "Life is like a chocolate box."
 ```
 
 ## Languages covered