---
language:
- aa
- af
- am
- ak
- bm
- ff
- fon
- ha
- ig
- ki
- lg
- ln
- mg
- nr
- om
- rn
- run
- sw
- sn
- tn
- ti
- ve
- wo
- xh
- yo
- zu
pipeline_tag: text-generation
tags:
- UBC
- African
- pytorch
- Cheetah
- DLNLP
extra_gated_fields:
  First Name: text
  Last Name: text
  Country: country
  Affiliation: text
  Job title:
    type: select
    options:
      - Student
      - Research Graduate
      - AI researcher
      - AI developer/engineer
      - Reporter
      - Other
  I agree to use this model for non-commercial use ONLY: checkbox
  I agree to cite the Cheetah paper: checkbox
  geo: ip_location
  By clicking Submit below I accept the terms of the license: checkbox
extra_gated_button_content: Submit
---

<div style='text-align: justify;'>

This is the repository accompanying our ACL 2024 paper [Toucan: Many-to-Many Translation for 150 African Language Pairs](https://aclanthology.org/2024.findings-acl.781/).
We address a notable gap in Natural Language Processing (NLP) by introducing a collection of resources designed to improve Machine Translation (MT) for low-resource languages, with a specific focus on African languages. First, we introduce two language models (LMs), Cheetah-1.2B and Cheetah-3.7B, with 1.2 billion and 3.7 billion parameters, respectively. Next, we finetune these models to create Toucan, an Afrocentric machine translation model designed to support 156 African language pairs. To evaluate Toucan, we carefully develop an extensive machine translation benchmark, dubbed AfroLingu-MT. Toucan significantly outperforms other models, showcasing its remarkable performance on MT for African languages. Finally, we train a new model, spBLEU_1K, to enhance translation evaluation metrics, covering 1K languages, including 614 African languages. This work aims to advance the field of NLP, fostering cross-cultural understanding and knowledge exchange, particularly in regions with limited language resources such as Africa.

</div>

## Models

<div style='text-align: justify;'>

To effectively train an MT language model for African languages, it is crucial to start with a powerful, Afrocentric pretrained language model. For this purpose, we select Cheetah (Adebara et al., 2024), a recently introduced SoTA model with extensive coverage encompassing 517 African languages. One limitation of Cheetah, however, is that it is available only in a base architecture, featuring 580M parameters. Given our objective to develop a large-scale language model for machine translation capable of serving 156 directions, this base model does not fully meet our requirements. To address this limitation, we train larger and more expansive Afrocentric sequence-to-sequence models. We focus on two sizes: one model with 1.2B parameters and another with 3.7B parameters. We refer to the new models as “Cheetah-1.2B” and “Cheetah-3.7B”, respectively, to reflect their enhanced capabilities and parameter scale. These models represent a significant advancement in our efforts to improve machine translation for African languages, offering greater capacity to handle the rich linguistic nuances of African languages.

**Cheetah Pretraining.** To train the new Cheetah models, we utilize the same pretraining dataset employed in training the original Cheetah-base model (Adebara et al., 2024). This strategic choice ensures consistency in the foundational data across models, enabling the advanced Cheetah-1.2B and Cheetah-3.7B versions to build upon the rich linguistic diversity captured in the original dataset. We refer to (Adebara et al., 2024) for more information about the pretraining data of the Cheetah models. We employ a learning rate of 0.01, a batch size of 1,024 sequences, and a maximum sequence length of 1,024. Each model undergoes pretraining for 1 million steps. The training is conducted on a Google Cloud TPU with 128 cores (v3-128) provided by the TensorFlow Research Cloud (TFRC). We provide additional details on pretraining in Section B of the Appendix.

</div>

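For quick reference, the sketch below simply collects the pretraining hyperparameters stated above into a plain Python dictionary; the key names are illustrative only and are not taken from the authors' actual training configuration or any specific framework.

```python
# Pretraining hyperparameters reported above, gathered here for reference.
# Key names are illustrative and do not follow any particular framework's schema.
CHEETAH_PRETRAINING_CONFIG = {
    "learning_rate": 0.01,
    "batch_size": 1024,           # sequences per batch
    "max_sequence_length": 1024,  # tokens
    "train_steps": 1_000_000,
    "hardware": "Google Cloud TPU v3-128 (128 cores)",
    "model_sizes": {"cheetah-1.2B": "1.2B parameters", "cheetah-3.7B": "3.7B parameters"},
}
```
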
- Please refer to the list of [**supported languages**](https://github.com/UBC-NLP/Cheetah/blob/main/supported-languages.txt).
- For more details about Cheetah's pretraining data, visit the [**Cheetah GitHub repository**](https://github.com/UBC-NLP/Cheetah).
- For more details about Toucan's pretraining data, visit the [**Toucan GitHub repository**](https://github.com/UBC-NLP/Toucan).

| **Cheetah Models** | **Link** |
|---------|:------------------:|
| 🔥**Cheetah-base**🔥 | [https://huggingface.co/UBC-NLP/cheetah-base](https://huggingface.co/UBC-NLP/cheetah-base) |
| 🔥**Cheetah-1.2B**🔥 | [https://huggingface.co/UBC-NLP/cheetah-1.2B](https://huggingface.co/UBC-NLP/cheetah-1.2B) |
| 🔥**Cheetah-3.7B**🔥 | TBA |

| **Toucan Models** | **Link** |
|---------|:------------------:|
| 🔥**Toucan-base**🔥 | [https://huggingface.co/UBC-NLP/toucan-base](https://huggingface.co/UBC-NLP/toucan-base) |
| 🔥**Toucan-1.2B**🔥 | [https://huggingface.co/UBC-NLP/toucan-1.2B](https://huggingface.co/UBC-NLP/toucan-1.2B) |
| 🔥**Toucan-3.7B**🔥 | TBA |

# How to use the Cheetah-1.2B model

Below is an example of using **Cheetah-1.2B** to predict a masked token.
```python
from transformers import T5Tokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and the seq2seq model from the Hugging Face Hub.
tokenizer = T5Tokenizer.from_pretrained("UBC-NLP/cheetah-1.2B")
model = AutoModelForSeq2SeqLM.from_pretrained("UBC-NLP/cheetah-1.2B")

# Yoruba prompt with a T5 sentinel token (<extra_id_0>) marking the masked span.
yor_prompt = "ìròyìn kan nípa owó ìjọba <extra_id_0> kan"

input_ids = tokenizer(yor_prompt, return_tensors="pt").input_ids
outputs = model.generate(input_ids)
print("Cheetah-1.2B - Tokenized input:", tokenizer.tokenize(yor_prompt))
print("Cheetah-1.2B - Decoded output:", tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Output:
```bash
Cheetah-1.2B - Tokenized input: ['▁ìròyìn', '▁kan', '▁nípa', '▁owó', '▁ìjọba', '<extra_id_0>', '▁kan']
Cheetah-1.2B - Decoded output: Nàìjíríà
```
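
If a prompt contains more than one masked span, the standard T5 sentinel convention applies: number the sentinels `<extra_id_0>`, `<extra_id_1>`, ... and read the fills back from the decoded output. The sketch below is a minimal illustration under that assumption; the prompt and generation settings are ours, not from the paper, and the output is omitted since it depends on the model.

```python
from transformers import T5Tokenizer, AutoModelForSeq2SeqLM

tokenizer = T5Tokenizer.from_pretrained("UBC-NLP/cheetah-1.2B")
model = AutoModelForSeq2SeqLM.from_pretrained("UBC-NLP/cheetah-1.2B")

# Illustrative prompt with two masked spans (the exact text is only an example).
prompt = "ìròyìn kan nípa <extra_id_0> ìjọba <extra_id_1> kan"

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
# Beam search usually gives more stable fills than greedy decoding.
outputs = model.generate(input_ids, num_beams=5, max_new_tokens=20)

# Keep special tokens so the <extra_id_*> sentinels show which span each fill belongs to.
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```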

# How to use the Toucan models

To translate with the Toucan models, prefix the input with the ISO-3 code of the target language. The supported languages are listed below:
```python
lang_names = {
    "aar": "Afar",
    "ach": "Acholi",
    "afr": "Afrikaans",
    "aka": "Akan",
    "amh": "Amharic",
    "bam": "Bambara",
    "bas": "Basaa",
    "bem": "Bemba",
    "btg": "Bete Gagnoa",
    "eng": "English",
    "ewe": "Ewe",
    "fon": "Fon",
    "fra": "French",
    "hau": "Hausa",
    "ibo": "Igbo",
    "kbp": "Kabiye",
    "lgg": "Lugbara",
    "lug": "Luganda",
    "mlg": "Malagasy",
    "nyn": "Nyankore",
    "orm": "Oromo",
    "som": "Somali",
    "sot": "Sesotho",
    "swa": "Swahili",
    "tir": "Tigrinya",
    "yor": "Yoruba",
    "teo": "Ateso",
    "gez": "Geez",
    "wal": "Wolaytta",
    "fan": "Fang",
    "kau": "Kanuri",
    "kin": "Kinyarwanda",
    "kon": "Kongo",
    "lin": "Lingala",
    "nya": "Chichewa",
    "pcm": "Nigerian Pidgin",
    "ssw": "Siswati",
    "tsn": "Setswana",
    "tso": "Tsonga",
    "twi": "Twi",
    "wol": "Wolof",
    "xho": "Xhosa",
    "zul": "Zulu",
    "nnb": "Nande",
    "swc": "Swahili Congo",
    "ara": "Arabic"
}
```
Below is an example of translating with **Toucan-1.2B**.
```python
from transformers import AutoTokenizer, MT5ForConditionalGeneration
import torch

# Load the tokenizer and the model (half precision, automatically placed on the available GPU(s)).
tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/toucan-1.2B")
model = MT5ForConditionalGeneration.from_pretrained("UBC-NLP/toucan-1.2B", torch_dtype=torch.float16, device_map="auto")
model.eval()

# Translate from English to Zulu: the "zul:" prefix selects the target language.
text = "zul: Clear all items from the recent documents list"
inputs = tokenizer(text, return_tensors="pt", max_length=1024, truncation=True).to("cuda:0")
with torch.no_grad():
    generated_ids = model.generate(**inputs, num_beams=5, max_new_tokens=len(text), do_sample=True, temperature=0.6, top_p=0.9)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
```
Output:
```bash
Vala zonke izinto kusuka kwihlu lamadokhumende elidlule
```
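
To translate into another target language, the same pattern applies: pick the ISO-3 code from `lang_names` above and prepend it to the source text. A small convenience wrapper along these lines may help; note that the `translate` helper below is ours, not part of the released code, and it assumes `tokenizer`, `model`, and `lang_names` from the snippets above are already defined.

```python
def translate(text: str, tgt_code: str, max_new_tokens: int = 256) -> str:
    """Translate `text` into the language identified by the ISO-3 code `tgt_code` (e.g. "yor", "hau")."""
    # Assumes `tokenizer`, `model`, and `lang_names` from the snippets above are already defined.
    assert tgt_code in lang_names, f"Unsupported target language code: {tgt_code}"
    inputs = tokenizer(f"{tgt_code}: {text}", return_tensors="pt", max_length=1024, truncation=True).to(model.device)
    with torch.no_grad():
        generated_ids = model.generate(**inputs, num_beams=5, max_new_tokens=max_new_tokens)
    return tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Example: English -> Yoruba (output not shown; it depends on the model).
print(translate("Clear all items from the recent documents list", "yor"))
```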

## Citation
If you use the pre-trained model (Cheetah-1.2B) in your scientific publication, or if you find the resources in this repository useful, please cite our papers as follows (to be updated):

**Cheetah's Paper**
```
@inproceedings{adebara-etal-2024-cheetah,
    title = "Cheetah: Natural Language Generation for 517 {A}frican Languages",
    author = "Adebara, Ife and
      Elmadany, AbdelRahim and
      Abdul-Mageed, Muhammad",
    editor = "Ku, Lun-Wei and
      Martins, Andre and
      Srikumar, Vivek",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand and virtual meeting",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-long.691",
    pages = "12798--12823",
}
```

**Toucan's Paper**
```
@inproceedings{elmadany-etal-2024-toucan,
    title = "Toucan: Many-to-Many Translation for 150 {A}frican Language Pairs",
    author = "Elmadany, AbdelRahim and
      Adebara, Ife and
      Abdul-Mageed, Muhammad",
    editor = "Ku, Lun-Wei and
      Martins, Andre and
      Srikumar, Vivek",
    booktitle = "Findings of the Association for Computational Linguistics ACL 2024",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand and virtual meeting",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-acl.781",
    pages = "13189--13206",
}
```

## Acknowledgments
We gratefully acknowledge support from the Canada Research Chairs (CRC) program, the Natural Sciences and Engineering Research Council of Canada (NSERC; RGPIN-2018-04267), the Social Sciences and Humanities Research Council of Canada (SSHRC; 435-2018-0576; 895-2020-1004; 895-2021-1008), the Canada Foundation for Innovation (CFI; 37771), the [Digital Research Alliance of Canada](https://alliancecan.ca), [UBC ARC-Sockeye](https://arc.ubc.ca/ubc-arc-sockeye), Advanced Micro Devices, Inc. (AMD), and Google. Any opinions, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of CRC, NSERC, SSHRC, CFI, the Alliance, AMD, Google, or UBC ARC-Sockeye.