app.py

import gradio as gr
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
article='''
# Spanish Nahuatl Automatic Translation
Nahuatl is the most widely spoken indigenous language in Mexico. However, training a neural network for neural machine translation is hard due to the lack of structured data. The most popular datasets, the Axolotl dataset and the bible-corpus, contain only ~16,000 and ~7,000 samples respectively. Moreover, there are multiple variants of Nahuatl, which makes the task even more difficult: a single word from the Axolotl dataset can be found written in more than three different ways. Therefore, in this work we leverage the T5 text-to-text prefix training strategy to compensate for the lack of data. We first teach the multilingual model Spanish using English, then we make the transition to Spanish-Nahuatl. The resulting model successfully translates short sentences from Spanish to Nahuatl. We report chrF and BLEU results.
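
To make the two-stage strategy concrete, here is a minimal sketch of how training pairs can be formatted for both stages. The prefix wording and the example sentences are illustrative assumptions, not the exact training data:

```python
# Two-stage text-to-text prefix strategy: the same model sees both tasks,
# distinguished only by a natural-language prefix in the input.
def to_t5_example(src: str, tgt: str, prefix: str) -> dict:
    return {"input_text": f"{prefix}: {src}", "target_text": tgt}

# Stage 1: abundant English-Spanish pairs teach the multilingual model Spanish.
stage1 = to_t5_example("many flowers are white", "muchas flores son blancas",
                       "translate English to Spanish")

# Stage 2: the scarce Spanish-Nahuatl pairs reuse the same input format,
# so the model transfers what it learned about Spanish to the new task.
stage2 = to_t5_example("muchas flores son blancas", "miak xochitl istak",
                       "translate Spanish to Nahuatl")
```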

## Motivation

One of the Sustainable Development Goals is "Reduced Inequalities". We know for sure that language is one of the factors that deepens inequality: speakers of indigenous languages have far less access to digital tools and information in their own language. Automatic translation for Nahuatl is a small step toward reducing that gap.

## Model description
This model is a T5 Transformer ([t5-small](https://huggingface.co/t5-small)) fine-tuned on Spanish and Nahuatl sentences collected from the web. The dataset is normalized using the 'sep' normalization from [py-elotl](https://github.com/ElotlMX/py-elotl).
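
As a rough sketch of that preprocessing step (assuming the Normalizer API shown in the py-elotl README; the example word and its output are illustrative):

```python
# Normalize Nahuatl text to the SEP orthography before training.
from elotl.nahuatl.orthography import Normalizer

normalizer = Normalizer("sep")  # 'sep' is one of py-elotl's normalization policies
print(normalizer.normalize("amoxcalli"))  # e.g. 'amoxkali' under SEP conventions
```

Normalizing to a single orthography collapses the many spellings of the same word mentioned above into one form, which effectively enlarges the usable training data.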

Since the Axolotl corpus contains misalignments, we select only the best samples:

| Axolotl best samples |
|:-----------------------------------------------------:|
| Anales de Tlatelolco |
| Diario |
| Documentos nauas de la Ciudad de México del siglo XVI |
| Historia de México narrada en náhuatl y español |
| La tinta negra y roja (antología de poesía náhuatl) |
| Memorial Breve (Libro las ocho relaciones) |
| Método auto-didáctico náhuatl-español |
| Nican Mopohua |
| Quinta Relación (Libro las ocho relaciones) |
| Recetario Nahua de Milpa Alta D.F |
| Testimonios de la antigua palabra |
| Trece Poetas del Mundo Azteca |
| Una tortillita nomás - Se taxkaltsin saj |
| Vida económica de Tenochtitlan |

Also, to increase the amount of data, we collected 3,000 extra samples from the web.

For a fair comparison, the models are evaluated on the same 505 validation Nahuatl sentences.

The English-Spanish pretraining improves BLEU and chrF scores and leads to faster convergence.
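
For reference, both metrics can be computed with sacrebleu; the hypothesis and reference lists below are placeholders standing in for the model outputs and the 505 gold sentences:

```python
# Corpus-level BLEU and chrF, the two metrics reported above.
import sacrebleu

hypotheses = ["miak xochitl istak"]    # model translations (placeholder)
references = [["miak xochitl istak"]]  # one reference stream (placeholder)

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}  chrF: {chrf.score:.2f}")
```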

## Team members
- Emilio Alejandro Morales [(milmor)](https://huggingface.co/milmor)
- Rodrigo Martínez Arzate [(rockdrigoma)](https://huggingface.co/rockdrigoma)
- Luis Armando Mercado [(luisarmando)](https://huggingface.co/luisarmando)
- Jacobo del Valle [(jjdv)](https://huggingface.co/jjdv)

## References
- Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer.
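
Since only the `article` string appears in this diff, here is a minimal sketch of how it typically plugs into the rest of app.py; the checkpoint id, task prefix, and generation parameters are assumptions for illustration:

```python
import gradio as gr
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "hackathon-pln-es/t5-small-spanish-nahuatl"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def translate(text: str) -> str:
    # T5 selects the task from a natural-language prefix (assumed wording).
    inputs = tokenizer("translate Spanish to Nahuatl: " + text, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=128)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# The markdown string defined above is rendered below the demo as its article.
gr.Interface(fn=translate, inputs="text", outputs="text", article=article).launch()
```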
|