import os
import gradio as gr
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
os.environ["TOKENIZERS_PARALLELISM"] = "false"
article='''
# Spanish Nahuatl Automatic Translation
Nahuatl is the most widely spoken indigenous language in Mexico. However, training a neural network for machine translation is challenging due to the lack of structured data. The most popular datasets, the Axolotl and bible-corpus parallel corpora, only contain ~16,000 and ~7,000 samples, respectively. Moreover, Nahuatl has multiple variants, which makes the task even harder: a single word from the Axolotl dataset can appear written in more than three different ways. Therefore, we leverage the T5 text-to-text prefix training strategy to compensate for the lack of data. We first train the multilingual model to learn Spanish and then adapt it to Nahuatl. The resulting T5 Transformer successfully translates short sentences. Finally, we report ChrF and BLEU results.
## Motivation
One of the United Nations Sustainable Development Goals is ["Reduced Inequalities"](https://www.un.org/sustainabledevelopment/inequality/). Language is one of the most powerful tools we have for sharing knowledge and experience, yet most of the progress in important areas such as technology, education, human rights, law, and news is biased by the lack of resources in different languages. We hope this approach becomes a platform that helps reduce that inequality, brings Nahuatl speakers closer to what they need to thrive, and lets them share with us their valuable knowledge, customs, and way of living.
## Model description
This model is a T5 Transformer ([t5-small](https://huggingface.co/t5-small)) fine-tuned on Spanish and Nahuatl sentences collected from the web. The dataset is normalized using 'sep' normalization from [py-elotl](https://github.com/ElotlMX/py-elotl).
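For illustration, the snippet below shows how such an orthographic normalization can be applied with py-elotl. The `Normalizer` class, its `normalize` method, and the `"sep"` scheme name are assumptions about the library's API rather than code from this project; check the py-elotl documentation before relying on them.
```python
# Hypothetical normalization sketch using py-elotl (pip install elotl).
# Class and method names are assumptions; verify against the py-elotl docs.
from elotl.nahuatl.orthography import Normalizer

normalizer = Normalizer("sep")                # 'sep' normalization scheme
print(normalizer.normalize("tlahtolli"))      # normalized spelling of a Nahuatl word
```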
## Usage
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained('hackathon-pln-es/t5-small-spanish-nahuatl')
tokenizer = AutoTokenizer.from_pretrained('hackathon-pln-es/t5-small-spanish-nahuatl')
model.eval()

sentence = 'muchas flores son blancas'
input_ids = tokenizer('translate Spanish to Nahuatl: ' + sentence, return_tensors='pt').input_ids
outputs = model.generate(input_ids)
outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
# outputs = 'miak xochitl istak'
```
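The same model can also translate several sentences at once. The following is a small extension of the example above; the `padding`, `max_length`, and `num_beams` settings are illustrative choices, not values used during training.
```python
# Batch translation sketch; padding is needed when tokenizing several sentences together.
sentences = ['muchas flores son blancas', 'quiero comer']
inputs = tokenizer(['translate Spanish to Nahuatl: ' + s for s in sentences],
                   return_tensors='pt', padding=True)
generated = model.generate(**inputs, max_length=512, num_beams=4)
translations = tokenizer.batch_decode(generated, skip_special_tokens=True)
```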
## Approach
### Dataset
Since the Axolotl corpus contains misalignments, we select the best samples (12,207). We also use the [bible-corpus](https://github.com/christos-c/bible-corpus) (7,821).
| Axolotl best aligned books |
|:-----------------------------------------------------:|
| Anales de Tlatelolco |
| Diario |
| Documentos nauas de la Ciudad de México del siglo XVI |
| Historia de México narrada en náhuatl y español |
| La tinta negra y roja (antología de poesía náhuatl) |
| Memorial Breve (Libro las ocho relaciones) |
| Método auto-didáctico náhuatl-español |
| Nican Mopohua |
| Quinta Relación (Libro las ocho relaciones) |
| Recetario Nahua de Milpa Alta D.F |
| Testimonios de la antigua palabra |
| Trece Poetas del Mundo Azteca |
| Una tortillita nomás - Se taxkaltsin saj |
| Vida económica de Tenochtitlan |
We also collected 3,000 extra samples from the web to increase the amount of data.
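As a rough illustration of how the parallel data can be assembled, the sketch below loads the Axolotl corpus through py-elotl. The `elotl.corpus.load` call and the column order of each row are assumptions about the library, and the selection of the 12,207 best-aligned samples is not reproduced here.
```python
# Hypothetical data-assembly sketch (not the exact pipeline used for this model).
# Assumes elotl.corpus.load("axolotl") returns rows containing a Spanish sentence,
# a Nahuatl sentence, and the name of the source document.
import elotl.corpus

axolotl = elotl.corpus.load("axolotl")
pairs = [(row[0], row[1]) for row in axolotl]   # (Spanish, Nahuatl); column order assumed
print(len(pairs), pairs[0])
```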
### Model and training
We employ two training stages using a multilingual T5-small. The advantage of this model is that it can handle different vocabularies and prefixes. T5-small is pre-trained on different tasks and languages (French, Romanian, English, German).
### Training-stage 1 (learning Spanish)
In training stage 1, we first introduce Spanish to the model. The goal is to learn a language that is rich in data (Spanish) without losing the previously acquired knowledge. We use the English-Spanish [Anki](https://www.manythings.org/anki/) dataset, which consists of 118,964 text pairs. The model is trained until convergence, adding the prefix "Translate Spanish to English: ".
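A minimal sketch of how such prefixed examples can be prepared with the Hugging Face tokenizer is shown below. The column names, the 128-token limit, and the use of `text_target` (available in recent transformers releases) are illustrative assumptions, not the exact preprocessing used for this model.
```python
# Sketch: turning an (English, Spanish) pair into a prefixed T5 training example.
def preprocess(example, tokenizer, max_length=128):
    # The prefix tells T5 which translation direction is being learned;
    # the encoder reads Spanish and the decoder produces English.
    source = 'Translate Spanish to English: ' + example['es']
    return tokenizer(source, text_target=example['en'],
                     max_length=max_length, truncation=True)
```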
### Training-stage 2 (learning Nahuatl)
We use the pre-trained Spanish-English model to learn Spanish-Nahuatl. Since the number of Nahuatl pairs is limited, we also add 20,000 samples from the English-Spanish Anki dataset. This two-task training avoids overfitting and makes the model more robust.
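The two-task mixture can be reproduced roughly as below; the 20,000-sample figure comes from the text above, while the function, its arguments, and the assumption that each pair already carries its task prefix are hypothetical.
```python
import random

def build_stage2_mix(nahuatl_pairs, anki_pairs, n_anki=20000, seed=0):
    """Mix all Spanish-Nahuatl pairs with a random subset of English-Spanish Anki pairs.

    Each pair is assumed to already carry its task prefix
    ("translate Spanish to Nahuatl: " or "Translate Spanish to English: ").
    """
    random.seed(seed)
    mixed = list(nahuatl_pairs) + random.sample(list(anki_pairs), n_anki)
    random.shuffle(mixed)
    return mixed
```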
### Training setup
We train the models on the same datasets for 660k steps using a batch size of 16 and a learning rate of 2e-5.
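An equivalent setup with the Hugging Face `Seq2SeqTrainer` might look like the sketch below. Only the batch size, learning rate, and step count come from the text; the output directory, save interval, and `tokenized_train` dataset are placeholders, and the actual training loop used for this model may differ.
```python
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

model = AutoModelForSeq2SeqLM.from_pretrained('t5-small')
tokenizer = AutoTokenizer.from_pretrained('t5-small')

args = Seq2SeqTrainingArguments(
    output_dir='t5-small-spanish-nahuatl',  # placeholder
    per_device_train_batch_size=16,         # batch size from the text
    learning_rate=2e-5,                     # learning rate from the text
    max_steps=660_000,                      # 660k steps from the text
    save_steps=10_000,                      # illustrative
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized_train,          # hypothetical tokenized dataset
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```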
## Evaluation results
We evaluate the models on the same 505 validation Nahuatl sentences for a fair comparison. Finally, we report the results using the ChrF and SacreBLEU Hugging Face metrics:
| English-Spanish pretraining | Validation loss | BLEU | ChrF  |
|:---------------------------:|:---------------:|:----:|:-----:|
| False                       | 1.34            | 6.17 | 26.96 |
| True                        | 1.31            | 6.18 | 28.21 |
The English-Spanish pretraining improves BLEU and ChrF and leads to faster convergence. The evaluation is available in the [eval.ipynb](https://github.com/milmor/spanish-nahuatl-translation/blob/main/eval.ipynb) notebook.
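The metrics can be computed with the Hugging Face `evaluate` library roughly as follows; the prediction and reference lists below are placeholders, and the actual evaluation lives in the linked notebook.
```python
import evaluate

chrf = evaluate.load('chrf')
sacrebleu = evaluate.load('sacrebleu')

predictions = ['miak xochitl istak']     # model outputs (placeholder)
references = [['miak xochitl istak']]    # gold translations, one list per prediction

print(sacrebleu.compute(predictions=predictions, references=references)['score'])
print(chrf.compute(predictions=predictions, references=references)['score'])
```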
## References
- Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer.
- Ximena Gutierrez-Vasques, Gerardo Sierra, and Hernandez Isaac. 2016. Axolotl: a web accessible parallel corpus for Spanish-Nahuatl. In International Conference on Language Resources and Evaluation (LREC).
- https://github.com/christos-c/bible-corpus
- https://github.com/ElotlMX/py-elotl
## Team members
- Emilio Alejandro Morales [(milmor)](https://huggingface.co/milmor)
- Rodrigo Martínez Arzate [(rockdrigoma)](https://huggingface.co/rockdrigoma)
- Luis Armando Mercado [(luisarmando)](https://huggingface.co/luisarmando)
- Jacobo del Valle [(jjdv)](https://huggingface.co/jjdv)
'''
model = AutoModelForSeq2SeqLM.from_pretrained('hackathon-pln-es/t5-small-spanish-nahuatl')
tokenizer = AutoTokenizer.from_pretrained('hackathon-pln-es/t5-small-spanish-nahuatl')
def predict(text):
    # Prepend the task prefix used during fine-tuning, then translate the input.
    input_ids = tokenizer('translate Spanish to Nahuatl: ' + text, return_tensors='pt').input_ids
    outputs = model.generate(input_ids, max_length=512)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
# Flagging setup; currently unused because allow_flagging="never" is set below.
HF_TOKEN = os.getenv('spanish-nahuatl-flagging')
hf_writer = gr.HuggingFaceDatasetSaver(HF_TOKEN, "spanish-nahuatl-flagging")
gr.Interface(
fn=predict,
inputs=gr.components.Textbox(lines=1, label="Input Text in Spanish"),
outputs=[
gr.components.Textbox(label="Translated text in Nahuatl"),
],
theme=None,
title='🌽 Spanish to Nahuatl Automatic Translation',
description='Enter your Spanish text in the left text box and you will get its Nahuatl translation in the right text box',
examples=[
'conejo',
'estrella',
'Muchos perros son blancos',
'te amo',
'quiero comer',
'esto se llama agua',
'Mi hermano es un ajolote',
'mi abuelo se llama Juan',
'El pueblo del ajolote',
'te amo con todo mi corazón'],
article=article,
allow_flagging="never",
).launch(enable_queue=True)