---
license: apache-2.0
---

<h1 align="center">mT5 small spanish es</h1>

This is a Spanish fine-tuned version of Google's [mT5-small](https://huggingface.co/google/mt5-small) model.

# Datasets

The following datasets and task prefixes were used for fine-tuning (a short sketch for building these prefixed inputs follows the table):

| Dataset (language) | Task prefix |
|---|---|
| MultiNLI (English) | multi nli premise:[Text] hypo:[Text] |
| MultiNLI (Spanish) | multi nli premise:[Text] hypo:[Text] |
| PAWS-X (English) | pawx sentence1:[Text] sentence2:[Text] |
| PAWS-X (Spanish) | pawx sentence1:[Text] sentence2:[Text] |
| SQuAD (English) | question:[Text] context:[Text] |
| SQuAD (Spanish) | question:[Text] context:[Text] |
| Translation (English-Spanish) | translate English to Spanish:[Text] |
| Translation (Spanish-English) | translate Spanish to English:[Text] |
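
The sketch below is not part of this repository; `format_task` is a hypothetical helper that simply assembles the prefixed input strings listed in the table above.

```python
# Hypothetical helper (not shipped with the model): builds the prefixed
# input strings the model expects, following the table above.
def format_task(task: str, **fields: str) -> str:
    if task == "nli":
        return f"multi nli premise:{fields['premise']} hypo:{fields['hypothesis']}"
    if task == "paraphrase":
        return f"pawx sentence1:{fields['sentence1']} sentence2:{fields['sentence2']}"
    if task == "qa":
        return f"question:{fields['question']} context:{fields['context']}"
    if task == "translate_es_en":
        return f"translate Spanish to English:{fields['text']}"
    if task == "translate_en_es":
        return f"translate English to Spanish:{fields['text']}"
    raise ValueError(f"Unknown task: {task}")


# Example: build a question-answering input
print(format_task("qa",
                  question="¿Dónde está Normandía?",
                  context="Normandía es una región de Francia."))
```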

# Inference

The following code can be used to perform the different model tasks.

## Translations

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "HURIDOCS/mt5-small-spanish-es"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

task = "translate Spanish to English:Esta frase es para probar el modelo"

input_ids = tokenizer(
    [task],
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=512
)["input_ids"]

output_ids = model.generate(
    input_ids=input_ids,
    max_length=84,
    no_repeat_ngram_size=2,
    num_beams=4
)[0]

result_text = tokenizer.decode(
    output_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)

print(result_text)
```
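
The same task can also be run through the `text2text-generation` pipeline. This is a convenience sketch rather than part of the original card; it forwards the same generation parameters used above.

```python
from transformers import pipeline

# The text2text-generation pipeline wraps the tokenize -> generate -> decode
# steps shown above.
translator = pipeline("text2text-generation", model="HURIDOCS/mt5-small-spanish-es")

result = translator(
    "translate Spanish to English:Esta frase es para probar el modelo",
    max_length=84,
    num_beams=4,
)
print(result[0]["generated_text"])
```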

## Question answering

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "HURIDOCS/mt5-small-spanish-es"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

task = '''question:¿En qué país se encuentra Normandía? context:Los normandos (normandos: Nourmann; Francés: Normandos; Normanni)
fue el pueblo que en los siglos X y XI dio su nombre a Normandía, una región de Francia.
Eran descendientes de invasores nórdicos ("normandos" viene de "Norseman") y piratas de Dinamarca, Islandia y Noruega que,
bajo su líder Rollo, acordaron jurar lealtad al rey Carlos III de Francia Occidental. A través de generaciones de asimilación
y mezcla con las poblaciones nativas francas y galas romanas, sus descendientes se fusionarían gradualmente con las culturas
carolingias de Francia Occidental. La identidad cultural y étnica distintiva de los normandos surgió inicialmente en la
primera mitad del siglo X, y continuó evolucionando durante los siglos siguientes.'''

input_ids = tokenizer(
    [task],
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=512
)["input_ids"]

output_ids = model.generate(
    input_ids=input_ids,
    max_length=84,
    no_repeat_ngram_size=2,
    num_beams=4
)[0]

result_text = tokenizer.decode(
    output_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)

print(result_text)
```
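
## Natural language inference

The `multi nli` and `pawx` prefixes from the table above follow the same pattern. The snippet below is an illustrative sketch: the premise/hypothesis strings are made up, and the exact label strings the model emits depend on how the NLI targets were encoded during fine-tuning, which the card does not specify.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "HURIDOCS/mt5-small-spanish-es"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Illustrative premise/hypothesis pair using the multi nli prefix
task = "multi nli premise:El modelo fue entrenado con textos en español. hypo:El modelo fue entrenado con texto."

input_ids = tokenizer(
    [task],
    return_tensors="pt",
    truncation=True,
    max_length=512
)["input_ids"]

output_ids = model.generate(input_ids=input_ids, max_length=84, num_beams=4)[0]

print(tokenizer.decode(output_ids, skip_special_tokens=True))
```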

# Fine-tuning

Check out the Transformers library examples:

https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering
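
For reference, a minimal fine-tuning loop could look like the sketch below. This is not the exact recipe used to train this checkpoint: the toy example pair, output directory, and hyperparameters are illustrative, and it assumes a recent `transformers`/`datasets` installation.

```python
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Start from the base multilingual checkpoint
model_name = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Toy data following the task-prefix format described above (illustrative only)
raw = Dataset.from_dict({
    "source": ["translate Spanish to English:Hola mundo"],
    "target": ["Hello world"],
})

def preprocess(batch):
    # Tokenize prefixed inputs and target sequences
    model_inputs = tokenizer(batch["source"], truncation=True, max_length=512)
    labels = tokenizer(text_target=batch["target"], truncation=True, max_length=84)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = raw.map(preprocess, batched=True, remove_columns=raw.column_names)

# Illustrative hyperparameters, not the ones used for this checkpoint
training_args = Seq2SeqTrainingArguments(
    output_dir="mt5-small-spanish-es-finetuned",
    per_device_train_batch_size=8,
    learning_rate=3e-4,
    num_train_epochs=1,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```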

# Performance

Results on Spanish SQuAD v2 (512-token inputs):

| Rank | Model | Exact match | F1 |
|---|---|---|---|
| 1 | mrm8488/distill-bert-base-spanish-wwm-cased | 50.43% | 71.45% |
| 2 | **mT5 small spanish es** | 48.35% | 62.03% |
| 3 | flan-t5-small | 41.44% | 56.48% |