license: apache-2.0

mT5 small spanish es

This is a Spanish fine-tuned version of Google's mT5-small model.

    https://huggingface.co/google/mt5-small

Datasets

The model was fine-tuned on the following datasets. Each task is selected by the prefix placed at the start of the input text:

    Task                                    Prefix
    MultiNLI (English)                      multi nli premise:[Text]  hypo:[Text]
    MultiNLI (Spanish)                      multi nli premise:[Text]  hypo:[Text]
    PAWS-X (English)                        pawx sentence1:[Text] sentence2:[Text]
    PAWS-X (Spanish)                        pawx sentence1:[Text] sentence2:[Text]
    SQuAD (English)                         question:[Text] context:[Text]
    SQuAD (Spanish)                         question:[Text] context:[Text]
    Translations (English-Spanish)          translate English to Spanish:[Text]
    Translations (Spanish-English)          translate Spanish to English:[Text]
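
The prefixes above can be assembled programmatically. The following is a minimal sketch: the `TEMPLATES` dictionary, its keys, and the `build_input` function are hypothetical names introduced here for illustration, and the prefix strings (including their spacing) are copied verbatim from the table.

```python
# Prefix templates copied from the table above.
# The dictionary keys and the function name are illustrative, not part of the model.
TEMPLATES = {
    "multinli": "multi nli premise:{a}  hypo:{b}",
    "pawx": "pawx sentence1:{a} sentence2:{b}",
    "squad": "question:{a} context:{b}",
    "en_to_es": "translate English to Spanish:{a}",
    "es_to_en": "translate Spanish to English:{a}",
}


def build_input(task, first, second=""):
    """Return the prefixed input string for one of the tasks listed above."""
    template = TEMPLATES[task]
    # Two-field tasks (NLI, paraphrase, question answering) need both texts.
    if "{b}" in template and not second:
        raise ValueError(f"task '{task}' needs two text fields")
    return template.format(a=first, b=second)


print(build_input("es_to_en", "Esta frase es para probar el modelo"))
print(build_input("squad", "En qué país se encuentra Normandía?", "Los normandos ..."))
```

The resulting string can be passed to the tokenizer in place of a hand-written `task` value.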

Inference

The following code can be used to perform the different model tasks.

Translations

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
    
    model_name = "HURIDOCS/mt5-small-spanish-es"
    
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    
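    # The prefix selects the task; here, Spanish-to-English translation.
    task = "translate Spanish to English:Esta frase es para probar el modelo"
    inputs = tokenizer(
        [task],
        return_tensors="pt",
        padding="max_length",
        truncation=True,
        max_length=512
    )
    
    # Pass the attention mask explicitly so that padding tokens are ignored.
    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=84,
        no_repeat_ngram_size=2,
        num_beams=4
    )[0]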
    
    result_text = tokenizer.decode(
        output_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )
    
    print(result_text)

Question answering

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
    
    model_name = "HURIDOCS/mt5-small-spanish-es"
    
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    
    task = '''question:En qué país se encuentra Normandía? context:Los normandos (normandos: Nourmann; Francés: Normandos; Normanni) 
    fue el pueblo que en los siglos X y XI dio su nombre a Normandía, una región de Francia. 
    Eran descendientes de invasores nórdicos ("normandos" viene de "Norseman") y piratas de Dinamarca, Islandia y Noruega que, 
    bajo su líder Rollo, acordaron jurar lealtad al rey Carlos III de Francia Occidental. A través de generaciones de asimilación 
    y mezcla con las poblaciones nativas francas y galas romanas, sus descendientes se fusionarían gradualmente con las culturas 
    carolingias de Francia Occidental. La identidad cultural y étnica distintiva de los normandos surgió inicialmente en la 
    primera mitad del siglo X, y continuó evolucionando durante los siglos siguientes.'''

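    # Inputs longer than 512 tokens are truncated, so a long context may lose its tail.
    inputs = tokenizer(
        [task],
        return_tensors="pt",
        padding="max_length",
        truncation=True,
        max_length=512
    )
    
    # Pass the attention mask explicitly so that padding tokens are ignored.
    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=84,
        no_repeat_ngram_size=2,
        num_beams=4
    )[0]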
    
    result_text = tokenizer.decode(
        output_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )
    
    print(result_text)

Fine-tuning

Check out the Transformers library examples:

https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering
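
Before fine-tuning a text-to-text model on question answering, each record has to be turned into a source/target text pair. The sketch below assumes records shaped like the Hugging Face `squad_v2` dataset (`question`, `context`, and `answers` with a `text` list). The function name `to_seq2seq_pair` is hypothetical, and using an empty target for unanswerable questions is an assumption, not necessarily how this model was trained.

```python
def to_seq2seq_pair(record):
    """Turn one SQuAD-style record into a (source, target) text pair."""
    # The source uses the question-answering prefix listed in the Datasets section.
    source = f"question:{record['question']} context:{record['context']}"
    # SQuAD v2 contains unanswerable questions, whose answer list is empty.
    answers = record.get("answers", {}).get("text", [])
    # Assumption: take the first reference answer, or an empty string if there is none.
    target = answers[0] if answers else ""
    return source, target


example = {
    "question": "En qué país se encuentra Normandía?",
    "context": "Normandía es una región de Francia.",
    "answers": {"text": ["Francia"], "answer_start": [28]},
}
source, target = to_seq2seq_pair(example)
print(source)
print(target)
```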

Performance

Spanish SQuAD v2, 512 tokens

    Rank    Model                                            Exact match     F1
    1       mrm8488/distill-bert-base-spanish-wwm-cased      50.43%          71.45%
    2       mT5 small spanish es (this model)                48.35%          62.03%
    3       flan-t5-small                                    41.44%          56.48%
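
For readers unfamiliar with the two metrics, here is a simplified sketch of how exact match and token-level F1 are typically computed for SQuAD-style answers. It is not the official evaluation script and not necessarily the code used to produce the numbers above. The normalization (lowercasing, dropping punctuation) is an assumption, and the function names are illustrative.

```python
import string
from collections import Counter

# Assumption: ASCII punctuation plus the Spanish opening marks is ignored.
PUNCTUATION = set(string.punctuation) | {"¿", "¡"}


def normalize(text):
    """Lowercase, drop punctuation, and split on whitespace."""
    cleaned = "".join(ch for ch in text.lower() if ch not in PUNCTUATION)
    return cleaned.split()


def exact_match(prediction, reference):
    """1.0 if the normalized answers are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction, reference):
    """Harmonic mean of token precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty (an unanswerable question answered as such) counts as a match.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Francia.", "francia"))
print(round(f1_score("en Francia", "Francia"), 2))
```

F1 gives partial credit when the predicted span overlaps the reference, which is why it is higher than exact match for every model in the table.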