---
license: apache-2.0
---

<h1 align="center">mT5 small spanish es</h1>

This is a Spanish fine-tuned version of Google's mT5-small model:

https://huggingface.co/google/mt5-small


# Datasets

The following datasets were used for fine-tuning, each paired with the task prefix that is prepended to the input:

| Task | Prefix |
|------|--------|
| MultiNLI (English) | `multi nli premise:[Text]  hypo:[Text]` |
| MultiNLI (Spanish) | `multi nli premise:[Text]  hypo:[Text]` |
| PAWS-X (English) | `pawx sentence1:[Text] sentence2:[Text]` |
| PAWS-X (Spanish) | `pawx sentence1:[Text] sentence2:[Text]` |
| SQuAD (English) | `question:[Text] context:[Text]` |
| SQuAD (Spanish) | `question:[Text] context:[Text]` |
| Translations (English-Spanish) | `translate English to Spanish:[Text]` |
| Translations (Spanish-English) | `translate Spanish to English:[Text]` |
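
The `[Text]` placeholders are replaced with the raw sentences, and the prefix itself is passed verbatim. A minimal sketch with made-up example sentences (not taken from the training data):

        # NLI input built from the "multi nli" prefix above (example sentences are invented).
        premise = "El gato duerme en el sofá."
        hypothesis = "Hay un animal descansando."
        nli_input = f"multi nli premise:{premise}  hypo:{hypothesis}"

        # Paraphrase-detection input built from the "pawx" prefix.
        sentence1 = "La reunión empezó a las nueve."
        sentence2 = "La reunión comenzó a las 9."
        pawx_input = f"pawx sentence1:{sentence1} sentence2:{sentence2}"

        print(nli_input)
        print(pawx_input)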



# Inference

The following code snippets can be used to perform the different model tasks.

Translations

        from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
        
        model_name = "HURIDOCS/mt5-small-spanish-es"
        
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
        
        task = "translate Spanish to English:Esta frase es para probar el modelo"
        input_ids = tokenizer(
            [task],
            return_tensors="pt",
            padding="max_length",
            truncation=True,
            max_length=512
        )["input_ids"]
        
        output_ids = model.generate(
            input_ids=input_ids,
            max_length=84,
            no_repeat_ngram_size=2,
            num_beams=4
        )[0]
        
        result_text = tokenizer.decode(
            output_ids,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=False
        )
        
        print(result_text)


Question answering


        from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
        
        model_name = "HURIDOCS/mt5-small-spanish-es"
        
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
        
        task = '''question:¿En qué país se encuentra Normandía? context:Los normandos (normandos: Nourmann; Francés: Normandos; Normanni) 
        fue el pueblo que en los siglos X y XI dio su nombre a Normandía, una región de Francia. 
        Eran descendientes de invasores nórdicos ("normandos" viene de "Norseman") y piratas de Dinamarca, Islandia y Noruega que, 
        bajo su líder Rollo, acordaron jurar lealtad al rey Carlos III de Francia Occidental. A través de generaciones de asimilación 
        y mezcla con las poblaciones nativas francas y galas romanas, sus descendientes se fusionarían gradualmente con las culturas 
        carolingias de Francia Occidental. La identidad cultural y étnica distintiva de los normandos surgió inicialmente en la 
        primera mitad del siglo X, y continuó evolucionando durante los siglos siguientes.'''

        input_ids = tokenizer(
            [task],
            return_tensors="pt",
            padding="max_length",
            truncation=True,
            max_length=512
        )["input_ids"]
        
        output_ids = model.generate(
            input_ids=input_ids,
            max_length=84,
            no_repeat_ngram_size=2,
            num_beams=4
        )[0]
        
        result_text = tokenizer.decode(
            output_ids,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=False
        )
        
        print(result_text)
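
The remaining prefixes from the table above are used in exactly the same way. Below is a sketch for the NLI prefix, reusing the tokenizer and model loaded in the previous snippets; note that the exact label strings the model returns depend on how the targets were encoded during fine-tuning, which is not documented here, so inspect the decoded output rather than assuming fixed labels.

        # Natural language inference, reusing the tokenizer and model loaded above.
        task = "multi nli premise:El gato duerme en el sofá.  hypo:Hay un animal descansando."

        input_ids = tokenizer(
            [task],
            return_tensors="pt",
            padding="max_length",
            truncation=True,
            max_length=512
        )["input_ids"]

        output_ids = model.generate(
            input_ids=input_ids,
            max_length=84,
            num_beams=4
        )[0]

        # The decoded text is the model's verdict on the premise/hypothesis pair.
        print(tokenizer.decode(output_ids, skip_special_tokens=True))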

# Fine-tuning

Check out the Transformers library examples:

https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering
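
As a rough orientation, the sketch below shows how a seq2seq fine-tuning run for the question-answering task could be set up with `Seq2SeqTrainer`. The dataset, column handling and hyperparameters are illustrative placeholders, not the settings used to train this model.

        from datasets import load_dataset
        from transformers import (
            AutoTokenizer,
            AutoModelForSeq2SeqLM,
            DataCollatorForSeq2Seq,
            Seq2SeqTrainer,
            Seq2SeqTrainingArguments,
        )

        base_model = "google/mt5-small"
        tokenizer = AutoTokenizer.from_pretrained(base_model)
        model = AutoModelForSeq2SeqLM.from_pretrained(base_model)

        # Placeholder dataset: any QA dataset with question/context/answers columns.
        dataset = load_dataset("squad")

        def preprocess(examples):
            # Build "question:[Text] context:[Text]" inputs; use the first answer as target.
            inputs = [
                f"question:{q} context:{c}"
                for q, c in zip(examples["question"], examples["context"])
            ]
            targets = [a["text"][0] if a["text"] else "" for a in examples["answers"]]
            model_inputs = tokenizer(inputs, max_length=512, truncation=True)
            labels = tokenizer(text_target=targets, max_length=84, truncation=True)
            model_inputs["labels"] = labels["input_ids"]
            return model_inputs

        tokenized = dataset.map(
            preprocess, batched=True, remove_columns=dataset["train"].column_names
        )

        training_args = Seq2SeqTrainingArguments(
            output_dir="mt5-small-qa-finetuned",   # placeholder output path
            learning_rate=5e-4,                    # placeholder hyperparameters
            per_device_train_batch_size=8,
            num_train_epochs=3,
            predict_with_generate=True,
        )

        trainer = Seq2SeqTrainer(
            model=model,
            args=training_args,
            train_dataset=tokenized["train"],
            eval_dataset=tokenized["validation"],
            data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
            tokenizer=tokenizer,
        )

        trainer.train()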


# Performance

Spanish SQuAD v2, 512-token inputs:

| Rank | Model | Exact match | F1 |
|------|-------|-------------|----|
| 1 | mrm8488/distill-bert-base-spanish-wwm-cased | 50.43% | 71.45% |
| 2 | **mT5 small spanish es** | 48.35% | 62.03% |
| 3 | flan-t5-small | 41.44% | 56.48% |