---
language:
- af
- am
- ar
- az
- be
- bg
- bn
- ca
- ceb
- co
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fil
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- haw
- he
- hi
- hmn
- ht
- hu
- hy
- id
- ig
- is
- it
- iw
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lb
- lo
- lt
- lv
- mg
- mi
- mk
- ml
- mn
- mr
- ms
- mt
- my
- ne
- nl
- 'no'
- ny
- pa
- pl
- ps
- pt
- ro
- ru
- sd
- si
- sk
- sl
- sm
- sn
- so
- sq
- sr
- st
- su
- sv
- sw
- ta
- te
- tg
- th
- tr
- uk
- und
- ur
- uz
- vi
- xh
- yi
- yo
- zh
- zu
license: mit
datasets:
- mc4
---

# MyT5



## Model Details

MyT5 (**My**te **T5**) is a multilingual language model based on the T5 architecture.
The model uses the **m**orphologically-driven **byte** (**MYTE**) representation described in our paper, [Limisiewicz et al., 2024](https://arxiv.org/pdf/2403.10691.pdf).

### Model Description

- **Developed by:** Tomasz Limisiewicz, Terra Blevins, Hila Gonen, Orevaoghene Ahia, Luke Zettlemoyer
- **Funded by:** University of Washington Fellowship, Charles University Grant Agency
- **Model type:** T5
- **Language(s) (NLP):** Multilingual
- **License:** MIT

### Model Sizes

- **[Small](https://huggingface.co/Tomlim/myt5-small)**: 300M parameters
- **[Base](https://huggingface.co/Tomlim/myt5-base)**: 582M parameters
- **[Large](https://huggingface.co/Tomlim/myt5-large)**: 1.2B parameters
  
### Model Sources 

- **[Repository](https://github.com/tomlimi/MYTE)** 
- **[Paper](https://arxiv.org/pdf/2403.10691.pdf)** 

## How to Get Started with the Model

The snippet below shows the basic usage of the model for multilingual language modeling.
The custom tokenizer is available in the [GitHub repository](https://github.com/tomlimi/MYTE), in `src/myt5/myt5_tokenizer.py`.
We also plan to release it on HuggingFace in the future.

```python
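import torch
from transformers import T5ForConditionalGeneration

# MyT5Tokenizer lives in the MYTE repository (src/myt5/myt5_tokenizer.py);
# run this from the repository root so that the import resolves.
from src.myt5.myt5_tokenizer import MyT5Tokenizer

MODEL_SIZE = "large"  # small, base, or large

# The repository ID follows the links listed under "Model Sizes".
model = T5ForConditionalGeneration.from_pretrained(f"Tomlim/myt5-{MODEL_SIZE}", use_safetensors=True)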
tokenizer = MyT5Tokenizer()

pre_texts = ['"We now have',
            '„Mamy teraz myszy w wieku',
            '"""எங்களிடம் இப்போது']
post_texts = ['4-month-old mice that are non-diabetic that used to be diabetic," he added.',
              '4 miesięcy, które miały cukrzycę, ale zostały z niej wyleczone” – dodał.',
              '4-மாத-வயதுடைய எலி ஒன்று உள்ளது, முன்னர் அதற்கு நீரிழிவு இருந்தது தற்போது இல்லை"" என்று அவர் மேலும் கூறினார்."']

inputs = tokenizer(pre_texts, padding="longest", return_tensors="pt")
targets = tokenizer(post_texts, padding="longest", return_tensors="pt")


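# Teacher-forced forward pass; logits have shape (batch, target_length, vocab_size).
with torch.no_grad():
    outputs = model(**inputs, labels=targets.input_ids)
# Probability distribution over the vocabulary at every target position.
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)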
```
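
The `probs` tensor above holds a full distribution over the vocabulary at each target position. To score a sentence, one usually needs only the probability assigned to each gold target token. The dependency-free sketch below shows that selection on plain Python lists. The function names and the padding ID of 0 (the ByT5-style default) are assumptions for illustration and are not part of the MyT5 code; check the tokenizer you use.

```python
import math


def softmax(logits_row):
    """Numerically stable softmax over one vocabulary-sized row of logits."""
    peak = max(logits_row)
    exps = [math.exp(x - peak) for x in logits_row]
    total = sum(exps)
    return [e / total for e in exps]


def target_token_probs(logits, target_ids, pad_id=0):
    """Return the probability assigned to each non-padding target token.

    logits: one row of vocabulary logits per target position.
    target_ids: gold token IDs, same length as logits.
    pad_id: assumed padding ID (hypothetical default, verify for your tokenizer).
    """
    return [
        softmax(row)[token_id]
        for row, token_id in zip(logits, target_ids)
        if token_id != pad_id
    ]


# Toy example: vocabulary of 3 tokens, 2 real target positions and 1 padding position.
toy_logits = [[0.0, 0.0, 0.0], [0.0, math.log(3.0), 0.0], [1.0, 0.0, 0.0]]
toy_targets = [1, 1, 0]
toy_probs = target_token_probs(toy_logits, toy_targets)
```

With PyTorch tensors the same selection is typically done in one call with `torch.gather` along the vocabulary dimension, followed by masking the padding positions.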

## Training Details

### Training Data

The model was trained on the standard T5 task of restoring corrupted spans, using the multilingual mC4 dataset.

### Preprocessing

Instead of UTF-8 bytes, we used a morphologically-driven byte representation (MYTE).
See the description in our [paper](https://arxiv.org/pdf/2403.10691.pdf) for more details.
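
As a rough illustration of why the byte representation matters, the sketch below counts how many UTF-8 bytes a word takes in each of the three languages used in the usage example. It does not implement MYTE; it only shows the disparity in plain UTF-8 sequence lengths that a byte-level model has to process. The helper name `utf8_bytes_per_char` is hypothetical.

```python
def utf8_bytes_per_char(text):
    """Average number of UTF-8 bytes needed per character of `text`."""
    return len(text.encode("utf-8")) / len(text)


# Words taken from the example sentences in the usage snippet.
samples = {
    "English": "mice",
    "Polish": "cukrzycę",
    "Tamil": "எலி",
}

byte_counts = {lang: len(word.encode("utf-8")) for lang, word in samples.items()}
ratios = {lang: utf8_bytes_per_char(word) for lang, word in samples.items()}

for lang in samples:
    print(f"{lang}: {byte_counts[lang]} bytes, {ratios[lang]:.3f} bytes per character")
```

ASCII letters take one byte each, the Polish `ę` takes two, and every Tamil code point takes three, so the same content yields longer byte sequences in non-Latin scripts.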


### Training Hyperparameters

We used the same hyperparameters as in the original ByT5 paper.
The only difference is that we decreased the number of training steps to 250,000 to avoid overfitting.

### Computational Infrastructure

Models were trained on TPUs available through TPU Research Cloud (TRC).
We used a v3-8 TPU for training the small and base models and a v3-32 TPU for the large model.
The training for each instance took:

- **Small**: 90h
- **Base**: 230h
- **Large**: 190h
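
Because the large model ran on a bigger TPU slice, wall-clock hours alone understate its cost. The sketch below converts the figures above into TPU core-hours. It assumes that the numeric suffix of a v3 slice name is its core count (8 for v3-8, 32 for v3-32); the function name is illustrative, and the results are back-of-the-envelope arithmetic, not reported measurements.

```python
# Training setups as stated above: (TPU slice, wall-clock hours).
runs = {
    "small": ("v3-8", 90),
    "base": ("v3-8", 230),
    "large": ("v3-32", 190),
}


def tpu_core_hours(slice_name, hours):
    """Core-hours for a run, assuming the slice-name suffix is the core count."""
    cores = int(slice_name.split("-")[1])
    return cores * hours


core_hours = {size: tpu_core_hours(*run) for size, run in runs.items()}

for size, value in core_hours.items():
    print(f"{size}: {value} TPU core-hours")
```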

## Evaluation

MyT5 models are compared with a reimplementation of [ByT5](https://huggingface.co/docs/transformers/model_doc/byt5) models trained for 250,000 steps.

### Language Modeling

We evaluated language modeling performance on the multi-parallel [FLORES 200](https://arxiv.org/pdf/2207.04672v3.pdf) corpus.
To compare scores across languages and models, we used a normalized metric, Bit-per-English-Byte (BPEB).

#### Results

| Model size | Languages | ByT5 BPEB | ByT5 T (ms) | MyT5 BPEB | MyT5 T (ms) |
|-------|-----------|------|--------|------|--------|
| small | All       | 10.1 | 7.0    | 4.6  | 6.7    |
|       | Latin     | 4.6  | 5.9    | 4.2  | 6.6    |
|       | Non Latin | 18.1 | 8.5    | 5.1  | 6.8    |
| base  | All       | 8.2  | 11.5   | 5.8  | 8.9    |
|       | Latin     | 4.9  | 9.4    | 5.0  | 8.7    |
|       | Non Latin | 13.0 | 14.6   | 6.9  | 9.1    |
| large | All       | 13.4 | 31.8   | 4.6  | 26.7   |
|       | Latin     | 10.1 | 28.1   | 4.0  | 26.6   |
|       | Non Latin | 18.2 | 37.3   | 5.4  | 27.0   |

Bit-per-English-Byte (BPEB) and inference time T (average per FLORES 200 sentence), averaged over three language groupings; lower is better for both.
Inference was run on an A40 GPU core.
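
As a minimal sketch of the BPEB metric reported above, the function below sums the negative log2-probabilities of a sentence's tokens and divides by the UTF-8 byte length of the parallel English sentence. This is our reading of the metric's name and normalization, stated here as an assumption; see the paper for the exact definition. The function and argument names are hypothetical.

```python
import math


def bits_per_english_byte(token_probs, english_sentence):
    """Total bits needed to encode a sentence, per byte of its English translation.

    token_probs: model probability of each gold token of the sentence (any language).
    english_sentence: the parallel English sentence used for normalization.
    """
    total_bits = -sum(math.log2(p) for p in token_probs)
    n_english_bytes = len(english_sentence.encode("utf-8"))
    return total_bits / n_english_bytes


# Toy example: two tokens costing 1 and 2 bits, normalized by a 3-byte English string.
example_bpeb = bits_per_english_byte([0.5, 0.25], "abc")
```

Normalizing by the English byte count rather than by each language's own token count keeps scores comparable across languages and tokenizations, since FLORES 200 sentences are parallel translations of the same content.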


## Citation

```bibtex
@misc{limisiewicz2024myte,
      title={MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling}, 
      author={Tomasz Limisiewicz and Terra Blevins and Hila Gonen and Orevaoghene Ahia and Luke Zettlemoyer},
      year={2024},
      eprint={2403.10691},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```


## Model Card Author

Tomasz Limisiewicz