MyT5
Model Details
MyT5 (Myte T5) is a multilingual language model based on the T5 architecture. The model uses a morphologically-driven byte (MYTE) representation, described in our paper (Limisiewicz et al., 2024).
Model Description
- Developed by: Tomasz Limisiewicz, Terra Blevins, Hila Gonen, Orevaoghene Ahia, Luke Zettlemoyer
- Funded by: University of Washington Fellowship, Charles University Grant Agency
- Model type: T5
- Language(s) (NLP): Multilingual
- License: MIT
Model Sizes
Model Sources
How to Get Started with the Model
The snippet below shows the basic usage of the model for multilingual language modeling.
The custom tokenizer is available in the GitHub repository, in src/myt5/myt5_tokenizer.py. We also plan to release it on HuggingFace in the future.
import torch
from transformers import T5ForConditionalGeneration

# The tokenizer lives in the GitHub repository; this import assumes the repository root is on the Python path.
from src.myt5.myt5_tokenizer import MyT5Tokenizer

MODEL_SIZE = "large"  # small, base, or large

model = T5ForConditionalGeneration.from_pretrained(f"Tomlim/MyT5_{MODEL_SIZE}", use_safetensors=True)
tokenizer = MyT5Tokenizer()
pre_texts = ['"We now have',
'„Mamy teraz myszy w wieku',
'"""எங்களிடம் இப்போது']
post_texts = ['4-month-old mice that are non-diabetic that used to be diabetic," he added.',
'4 miesięcy, które miały cukrzycę, ale zostały z niej wyleczone” – dodał.',
'4-மாத-வயதுடைய எலி ஒன்று உள்ளது, முன்னர் அதற்கு நீரிழிவு இருந்தது தற்போது இல்லை"" என்று அவர் மேலும் கூறினார்."']
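# Encode the sentence prefixes (encoder inputs) and their continuations (decoder targets).
inputs = tokenizer(pre_texts, padding="longest", return_tensors="pt")
targets = tokenizer(post_texts, padding="longest", return_tensors="pt")
# Forward pass with the continuations as labels; logits have shape (batch, target_len, vocab).
outputs = model(**inputs, labels=targets.input_ids)
# Probability distribution over the vocabulary at each target position.
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)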
Training Details
Training Data
The model was trained on the standard T5 task of restoring corrupted spans, using the mC4 (multilingual C4) dataset.
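To make the span-corruption objective concrete, here is a minimal sketch on a toy, word-level sequence. The sentinel names follow the original T5 convention; the function name `corrupt_spans` and the hand-picked spans are illustrative assumptions. The actual training pipeline samples spans randomly and operates on MYTE byte sequences rather than words.

```python
def corrupt_spans(tokens, spans):
    """Replace each (start, end) span with a sentinel and build the matching target."""
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])  # keep the uncorrupted tokens
        inputs.append(sentinel)              # mark where a span was removed
        targets.append(sentinel)             # the target repeats the sentinel...
        targets.extend(tokens[start:end])    # ...followed by the removed tokens
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return inputs, targets


tokens = "We now have 4-month-old mice that are non-diabetic".split()
inputs, targets = corrupt_spans(tokens, spans=[(1, 3), (5, 6)])
print(inputs)
print(targets)
```

The model receives `inputs` and is trained to generate `targets`, that is, to restore the removed spans.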
Preprocessing
Instead of UTF-8 bytes, we used a morphologically-driven byte (MYTE) representation. See the description in our paper for more details.
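As background for this choice, the sketch below measures how many UTF-8 bytes per character the example prefixes from the usage snippet require. It does not implement MYTE; it only illustrates that plain UTF-8 encodes scripts at different lengths. The helper name `bytes_per_char` is an illustrative assumption.

```python
def bytes_per_char(text):
    """Average number of UTF-8 bytes needed per character of `text`."""
    return len(text.encode("utf-8")) / len(text)


# Prefixes taken from the usage example above (the English one without its opening quote).
texts = {
    "English": "We now have",
    "Polish": "„Mamy teraz myszy w wieku",
    "Tamil": "எங்களிடம் இப்போது",
}

for language, text in texts.items():
    print(f"{language}: {bytes_per_char(text):.2f} bytes per character")
```

A byte-level model therefore processes longer sequences for some scripts than for others when the input is plain UTF-8.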
Training Hyperparameters
We used the same hyperparameters as in the original ByT5 paper. The only difference is that we decreased the number of training steps to 250,000 to avoid overfitting.
Computational Infrastructure
Models were trained on TPUs available through the TPU Research Cloud (TRC). We used a v3-8 TPU for training the small and base models and a v3-32 TPU for the large model. Training took:
- Small: 90h
- Base: 230h
- Large: 190h
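The wall-clock times above are not directly comparable, because the large model ran on a bigger TPU slice. A rough normalization, assuming the numeric suffix of the TPU name is the number of cores (8 for v3-8, 32 for v3-32), is to multiply cores by hours:

```python
# (TPU cores, training hours) per model size, taken from the list above.
runs = {
    "small": (8, 90),
    "base": (8, 230),
    "large": (32, 190),
}

# Core-hours are a coarse proxy for total compute, not a measured cost.
core_hours = {size: cores * hours for size, (cores, hours) in runs.items()}

for size, value in core_hours.items():
    print(f"{size}: {value} TPU core-hours")
```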
Evaluation
MyT5 models are compared with a reimplementation of ByT5 models trained for 250,000 steps.
Language Modeling
We evaluated LM performance on the multi-parallel FLORES 200 corpus. To compare scores across languages and models, we used a normalized metric, Bit-per-English-Byte (BPEB).
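A minimal sketch of how such a normalized score could be computed, assuming BPEB is the negative log-likelihood of a sentence in bits divided by the UTF-8 byte length of its English parallel sentence (see the paper for the exact definition). The function name and the toy probabilities are hypothetical.

```python
import math


def bits_per_english_byte(token_probs, english_sentence):
    """Sentence negative log-likelihood in bits, normalized by English byte length.

    token_probs: probability the model assigns to each gold target token
                 (hypothetical input, e.g. gathered from the softmax output).
    english_sentence: the English side of the multi-parallel sentence.
    """
    total_bits = -sum(math.log2(p) for p in token_probs)
    english_bytes = len(english_sentence.encode("utf-8"))
    return total_bits / english_bytes


# Toy example: 1 + 2 + 1 + 3 = 7 bits over an 8-byte English sentence.
toy_probs = [0.5, 0.25, 0.5, 0.125]
score = bits_per_english_byte(toy_probs, "Hi there")
print(score)
```

Because the denominator is always the English sentence, the score does not depend on how many bytes or tokens a given language or model uses for the same content.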
Results
Model size | Languages | ByT5 BPEB | ByT5 T (ms) | MyT5 BPEB | MyT5 T (ms) |
---|---|---|---|---|---|
small | All | 10.1 | 7.0 | 4.6 | 6.7 |
small | Latin | 4.6 | 5.9 | 4.2 | 6.6 |
small | Non-Latin | 18.1 | 8.5 | 5.1 | 6.8 |
base | All | 8.2 | 11.5 | 5.8 | 8.9 |
base | Latin | 4.9 | 9.4 | 5.0 | 8.7 |
base | Non-Latin | 13.0 | 14.6 | 6.9 | 9.1 |
large | All | 13.4 | 31.8 | 4.6 | 26.7 |
large | Latin | 10.1 | 28.1 | 4.0 | 26.6 |
large | Non-Latin | 18.2 | 37.3 | 5.4 | 27.0 |
Bit-per-English-Byte (BPEB) and inference times in milliseconds (T, average per FLORES 200 sentence), averaged over three language groupings. Inference was run on an A40 GPU core.
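One way to read the table is to look at the gap between script groups. The sketch below computes the ratio of Non-Latin to Latin BPEB for each model, using only the values reported above; the ratio itself is an illustrative summary and not a metric from the paper.

```python
# (Latin BPEB, Non-Latin BPEB) copied from the results table.
bpeb = {
    "small": {"ByT5": (4.6, 18.1), "MyT5": (4.2, 5.1)},
    "base": {"ByT5": (4.9, 13.0), "MyT5": (5.0, 6.9)},
    "large": {"ByT5": (10.1, 18.2), "MyT5": (4.0, 5.4)},
}

# A ratio close to 1 means Latin and Non-Latin scripts are modeled about equally well.
gap = {
    size: {model: non_latin / latin for model, (latin, non_latin) in models.items()}
    for size, models in bpeb.items()
}

for size, ratios in gap.items():
    print(size, {model: round(ratio, 2) for model, ratio in ratios.items()})
```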
Downstream Tasks
We tested the large model on four end tasks: question answering, NER, semantic parsing, and machine translation. The test data come from the XTREME-UP benchmark (Ruder, Clark et al., 2023), which mainly covers low-resource languages.
Fine-tuning
In each task, we fine-tuned on all languages jointly. We used a learning rate of 1e-3 with square-root decay and a dropout of 0.1. The batch size and the number of training steps varied across tasks:
- NER: 128 examples per batch, 6000 steps
- QA: 64 examples per batch, 6500 steps
- Semantic Parsing: 64 examples per batch, 1000 steps
- MT: 64 examples per batch, 10000 steps
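The settings above can be sketched as a small configuration with an inverse square-root learning-rate schedule. The exact form of the decay is not given here, so the formula and the `WARMUP_STEPS` constant are assumptions chosen for illustration; only the base learning rate, batch sizes, and step counts come from the text.

```python
import math

BASE_LR = 1e-3       # from the text
WARMUP_STEPS = 1000  # hypothetical: the step after which decay starts


def inverse_sqrt_lr(step, base_lr=BASE_LR, warmup_steps=WARMUP_STEPS):
    """Constant at base_lr until warmup_steps, then decays proportionally to 1/sqrt(step)."""
    return base_lr * math.sqrt(warmup_steps / max(step, warmup_steps))


# (examples per batch, training steps) per task, from the list above.
FINETUNE = {
    "NER": (128, 6000),
    "QA": (64, 6500),
    "Semantic Parsing": (64, 1000),
    "MT": (64, 10000),
}

# Total number of training examples processed per task (with repetition).
examples_seen = {task: batch * steps for task, (batch, steps) in FINETUNE.items()}

for task, (batch, steps) in FINETUNE.items():
    print(f"{task}: final lr {inverse_sqrt_lr(steps):.2e}, {examples_seen[task]} examples seen")
```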
Results
Model | QA (F1) | NER (F1) | Semantic Parsing (EM) | MT (chrF) |
---|---|---|---|---|
Flan-PaLM* | 22.9 | 12.0 | 0.1 | --- |
mT5* | 59.7 | 74.0 | 21.8 | --- |
ByT5 | 73.2 | 81.5 | 25.1 | 20.1 |
MyT5 | 75.3 | 80.8 | 19.6 | 20.4 |

Inference time per example (ms):

Model | QA | NER | Semantic Parsing | MT |
---|---|---|---|---|
ByT5 | 36.2 | 13.8 | 13.2 | 15.9 |
MyT5 | 35.6 | 12.6 | 12.4 | 12.6 |
Average results on XTREME-UP tasks across low-resource languages. The baseline results of mT5 and Flan-PaLM (in-context-learning evaluation), marked with *, are those reported in Ruder, Clark et al., 2023. The reported inference time is an average across evaluation examples; inference was run on an A40 GPU core.
Citation
@misc{limisiewicz2024myte,
title={MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling},
author={Tomasz Limisiewicz and Terra Blevins and Hila Gonen and Orevaoghene Ahia and Luke Zettlemoyer},
year={2024},
eprint={2403.10691},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Model Card Author