---
library_name: transformers
tags: []
language:
- en
- fr
- es
- de
- el
- bg
- ru
- tr
- ar
- vi
- th
- zh
- hi
- sw
- ur
datasets:
- allenai/c4
---
<div align="center">
# Model Card for MrT5 Large
[**MrT5: Dynamic Token Merging for Efficient Byte-level Language Models**](https://arxiv.org/pdf/2410.20771)\
(Kallini et al., 2024)
</div>

<!-- Provide a quick summary of what the model is/does. -->
**MrT5** (**M**e**r**ge**T5**) is a more efficient variant of [ByT5 (Xue et al., 2022)](https://arxiv.org/abs/2105.13626) that integrates a token deletion mechanism in its encoder to *dynamically* shorten the input sequence length. After processing through a fixed number of encoder layers, a learned *delete gate* determines which tokens are to be removed and which are to be retained for subsequent layers. By effectively "merging" critical information from deleted tokens into a more compact sequence, MrT5 presents a solution to the practical limitations of existing byte-level models.
## Citation
If you use this model, please cite the MrT5 paper:
```bibtex
@inproceedings{
kallini2025mrt,
title={MrT5: Dynamic Token Merging for Efficient Byte-level Language Models},
author={Julie Kallini and Shikhar Murty and Christopher D Manning and Christopher Potts and R{\'o}bert Csord{\'a}s},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=VYWBMq1L7H}
}
```
Also cite the ByT5 paper:
```bibtex
@article{xue-etal-2022-byt5,
title = "{B}y{T}5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models",
author = "Xue, Linting and
Barua, Aditya and
Constant, Noah and
Al-Rfou, Rami and
Narang, Sharan and
Kale, Mihir and
Roberts, Adam and
Raffel, Colin",
editor = "Roark, Brian and
Nenkova, Ani",
journal = "Transactions of the Association for Computational Linguistics",
volume = "10",
year = "2022",
address = "Cambridge, MA",
publisher = "MIT Press",
url = "https://aclanthology.org/2022.tacl-1.17",
doi = "10.1162/tacl_a_00461",
pages = "291--306",
}
```
## Model Details
This is the model card for the 1.23B-parameter **MrT5 Large** (`mrt5-large`), a more efficient variant of ByT5 Large (`google/byt5-large`). This model is trained to reduce sequence lengths by ~50% on average.
- **Developed by:** Julie Kallini, Shikhar Murty, Christopher D. Manning, Christopher Potts, Róbert Csordás
- **Model type:** MrT5
- **Languages:** English, French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili, and Urdu
- **Fine-tuned from model:** [google/byt5-large](https://huggingface.co/google/byt5-large)
- **Sources for more information**:
- [GitHub Repository](https://github.com/jkallini/mrt5)
- [Paper](https://arxiv.org/abs/2410.20771)
### Model Architecture
MrT5 Large uses the model configuration of the standard ByT5 Large, which has a feed-forward dimensionality of 3840, a model dimensionality of 1536, 36 encoder layers, 12 decoder layers, 16 attention heads in each layer, and 1.23B total parameters.
MrT5 has an additional *delete gate*, which dynamically reduces the encoder sequence length. In this model, the gate is placed after the third encoder layer, so all subsequent layers operate on a reduced sequence. The model was trained with a deletion rate of δ=0.5, meaning it shortens its encoder sequence length by roughly 50% after the third layer. The gating mechanism introduces only about 3,000 additional parameters.
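For intuition, a minimal sketch of what such a gate could look like is shown below (hard deletion only). The class name, threshold, and use of a LayerNorm before the projection are our assumptions for illustration; the actual gate (including the soft deletion used during training) is implemented in the model's remote code and the GitHub repository.
```python
import torch
import torch.nn as nn

class DeleteGateSketch(nn.Module):
    """Illustrative hard-deletion gate: scores each encoder position and
    keeps only the positions whose gate value exceeds a threshold."""
    def __init__(self, d_model: int = 1536, threshold: float = 0.5):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)  # assumed normalization before the projection
        self.proj = nn.Linear(d_model, 1)  # one scalar gate score per position
        self.threshold = threshold

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, d_model)
        gate = torch.sigmoid(self.proj(self.norm(hidden_states)))  # (batch, seq_len, 1)
        return gate.squeeze(-1) > self.threshold  # boolean keep-mask per position
```
Positions whose mask is `False` are dropped before the remaining encoder layers, so the later layers attend over a shorter sequence.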
MrT5 Large is initialized from ByT5 Large and fine-tuned on the same training objective. Only MrT5's delete gate is randomly initialized before training.
The other distinguishing feature of MrT5 is that it uses [softmax1](https://www.evanmiller.org/attention-is-off-by-one.html) in its attention mechanism.
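For reference, softmax1 adds 1 to the softmax denominator, which lets an attention head assign (near-)zero total weight to its inputs. Below is a minimal, numerically stable sketch; the function name is ours, and the model's actual attention code ships with the remote code.
```python
import torch

def softmax_one(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """softmax1(x)_i = exp(x_i) / (1 + sum_j exp(x_j)),
    i.e., a softmax with an extra, implicit logit fixed at 0."""
    # Include the implicit 0 logit when taking the max for numerical stability.
    m = x.max(dim=dim, keepdim=True).values.clamp(min=0.0)
    exp_x = torch.exp(x - m)
    return exp_x / (torch.exp(-m) + exp_x.sum(dim=dim, keepdim=True))
```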
## Uses
This model is an encoder-decoder architecture designed primarily for sequence-to-sequence tasks. While it can be used as-is for exploratory or academic purposes, fine-tuning is recommended to achieve optimal performance on specific downstream tasks.
To leverage the model’s deletion feature, please use the custom **MrT5Trainer** available in the [accompanying repository](https://github.com/jkallini/mrt5). This specialized trainer ensures that the deletion mechanism is properly maintained and integrated during fine-tuning.
Because this is a base model built for academic and research explorations, it is not intended for production-grade deployments. Users should carefully evaluate the model’s outputs, especially in any setting where reliability and robustness are critical.
## Bias, Risks, and Limitations
Language models are known to exhibit various forms of social bias and may produce harmful or offensive content ([Bender et al., 2021](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922); [Bommasani et al., 2022](https://arxiv.org/abs/2108.07258); [Liang et al., 2022](https://arxiv.org/abs/2211.09110)). Like other language models, this model may produce biased or harmful outputs. It has not been fine-tuned for safety and should be used with caution, especially in sensitive contexts.
## How to Get Started with the Model
Like ByT5, MrT5 works on raw UTF-8 bytes and can be used without a tokenizer. Make sure to set `trust_remote_code=True` to load the MrT5 code:
```python
from transformers import AutoModelForSeq2SeqLM
import torch
model = AutoModelForSeq2SeqLM.from_pretrained('stanfordnlp/mrt5-large', trust_remote_code=True)
input_ids = torch.tensor([list("Life is like a box of chocolates.".encode("utf-8"))]) + 3  # shift byte values by 3 to account for the special tokens (pad=0, eos=1, unk=2)
labels = torch.tensor([list("La vie est comme une boîte de chocolat.".encode("utf-8"))]) + 3  # same offset for the target bytes
# Forward pass with hard deletion
loss = model(input_ids, labels=labels, hard_delete=True).loss
```
For batched inference and training, you can use ByT5's tokenizer class:
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model = AutoModelForSeq2SeqLM.from_pretrained('stanfordnlp/mrt5-large', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('google/byt5-large')
model_inputs = tokenizer(["Life is like a box of chocolates.", "Today is Monday."], padding="longest", return_tensors="pt")
labels = tokenizer(["La vie est comme une boîte de chocolat.", "Aujourd'hui c'est lundi."], padding="longest", return_tensors="pt").input_ids
# Forward pass with hard deletion
loss = model(**model_inputs, labels=labels, hard_delete=True).loss
```
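For actual generation and decoding, the standard Hugging Face `generate` API applies. The snippet below is a usage sketch continuing the batched example above; the generation settings are illustrative, and how hard deletion is applied at inference time is handled by the model's remote code (see the repository for details).
```python
# Usage sketch (continues the batched example above; settings are illustrative)
generated_ids = model.generate(**model_inputs, max_new_tokens=64)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))
```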
## Training Details
### Training Data
For continued pre-training, we use the [multilingual C4 (mC4) corpus](https://huggingface.co/datasets/allenai/c4) ([Raffel et al., 2020](https://arxiv.org/abs/1910.10683); [Xue et al., 2021](https://arxiv.org/abs/2010.11934)). MrT5 is trained on 15 typologically diverse languages: English, French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili, and Urdu.
To avoid training models for multiple epochs, we ensure that the samples drawn from the mC4 corpus are sufficiently large. Additionally, we extract equal-sized samples for each language (in terms of bytes) from the mC4 training split.
### Training Procedure
MrT5 is trained on the ByT5 span corruption pre-training objective. In this task, spans of tokens in unlabeled text are replaced with a single *sentinel token* ID per span, and the model must fill in the missing tokens. For ByT5 and MrT5, these are spans of bytes, so the masked spans need not respect word boundaries.
#### Preprocessing
When training on the span corruption objective, we sample corrupted spans such that the average masked span length is 20 tokens and the noise density is 15% (i.e., 15% of the tokens in each sequence are masked out), following the specification in the ByT5 paper.
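For illustration, here is a toy byte-level span-corruption example; the spans are much shorter than the actual 20-byte average, and `<X>`, `<Y>`, `<Z>` stand in for sentinel token IDs. Note that the masked spans cross word boundaries.
```text
Original: "The quick brown fox jumps over the lazy dog."
Input:    "The qui<X>wn fox ju<Y>he lazy dog."
Target:   "<X>ck bro<Y>mps over t<Z>"
```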
#### Optimization
MrT5 is trained for 5,000 gradient steps over batches of 2^20 tokens (i.e., an encoder sequence length of 1024 with an effective batch size of 1024). We use the AdamW optimizer with an initial learning rate of 1e-4, linear decay, and no warmup.
To achieve a specific sequence length reduction rate, we use a PI controller with a target deletion ratio of δ=0.5, as described in Section 3.2 of the paper. We also use attention score regularization, as described in Appendix D of the paper.
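A generic proportional-integral (PI) controller for this purpose is sketched below. The gains, update rule, and clamping are placeholders rather than the paper's exact hyperparameters; see Section 3.2 of the paper and the repository for the actual controller.
```python
class PIControllerSketch:
    """Generic PI controller that steers the weight of the deletion
    regularizer toward a target deletion ratio (illustrative only)."""
    def __init__(self, target: float = 0.5, k_p: float = 0.5, k_i: float = 0.01):
        self.target = target
        self.k_p = k_p
        self.k_i = k_i
        self.integral = 0.0

    def update(self, observed_deletion_ratio: float) -> float:
        # Positive error -> the model deletes too little -> increase the penalty weight.
        error = self.target - observed_deletion_ratio
        self.integral += error
        alpha = self.k_p * error + self.k_i * self.integral
        return max(alpha, 0.0)  # keep the regularizer weight non-negative
```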
## Environmental Impact
- **Hardware Type:** NVIDIA A100-SXM4-80GB
- **GPU Count**: 4
- **Hours used:** ~73 hours
- **Cloud Provider:** Stanford NLP Cluster
## Model Card Authors
Julie Kallini \
[email protected]