---
language:
- nl
- en
- multilingual
license: apache-2.0
tags:
- dutch
- english
- t5
- t5x
- ul2
- seq2seq
datasets:
- yhavinga/mc4_nl_cleaned
- yhavinga/nedd_wiki_news
inference: false
---

# ul2-large-dutch-english for Dutch and English

Pretrained T5 model on Dutch and English using a UL2 (Mixture-of-Denoisers) objective.
The T5 model was introduced in
[this paper](https://arxiv.org/abs/1910.10683)
and first released at [this page](https://github.com/google-research/text-to-text-transfer-transformer).
The UL2 objective was introduced in
[this paper](https://arxiv.org/abs/2205.05131)
and first released at [this page](https://github.com/google-research/google-research/tree/master/ul2).

**Note:** The Hugging Face inference widget is deactivated because this model needs to be fine-tuned
on a specific downstream task, in a text-to-text fashion, to be useful in practice.

## Model description

T5 is an encoder-decoder model that treats all NLP problems in a text-to-text format.
`ul2-large-dutch-english` is a T5 transformers model pretrained on a very large corpus of
Dutch and English data in a self-supervised fashion.
This means it was pretrained on raw texts only, with no humans labelling them in any way
(which is why it can use lots of publicly available data), using an automatic process to generate
inputs and outputs from those texts.

This model uses the [T5 v1.1](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) improvements over the original T5 model during pretraining:

- GEGLU activation in the feed-forward hidden layer, rather than ReLU - see [here](https://arxiv.org/abs/2002.05202)
- Dropout was turned off during pre-training. Dropout should be re-enabled during fine-tuning (see the config sketch below)
- Pre-trained on the self-supervised objective only, without mixing in the downstream tasks
- No parameter sharing between the embedding and classifier layer
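
The sketch below shows one way to check these architectural choices in the released configuration and to re-enable dropout when loading the model for fine-tuning; the `dropout_rate=0.1` value is only an illustrative choice, not a recommendation from this card.

```python
from transformers import T5Config, T5ForConditionalGeneration

config = T5Config.from_pretrained("yhavinga/ul2-large-dutch-english")
print(config.feed_forward_proj)    # "gated-gelu" (GEGLU feed-forward)
print(config.tie_word_embeddings)  # False (no embedding/classifier sharing)
print(config.dropout_rate)         # check the shipped value; dropout was off during pre-training

# Example value only: override dropout when loading the model for fine-tuning.
model = T5ForConditionalGeneration.from_pretrained(
    "yhavinga/ul2-large-dutch-english", dropout_rate=0.1
)
```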

### UL2 pretraining objective

This model was pretrained with UL2's Mixture-of-Denoisers (MoD) objective, which combines diverse pre-training
paradigms. UL2 frames different objective functions for training language models as denoising tasks, where
the model has to recover missing sub-sequences of a given input. During pre-training it uses a novel mixture-of-denoisers
that samples from a varied set of such objectives, each with different configurations. UL2 is trained using a mixture of
three denoising tasks:

1. R-denoising (or regular span corruption), which emulates the standard T5 span corruption objective;
2. X-denoising (or extreme span corruption); and
3. S-denoising (or sequential PrefixLM).

During pre-training, we sample from the available denoising tasks based on user-specified ratios.
UL2 introduces a notion of mode switching, wherein downstream fine-tuning is associated with a specific pre-training
denoising task. During pre-training, a paradigm token is inserted into the input
(`[NLU]` for R-denoising, `[NLG]` for X-denoising, or `[S2S]` for S-denoising) indicating the denoising task at hand.
Then, during fine-tuning, the same paradigm token should be inserted to get the best performance on different downstream
fine-tuning tasks; see the illustrative example below.
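
To make this concrete, here is a purely illustrative R-denoising style example: the input carries the `[NLU]` paradigm token and masked spans, and the target reconstructs the masked spans. The `<extra_id_*>` sentinel notation is the standard T5 convention; the exact sentinel format used during pretraining is defined by the UL2 objective code referenced in the Pretraining section below.

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("yhavinga/ul2-large-dutch-english", use_fast=False)

# Illustrative R-denoising (regular span corruption) example with the [NLU] paradigm token.
inputs = "[NLU] Het Rijksmuseum is <extra_id_0> museum in Amsterdam <extra_id_1> het Museumplein."
targets = "<extra_id_0> een nationaal <extra_id_1> aan"

batch = tokenizer(inputs, text_target=targets, return_tensors="pt")
print(batch["input_ids"].shape, batch["labels"].shape)
```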

## Intended uses & limitations

This model was only pretrained in a self-supervised way, without any supervised training.
Therefore, unlike Google's original T5 model, this model has to be fine-tuned before it is usable on a downstream task,
like text classification.

**Note:** You most likely need to fine-tune these T5/UL2 models without mixed precision,
i.e. with full fp32 precision. Fine-tuning with Flax in bf16 - `model.to_bf16()` - is possible
if you set the mask correctly to exclude layernorm and embedding layers (see the sketch below). Also note that the T5x pre-training
and fine-tuning configs set `z_loss` to 1e-4, which is used to keep the loss scale from underflowing.
You can find more fine-tuning tips [here](https://discuss.huggingface.co/t/t5-finetuning-tips), for example.
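
A minimal sketch of such a bf16 cast with Flax, assuming the usual HF Flax T5 parameter naming (`layer_norm`, `final_layer_norm`, `embedding`); inspect `model.params` to confirm the exact names before relying on the mask.

```python
from flax import traverse_util
from transformers import FlaxT5ForConditionalGeneration

model = FlaxT5ForConditionalGeneration.from_pretrained("yhavinga/ul2-large-dutch-english")

# Build a mask that casts everything to bf16 except layer norm and embedding parameters.
# The name matching below is an assumption about this checkpoint's parameter tree;
# verify with traverse_util.flatten_dict(model.params).keys().
flat_params = traverse_util.flatten_dict(model.params)
mask = {
    path: not any("layer_norm" in name or "embedding" in name for name in path)
    for path in flat_params
}
mask = traverse_util.unflatten_dict(mask)
model.params = model.to_bf16(model.params, mask)
```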

**Note**: For fine-tuning, you can most likely get better results if you prepend a prefix token
of `[NLU]`, `[NLG]`, or `[S2S]` to your input texts.
For general language understanding fine-tuning tasks, you could use the `[NLU]` token.
For GPT-style causal language generation, you could use the `[S2S]` token.
The `[NLG]` token of the X-denoising pretraining task sits somewhere between language understanding and causal language
generation, so the `[NLG]` token may also be worth trying for language generation fine-tuning.
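
For example, a tiny preprocessing helper along these lines can be used when building fine-tuning (and later inference) inputs; the helper name and the sample sentence are purely hypothetical.

```python
# Hypothetical helper: prepend the UL2 paradigm token that matches your downstream task.
def add_mode_token(text: str, mode: str = "[NLU]") -> str:
    return f"{mode} {text}"

print(add_mode_token("Wat een geweldige film, ik heb genoten van begin tot eind."))
# -> "[NLU] Wat een geweldige film, ik heb genoten van begin tot eind."
```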

### How to use

Here is how to use this model in PyTorch:

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("yhavinga/ul2-large-dutch-english", use_fast=False)
model = T5ForConditionalGeneration.from_pretrained("yhavinga/ul2-large-dutch-english")
```

and in Flax:

```python
from transformers import T5Tokenizer, FlaxT5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("yhavinga/ul2-large-dutch-english", use_fast=False)
model = FlaxT5ForConditionalGeneration.from_pretrained("yhavinga/ul2-large-dutch-english")
```
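
As a quick smoke test with the PyTorch `tokenizer` and `model` from the first snippet (remember that, before fine-tuning, the generated text is only a denoising-style continuation and not generally useful output):

```python
# Smoke test only: the pretrained model has not been fine-tuned on any downstream task.
input_ids = tokenizer(
    "[NLU] Het Rijksmuseum is <extra_id_0> museum in Amsterdam.", return_tensors="pt"
).input_ids
outputs = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```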

### Limitations and bias

The training data used for this model contains a lot of unfiltered content from the internet, which is far from neutral.
Therefore, the model can produce biased predictions. This bias will also affect all fine-tuned versions of this model.

## Training data

The `ul2-large-dutch-english` T5 model was pre-trained simultaneously on a combination of several datasets,
including the `full_en_nl` config of the "mc4_nl_cleaned" dataset (a cleaned version of Common Crawl's web
crawl corpus), Dutch books, the Dutch subset of Wikipedia (2022-03-20), the English subset of Wikipedia (2022-03-01),
and a subset of "mc4_nl_cleaned" containing only texts from Dutch newspapers.
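
The largest component can be streamed directly with the `datasets` library; this is a minimal sketch assuming the `full_en_nl` config name mentioned above.

```python
from datasets import load_dataset

# Stream the mC4 NL/EN component so the full corpus does not have to be downloaded.
mc4_en_nl = load_dataset("yhavinga/mc4_nl_cleaned", "full_en_nl", split="train", streaming=True)
print(next(iter(mc4_en_nl)))
```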

## Training procedure

### Preprocessing

The ul2-large-dutch-english T5 model uses a SentencePiece unigram tokenizer with a vocabulary of 32,000 tokens.
The tokenizer includes the special tokens `<pad>`, `</s>` and `<unk>`, known from the original T5 paper,
`[NLU]`, `[NLG]` and `[S2S]` for the MoD pre-training, and `<n>` for newlines.
During pre-training with the UL2 objective, input and output sequences consist of 512 consecutive tokens.
The tokenizer does not lowercase texts and is therefore case-sensitive; it distinguishes
between `dutch` and `Dutch`.
Additionally, 100+28 extra tokens were added for pre-training tasks, resulting in a total of 32,128 tokens.
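
A quick way to inspect these special tokens (a sanity-check sketch; it assumes the paradigm tokens are stored as single vocabulary items in this tokenizer):

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("yhavinga/ul2-large-dutch-english", use_fast=False)

print(len(tokenizer))  # total vocabulary size, including the extra tokens
for token in ["<pad>", "</s>", "<unk>", "[NLU]", "[NLG]", "[S2S]", "<n>"]:
    print(token, tokenizer.convert_tokens_to_ids(token))
```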

### Pretraining
The model was trained on a TPUv3-8 VM, sponsored by the [Google TPU Research Cloud](https://sites.research.google/trc/about/),
for 2,650,000 steps with a batch size of 64
(84B tokens in total).
The optimizer used was AdaFactor, with a learning rate warmup for 10K steps at a constant learning rate of 1e-2,
followed by an inverse square root decay of the learning rate.
The model was trained with Google's Jax/Flax based [t5x framework](https://github.com/google-research/t5x), with help
from [Stephenn Fernandes](https://huggingface.co/StephennFernandes) to get started writing task definitions that wrap
HF datasets.
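
For reference, the schedule described above can be written as follows; this is a plain-Python sketch of the standard T5/t5x rsqrt schedule, not the exact gin configuration used for this run.

```python
import math

def learning_rate(step: int, warmup_steps: int = 10_000) -> float:
    """Constant 1e-2 during warmup (1 / sqrt(10_000)), then inverse square root decay."""
    return 1.0 / math.sqrt(max(step, warmup_steps))

print(learning_rate(1))        # 0.01 during warmup
print(learning_rate(100_000))  # ~0.0032 after decay
```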

The UL2 training objective code used with the [t5x framework](https://github.com/google-research/t5x) was copied and
slightly modified from the [UL2 paper](https://arxiv.org/pdf/2205.05131.pdf) appendix chapter 9.2 by the authors
of the Finnish UL2 models. The UL2 objective code that was used is available in the repository
[Finnish-NLP/ul2-base-nl36-finnish](https://huggingface.co/Finnish-NLP/ul2-base-nl36-finnish) in the files `ul2_objective.py` and `tasks.py`.
The mixture-of-denoisers configuration was otherwise equal to that of the UL2 paper, except for the denoiser mixing rates:
20% was used for S-denoising (as suggested in chapter 4.5 of the paper)
and the rest was divided equally between R-denoising and X-denoising (i.e. 40% each).

### Model list

Models in this series:
|                       | ul2-base-dutch-english | ul2-large-dutch-english | ul2-small-dutch-english |
|:----------------------|:-----------------------|:------------------------|:------------------------|
| model_type            | t5                     | t5                      | t5                      |
| _pipeline_tag         | text2text-generation   | text2text-generation    | text2text-generation    |
| d_model               | 768                    | 1024                    | 512                     |
| d_ff                  | 2048                   | 2816                    | 1024                    |
| num_heads             | 12                     | 16                      | 6                       |
| d_kv                  | 64                     | 64                      | 64                      |
| num_layers            | 12                     | 24                      | 8                       |
| num_decoder_layers    | 12                     | 24                      | 8                       |
| feed_forward_proj     | gated-gelu             | gated-gelu              | gated-gelu              |
| dense_act_fn          | gelu_new               | gelu_new                | gelu_new                |
| vocab_size            | 32128                  | 32128                   | 32128                   |
| tie_word_embeddings   | 0                      | 0                       | 0                       |
| torch_dtype           | float32                | float32                 | float32                 |
| _gin_batch_size       | 128                    | 64                      | 128                     |
| _gin_z_loss           | 0.0001                 | 0.0001                  | 0.0001                  |
| _gin_t5_config_dtype  | 'bfloat16'             | 'bfloat16'              | 'bfloat16'              |

## Evaluation results

See the evaluation section in the interactive [Pre-training Dutch T5 Models](https://huggingface.co/spaces/yhavinga/pre-training-dutch-t5-models) blog.

## Acknowledgements

This project would not have been possible without compute generously provided by Google through the
[TPU Research Cloud](https://sites.research.google/trc/).
Thanks to the [Finnish-NLP](https://huggingface.co/Finnish-NLP) authors for releasing their code for the UL2 objective and associated task definitions.
Thanks to [Stephenn Fernandes](https://huggingface.co/StephennFernandes) for helping me get started with the t5x framework.

Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)