---
language: ro
inference: false
license: apache-2.0
---

This is a pretrained [mT5](https://github.com/google-research/multilingual-t5) large model (**973M** parameters) for Romanian.

The model was trained with the span corruption task on a clean 80GB Romanian text corpus for 4M total steps with these [scripts](https://github.com/dumitrescustefan/t5x_models), starting from the public 1M-step mt5x-large checkpoint. Encoder and decoder sequence lengths were both 512, and the vocabulary is the same mt5x vocabulary used by the 1M-step multilingual checkpoint.
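
To make the span corruption objective concrete, here is a minimal sketch of how an input/target pair is formed. It assumes the sentinel naming used by the Hugging Face T5 tokenizers (`<extra_id_0>`, `<extra_id_1>`, ...) and uses hand-picked spans over whitespace-separated words. The `span_corrupt` helper is hypothetical and for illustration only; it is not the t5x preprocessing used for training, which samples spans randomly over subword tokens.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel and build the target.

    `spans` must be sorted and non-overlapping; `end` is exclusive.
    Returns (corrupted_input, target) as space-joined strings.
    """
    corrupted, target = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        corrupted.extend(tokens[cursor:start])  # keep text before the span
        corrupted.append(sentinel)              # mask the span in the input
        target.append(sentinel)                 # target: sentinel + dropped text
        target.extend(tokens[start:end])
        cursor = end
    corrupted.extend(tokens[cursor:])           # keep text after the last span
    target.append(f"<extra_id_{len(spans)}>")   # closing sentinel
    return " ".join(corrupted), " ".join(target)


tokens = "Acesta este un test simplu de corupere".split()
model_input, model_target = span_corrupt(tokens, [(1, 2), (4, 6)])
```

Here `model_input` is `Acesta <extra_id_0> un test <extra_id_1> corupere` and `model_target` is `<extra_id_0> este <extra_id_1> simplu de <extra_id_2>`.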

**!! IMPORTANT !!** This model was pretrained only on the span corruption MLM task, so it is **not usable** for any downstream task **without fine-tuning** first!

### How to load an mt5x model

```python
from transformers import MT5Model, T5Tokenizer

# Load the pretrained encoder-decoder (no task head) and its tokenizer
model = MT5Model.from_pretrained('dumitrescustefan/mt5-large-romanian')
tokenizer = T5Tokenizer.from_pretrained('dumitrescustefan/mt5-large-romanian')

input_text = "Acesta este un test."
target_text = "Acesta este"

# Tokenize the encoder input and the decoder target
inputs = tokenizer(input_text, return_tensors="pt")
labels = tokenizer(text_target=target_text, return_tensors="pt")

# Forward pass: returns the decoder's last hidden states
outputs = model(input_ids=inputs["input_ids"], decoder_input_ids=labels["input_ids"])
hidden_states = outputs.last_hidden_state
print(hidden_states.shape)  # this will print [1, 4, 1024]
```
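
Since the checkpoint has to be fine-tuned before use, a single training step might look like the sketch below. `MT5Model` has no language-modeling head, so fine-tuning for text generation normally goes through `MT5ForConditionalGeneration`. To keep the sketch runnable without downloading weights, it builds a tiny randomly initialised model from a made-up `MT5Config` and feeds it random token ids; both are stand-ins, and the learning rate is an arbitrary assumption.

```python
import torch
from transformers import MT5Config, MT5ForConditionalGeneration

# Tiny stand-in model (hypothetical sizes), NOT the real checkpoint.
config = MT5Config(
    vocab_size=128, d_model=32, d_kv=8, d_ff=64,
    num_layers=1, num_decoder_layers=1, num_heads=4,
)
model = MT5ForConditionalGeneration(config)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # assumed lr

# Random token ids standing in for tokenizer output (batch of 2).
# Ids start at 2 to avoid the pad (0) and EOS (1) ids.
input_ids = torch.randint(2, config.vocab_size, (2, 10))
labels = torch.randint(2, config.vocab_size, (2, 6))

model.train()
# Passing `labels` lets the model derive the decoder inputs itself
# and return the cross-entropy loss.
loss = model(input_ids=input_ids, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
loss_value = loss.item()
```

For real fine-tuning, replace the stand-in model with `MT5ForConditionalGeneration.from_pretrained('dumitrescustefan/mt5-large-romanian')` and the random ids with tokenized input/target pairs from your downstream task.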

Remember to always sanitize your text! Replace the cedilla letters ``ş`` and ``ţ`` with their comma-below counterparts ``ș`` and ``ț``:

```python
text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")
```

The model was **not** trained on the cedilla letters ``ş`` and ``ţ``. If you skip this step, performance will drop because of ``<UNK>`` tokens and a higher number of tokens per word.
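
Because the cedilla and comma-below letters look nearly identical on screen, it can help to spell out the code points. The sketch below is an equivalent variant of the replacement above using `str.translate`; the helper name `sanitize` is an illustrative choice, not part of any library.

```python
# Map cedilla letters to their comma-below counterparts by code point.
_CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # s with cedilla -> s with comma below
    "\u0163": "\u021b",  # t with cedilla -> t with comma below
    "\u015e": "\u0218",  # S with cedilla -> S with comma below
    "\u0162": "\u021a",  # T with cedilla -> T with comma below
})


def sanitize(text: str) -> str:
    """Return `text` with cedilla letters replaced by comma-below letters."""
    return text.translate(_CEDILLA_TO_COMMA)


clean = sanitize("\u015ei \u0163ar\u0103")  # cedilla input
```

Calling `sanitize` on every string before tokenization keeps the preprocessing in one place.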

### Acknowledgements

We'd like to thank [TPU Research Cloud](https://sites.research.google/trc/about/) for providing the TPUv4 cores we used to train these models!

### Authors

Yours truly,

_[Stefan Dumitrescu](https://github.com/dumitrescustefan), [Mihai Ilie](https://github.com/iliemihai) and [Per Egil Kummervold](https://huggingface.co/north)_