---
language: ro
license: apache-2.0
tags:
- romanian
- seq2seq
- t5
datasets: dumitrescustefan/diacritic
inference: true
---

This model is a fine-tuned version of the [mt5-base-romanian](https://huggingface.co/dumitrescustefan/mt5-base-romanian) base model (**390M** parameters) that restores diacritics in Romanian text.

The model was fine-tuned on the [Romanian diacritics dataset](https://huggingface.co/datasets/dumitrescustefan/diacritic) for 150k steps with a batch size of 8. Both the encoder and the decoder sequence lengths are 256. It was trained with [these scripts](https://github.com/iliemihai/t5x_diacritics).

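
To put these settings in perspective, the short calculation below derives how many training examples the fine-tuning run processed and, at most, how many tokens that corresponds to. It uses only the figures stated above. The token count is an upper bound that assumes every sequence fills all 256 positions, which is an assumption and not a reported statistic.

```python
# Figures taken from the description above.
steps = 150_000
batch_size = 8
max_seq_len = 256  # same limit for the encoder and the decoder

# Examples processed, counting repeats if the dataset was seen more than once.
examples_processed = steps * batch_size

# Upper bound only: assumes every sequence uses all 256 positions.
max_tokens_per_side = examples_processed * max_seq_len

print(f"examples processed: {examples_processed:,}")
print(f"max tokens per side: {max_tokens_per_side:,}")
```
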
### How to load the fine-tuned mt5x model

```python
from transformers import MT5ForConditionalGeneration, T5Tokenizer

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub.
model = MT5ForConditionalGeneration.from_pretrained('iliemihai/mt5-base-romanian-diacritics')
tokenizer = T5Tokenizer.from_pretrained('iliemihai/mt5-base-romanian-diacritics')

# Input text without diacritics; inputs longer than 256 tokens are truncated.
input_text = "A inceput sa ii taie un fir de par, iar fata sta in fata, tine camasa de in in mana si canta nota SI."
inputs = tokenizer(input_text, max_length=256, truncation=True, return_tensors="pt")

# Allow up to 256 output tokens (the decoder length used in fine-tuning);
# without max_length, generate() may fall back to a much shorter default.
outputs = model.generate(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], max_length=256)
output = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(output)
# This will print:
# "A început să îi taie un fir de păr, iar fata stă în față, ține cămașa de in în mână și cântă nota SI"
```

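
Because the loading example truncates inputs at 256 tokens, longer texts would lose their tail. A minimal sketch of one way to handle this is to split the text into sentence-aligned chunks before calling the model. The helper `split_into_chunks` and its parameters are hypothetical and not part of the released scripts; by default it counts whitespace-separated words as a rough stand-in for tokens.

```python
import re
from typing import Callable, List


def split_into_chunks(
    text: str,
    max_units: int = 256,
    count_units: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Greedily pack whole sentences into chunks of at most `max_units`.

    Hypothetical helper: `count_units` defaults to a word count, which only
    approximates the tokenizer's token count.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and count_units(candidate) > max_units:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    # Note: a single sentence longer than `max_units` is kept as one chunk
    # and would still be truncated by the tokenizer.
    return chunks


demo = split_into_chunks("Ana are mere. Ion are pere. Maria canta.", max_units=6)
print(demo)
```

With the tokenizer from the loading example, a token-based counter could be passed instead, for instance `count_units=lambda s: len(tokenizer(s)["input_ids"])`. Each chunk can then be restored separately and the outputs joined with spaces.
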
### Evaluation

Evaluation results are not yet available and will be added soon.

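
Until official results are published, readers who want a rough measure on their own data could strip the diacritics from correctly written Romanian text, run the model on the result, and compare its output with the original word by word. The sketch below is an illustration under that assumption, not the authors' evaluation protocol; `strip_diacritics` and `word_accuracy` are hypothetical helper names.

```python
import unicodedata
from typing import List


def strip_diacritics(text: str) -> str:
    """Remove combining marks (e.g. 'să' -> 'sa') to build model inputs from gold text."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def word_accuracy(predictions: List[str], references: List[str]) -> float:
    """Fraction of reference words reproduced exactly at the same position."""
    correct = 0
    total = 0
    for pred, ref in zip(predictions, references):
        pred_words, ref_words = pred.split(), ref.split()
        total += len(ref_words)
        # Words missing from the prediction count as errors;
        # extra predicted words beyond the reference length are ignored.
        correct += sum(p == r for p, r in zip(pred_words, ref_words))
    return correct / total if total else 0.0


gold = "A început să îi taie un fir de păr"
print(strip_diacritics(gold))
print(word_accuracy(["fata stă în fata"], ["fata stă în față"]))
```

Word-level accuracy is only one possible choice; a character-level score restricted to letters that can carry diacritics would be a stricter alternative.
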
### Acknowledgements

We'd like to thank [TPU Research Cloud](https://sites.research.google/trc/about/) for providing the TPUv3 cores we used to train these models!

### Authors

Yours truly,

_[Stefan Dumitrescu](https://github.com/dumitrescustefan), [Mihai Ilie](https://github.com/iliemihai) and [Per Egil Kummervold](https://huggingface.co/north)_