---
language:
- ru
license: apache-2.0
---

# FRED-T5 large 800M (Full-scale Russian Enhanced Denoisers T5)

The architecture is based on T5.

The model has 24 layers and a hidden size of 1024. More details are in config.json.

The model was trained on a mixture of 7 denoisers, similar to UL2 (https://arxiv.org/abs/2205.05131) but with several differences.

It was trained on a 300 GB Russian-language corpus. The dataset is the same as the one used for the ruT5 models.

The tokenizer is BBPE (byte-level BPE) with a vocabulary of 50,257 tokens plus 107 special tokens. Prefix tokens: '\<LM\>', '\<SC1\>', ... '\<SC6\>'.
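
To illustrate how one prefix token per denoiser can tag a training example, here is a minimal sketch in plain Python. It is not the FRED-T5 training code. Several things here are assumptions made for illustration only: the mapping of '<LM>' to a prefix-LM split, the per-prefix span lengths and corruption rates in `SC_SETTINGS`, and the T5-style `<extra_id_N>` sentinel names.

```python
import random

# The seven prefixes named in the card: one <LM> and six <SC*> tokens.
PREFIXES = ["<LM>"] + [f"<SC{i}>" for i in range(1, 7)]

# Hypothetical (mean span length, corruption rate) per span-corruption prefix.
# Placeholder values, NOT the settings used to train FRED-T5.
SC_SETTINGS = {
    "<SC1>": (3, 0.15), "<SC2>": (8, 0.15), "<SC3>": (3, 0.50),
    "<SC4>": (8, 0.50), "<SC5>": (64, 0.15), "<SC6>": (64, 0.50),
}


def span_corrupt(tokens, mean_span, rate, rng):
    """Replace random spans with sentinels; return (inputs, targets)."""
    inputs, targets = [], []
    i, sentinel = 0, 0
    while i < len(tokens):
        # Start a span with probability rate / mean_span at each position,
        # so roughly `rate` of all tokens end up masked.
        if rng.random() < rate / mean_span:
            length = round(rng.expovariate(1 / mean_span))
            length = max(1, min(len(tokens) - i, length))
            mark = f"<extra_id_{sentinel}>"  # assumed T5-style sentinel name
            inputs.append(mark)
            targets.append(mark)
            targets.extend(tokens[i:i + length])
            i += length
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets


def make_example(tokens, prefix, rng):
    """Build one (inputs, targets) pair tagged with the given prefix token."""
    if prefix == "<LM>":
        # Assumed behaviour: split the text and predict the continuation.
        split = rng.randint(1, len(tokens) - 1)
        return [prefix] + tokens[:split], tokens[split:]
    mean_span, rate = SC_SETTINGS[prefix]
    inputs, targets = span_corrupt(tokens, mean_span, rate, rng)
    return [prefix] + inputs, targets
```

During training, a prefix would be sampled for each example (for instance with `rng.choice(PREFIXES)`), so the model can learn to associate each prefix with its denoising behaviour.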

For the first half of training, the model was trained on a small part of the dataset (1%, 3 GB) and without a prefix for any task.
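
The two-phase schedule can be sketched as a small helper. This is only an illustration: it measures "half of the time" in training steps, and it assumes the second half used the full corpus with prefixes, which the description implies but does not state. The names `stage_for_step`, `small_subset` and `full_corpus` are hypothetical.

```python
def stage_for_step(step, total_steps):
    """Return (data_source, use_prefixes) for a given training step."""
    if step < total_steps // 2:
        # First half: about 1% of the corpus (3 GB), no prefix tokens.
        return "small_subset", False
    # Second half (assumed): the full corpus, with prefix tokens.
    return "full_corpus", True
```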

For RSG (Russian SuperGLUE), we fine-tuned as described in the T5 paper. First, we trained a multitask model on all tasks. Then, for each task, we took the best checkpoint and trained it further.

The RSG submission is here: https://russiansuperglue.com/login/submit_info/1936
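
The checkpoint selection step of this procedure can be sketched as follows. The helper `best_checkpoint_per_task` and the checkpoint names are hypothetical, and the scores are made-up placeholders, not real RSG results. The sketch also assumes that a higher metric is better for every task.

```python
def best_checkpoint_per_task(scores):
    """Map each task to the multitask checkpoint with the highest score.

    scores: {checkpoint_name: {task_name: validation_metric}}
    """
    best = {}
    for ckpt, per_task in scores.items():
        for task, metric in per_task.items():
            if task not in best or metric > scores[best[task]][task]:
                best[task] = ckpt
    return best


# Made-up validation scores for two multitask checkpoints.
example_scores = {
    "multitask_step_1000": {"TERRa": 0.80, "DaNetQA": 0.70},
    "multitask_step_2000": {"TERRa": 0.78, "DaNetQA": 0.75},
}
selected = best_checkpoint_per_task(example_scores)
```

Each selected checkpoint would then be trained further on its own task only.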

Total training time was around 35 days on 160 V100 GPUs.
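
As a rough conversion of these figures into GPU time, assuming all 160 GPUs were in use for the whole 35 days (the card only says "around"):

```python
days = 35
gpus = 160

gpu_days = days * gpus    # 5,600 GPU-days
gpu_hours = gpu_days * 24  # 134,400 GPU-hours
```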

We will release the checkpoint to the public soon.