---
language:
- ru
- uk
- be
- kk
- az
- hy
- ka
- he
- en
- de
- multilingual
tags:
- language classification
datasets:
- open_subtitles
- tatoeba
- oscar
---

# RoBERTa for Single Language Classification

## Training

RoBERTa fine-tuned on small subsets of the Open Subtitles, OSCAR and Tatoeba datasets (~9k samples per language).


| data source     | languages      |
|-----------------|----------------|
| open_subtitles  | ka, he, en, de |
| oscar           | be, kk, az, hy |
| tatoeba         | ru, uk         |
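
The exact data preparation is not documented here, so the following is only a minimal sketch of how a pool of labelled sentences could be shuffled and split into training and validation parts. The names `SOURCES`, `split_pool` and `pool`, the 90/10 ratio, the fixed seed and the dummy sentences are all assumptions made for illustration. The language codes follow the metadata and the validation table (`hy` for Armenian).

```python
import random
from collections import Counter

# Source-to-language mapping as listed in the table above (illustrative constant).
SOURCES = {
    "open_subtitles": ["ka", "he", "en", "de"],
    "oscar": ["be", "kk", "az", "hy"],
    "tatoeba": ["ru", "uk"],
}


def split_pool(samples, val_fraction=0.1, seed=0):
    """Shuffle (text, label) pairs and split them into train and validation lists."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# Placeholder pool: 100 dummy sentences per language stand in for the real data.
pool = [
    (f"{lang} sentence {i}", lang)
    for langs in SOURCES.values()
    for lang in langs
    for i in range(100)
]

train, val = split_pool(pool)
print(len(train), len(val))
print(Counter(label for _, label in val))
```

With a pooled random split like this, the per-language counts in the validation part are close to, but not exactly, one tenth of each language's pool.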

## Validation

The metrics below were obtained by validating on a separate part of the datasets (~1k samples per language).

|index|class|f1-score|precision|recall|support|
|---|---|---|---|---|---|
|0|az|0.998|0.997|1.0|997|
|1|be|0.996|0.998|0.994|1004|
|2|de|0.976|0.966|0.987|979|
|3|en|0.976|0.986|0.967|1020|
|4|he|1.0|1.0|0.999|1001|
|5|hy|0.994|0.991|0.998|993|
|6|ka|0.999|0.999|0.999|1000|
|7|kk|0.996|0.998|0.993|1005|
|8|uk|0.982|0.997|0.968|1030|
|9|ru|0.982|0.968|0.997|971|
|10|macro avg|0.99|0.99|0.99|10000|
|11|weighted avg|0.99|0.99|0.99|10000|
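
As a quick consistency check, the averaged rows can be recomputed from the per-class precision, recall and support values. This is a sketch in plain Python, not the evaluation code that produced the table. The `ROWS` list is copied from the table by hand, and `f1` is a helper defined here for illustration.

```python
# (class, precision, recall, support) copied from the validation table.
ROWS = [
    ("az", 0.997, 1.0, 997),
    ("be", 0.998, 0.994, 1004),
    ("de", 0.966, 0.987, 979),
    ("en", 0.986, 0.967, 1020),
    ("he", 1.0, 0.999, 1001),
    ("hy", 0.991, 0.998, 993),
    ("ka", 0.999, 0.999, 1000),
    ("kk", 0.998, 0.993, 1005),
    ("uk", 0.997, 0.968, 1030),
    ("ru", 0.968, 0.997, 971),
]


def f1(precision, recall):
    """F1 score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1_by_class = {name: f1(p, r) for name, p, r, _ in ROWS}
total_support = sum(s for *_, s in ROWS)

# Macro average: unweighted mean over classes.
macro_f1 = sum(f1_by_class.values()) / len(ROWS)
# Weighted average: mean over classes weighted by support.
weighted_f1 = sum(f1_by_class[name] * s for name, _, _, s in ROWS) / total_support

print(f"macro F1 = {macro_f1:.2f}, weighted F1 = {weighted_f1:.2f}, support = {total_support}")
# prints: macro F1 = 0.99, weighted F1 = 0.99, support = 10000
```

Because the table lists precision and recall rounded to three decimals, a recomputed per-class F1 can differ from the reported value in the last digit (for example for `he` and `kk`).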