---
license: cc-by-4.0
---

# smugri3_14

smugri3_14 is the TartuNLP multilingual neural machine translation model for low-resource Finno-Ugric languages. It translates between 27 languages, which gives 702 translation directions.

### Languages Supported

- **High and Mid-Resource Languages:** Estonian, English, Finnish, Hungarian, Latvian, Norwegian, Russian
- **Low-Resource Finno-Ugric Languages:** Komi, Komi Permyak, Udmurt, Hill Mari, Meadow Mari, Erzya, Moksha, Proper Karelian, Livvi Karelian, Ludian, Võro, Veps, Livonian, Northern Sami, Southern Sami, Inari Sami, Lule Sami, Skolt Sami, Mansi, Khanty
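The figure of 702 directions follows from the 27 languages listed above: every ordered pair of distinct languages is one direction. The sketch below checks that arithmetic in plain shell. It assumes the language codes are the ones passed to `--langs` in the usage script of this README; the variable names `n` and `directions` are only illustrative.

```shell
# Language codes copied from the --langs argument of the usage script
nllb_langs="eng_Latn,est_Latn,fin_Latn,hun_Latn,lvs_Latn,nob_Latn,rus_Cyrl"
new_langs="kca_Cyrl,koi_Cyrl,kpv_Cyrl,krl_Latn,liv_Latn,lud_Latn,mdf_Cyrl,mhr_Cyrl,mns_Cyrl,mrj_Cyrl,myv_Cyrl,olo_Latn,sma_Latn,sme_Latn,smj_Latn,smn_Latn,sms_Latn,udm_Cyrl,vep_Latn,vro_Latn"

# Count the codes: put one code per line, then count the lines
n=$(( $(printf '%s\n' "${nllb_langs},${new_langs}" | tr ',' '\n' | wc -l) ))

# Each ordered pair of distinct languages is one translation direction
directions=$(( n * (n - 1) ))

echo "${n} languages, ${directions} directions"
# prints: 27 languages, 702 directions
```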

### Usage

The model can be tested in our [web demo](https://translate.ut.ee/).

To use this model for translation, you need [**Fairseq v0.12.2**](https://pypi.org/project/fairseq/0.12.2/).

Bash script example:

```
# Define source and target languages
src_lang="eng_Latn"
tgt_lang="kpv_Cyrl"

# Directories and paths
model_path=./smugri3_14-finno-ugric-nmt
checkpoint_path="${model_path}/smugri3_14.pt"
sp_path="${model_path}/flores200_sacrebleu_tokenizer_spm.ext.model"
dictionary_path="${model_path}/nllb_model_dict.ext.txt"

# Language settings for fairseq
nllb_langs="eng_Latn,est_Latn,fin_Latn,hun_Latn,lvs_Latn,nob_Latn,rus_Cyrl"
new_langs="kca_Cyrl,koi_Cyrl,kpv_Cyrl,krl_Latn,liv_Latn,lud_Latn,mdf_Cyrl,mhr_Cyrl,mns_Cyrl,mrj_Cyrl,myv_Cyrl,olo_Latn,sma_Latn,sme_Latn,smj_Latn,smn_Latn,sms_Latn,udm_Cyrl,vep_Latn,vro_Latn"

# Start fairseq-interactive; it reads source sentences from stdin, one per line
fairseq-interactive "${model_path}" \
  -s "${src_lang}" -t "${tgt_lang}" \
  --path "${checkpoint_path}" --max-tokens 20000 --buffer-size 1 \
  --beam 4 --lenpen 1.0 \
  --bpe sentencepiece \
  --remove-bpe \
  --lang-tok-style multilingual \
  --sentencepiece-model "${sp_path}" \
  --fixed-dictionary "${dictionary_path}" \
  --task translation_multi_simple_epoch \
  --decoder-langtok --encoder-langtok src \
  --lang-pairs "${src_lang}-${tgt_lang}" \
  --langs "${nllb_langs},${new_langs}" \
  --cpu
```
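For each input sentence, `fairseq-interactive` prints several tab-separated lines, including the source (`S-`), the raw hypothesis (`H-`) and the detokenized hypothesis (`D-`). The sketch below shows one possible way to keep only the translated text. It assumes that output layout; `extract_translations` is a hypothetical helper name and is not part of Fairseq, and the sample lines in the demo are made up.

```shell
# Hypothetical helper: read fairseq-interactive output on stdin and print
# only the text of the detokenized hypotheses.
# Assumes lines of the form "D-<id><TAB><score><TAB><text>".
extract_translations() {
  grep '^D-' | cut -f3-
}

# Demo on canned output (the sample lines are invented for illustration)
printf 'S-0\tsource text\nH-0\t-0.5\traw hypothesis\nD-0\t-0.5\tfinal translation\n' \
  | extract_translations
# prints: final translation
```

With the variables from the script above, the same filter could be appended to the command, for example `echo "Hello!" | fairseq-interactive ... | extract_translations`, where `...` stands for the arguments already shown.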

### Scores

Average scores by target language:

| Target language | BLEU | chrF | chrF++ |
| ------- | ----- | ---- | ------ |
| ru | 24.82 | 51.81 | 49.08 |
| en | 28.24 | 55.91 | 53.73 |
| et | 18.66 | 51.72 | 47.69 |
| fi | 15.45 | 50.04 | 45.38 |
| hun | 16.73 | 47.38 | 44.19 |
| lv | 18.15 | 49.04 | 45.54 |
| nob | 14.43 | 45.64 | 42.29 |
| kpv | 10.73 | 42.34 | 38.50 |
| liv | 5.16 | 29.95 | 27.28 |
| mdf | 5.27 | 37.66 | 32.99 |
| mhr | 8.51 | 43.42 | 38.76 |
| mns | 2.45 | 27.75 | 24.03 |
| mrj | 7.30 | 40.81 | 36.40 |
| myv | 4.72 | 38.74 | 33.80 |
| olo | 4.63 | 34.43 | 30.00 |
| udm | 7.50 | 40.07 | 35.72 |
| krl | 9.39 | 42.74 | 38.24 |
| vro | 8.64 | 39.89 | 35.97 |
| vep | 6.73 | 38.15 | 33.91 |
| lud | 3.11 | 31.50 | 27.30 |