---
license: cc-by-4.0
language:
- et
- fi
- kv
- hu
- lv
- 'no'
library_name: fairseq
metrics:
- bleu
- chrf
pipeline_tag: translation
---
# smugri3_14
smugri3_14 is the TartuNLP multilingual neural machine translation model for low-resource Finno-Ugric languages. It translates between 27 languages, covering 702 translation directions.
### Languages Supported
- **High and Mid-Resource Languages:** Estonian, English, Finnish, Hungarian, Latvian, Norwegian, Russian
- **Low-Resource Finno-Ugric Languages:** Komi, Komi Permyak, Udmurt, Hill Mari, Meadow Mari, Erzya, Moksha, Proper Karelian, Livvi Karelian, Ludian, Võro, Veps, Livonian, Northern Sami, Southern Sami, Inari Sami, Lule Sami, Skolt Sami, Mansi, Khanty
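As a quick sanity check on the numbers above, the short sketch below counts the language codes that the usage script passes to `--langs` and derives the number of directions, assuming every language can be translated into every other one. The variable names `all_langs`, `n`, and `directions` are introduced here for illustration only.

```bash
# Language codes as passed to --langs in the usage script
nllb_langs="eng_Latn,est_Latn,fin_Latn,hun_Latn,lvs_Latn,nob_Latn,rus_Cyrl"
new_langs="kca_Cyrl,koi_Cyrl,kpv_Cyrl,krl_Latn,liv_Latn,lud_Latn,mdf_Cyrl,mhr_Cyrl,mns_Cyrl,mrj_Cyrl,myv_Cyrl,olo_Latn,sma_Latn,sme_Latn,smj_Latn,smn_Latn,sms_Latn,udm_Cyrl,vep_Latn,vro_Latn"
all_langs="${nllb_langs},${new_langs}"

# Count the comma-separated codes (one code per line, non-empty lines only)
n=$(( $(printf '%s\n' "$all_langs" | tr ',' '\n' | grep -c .) ))

# Each language pairs with every other language, in both directions
directions=$(( n * (n - 1) ))

echo "${n} languages, ${directions} directions"
# prints "27 languages, 702 directions"
```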
### Usage
The model can be tested in our [web demo](https://translate.ut.ee/).
To run the model locally, install [**Fairseq v0.12.2**](https://pypi.org/project/fairseq/0.12.2/).
Bash script example:
```bash
# Define target and source languages
src_lang="eng_Latn"
tgt_lang="kpv_Cyrl"
# Directories and paths
model_path=./smugri3_14-finno-ugric-nmt
checkpoint_path=${model_path}/smugri3_14.pt
sp_path=${model_path}/flores200_sacrebleu_tokenizer_spm.ext.model
dictionary_path=${model_path}/nllb_model_dict.ext.txt
# Language settings for fairseq
nllb_langs="eng_Latn,est_Latn,fin_Latn,hun_Latn,lvs_Latn,nob_Latn,rus_Cyrl"
new_langs="kca_Cyrl,koi_Cyrl,kpv_Cyrl,krl_Latn,liv_Latn,lud_Latn,mdf_Cyrl,mhr_Cyrl,mns_Cyrl,mrj_Cyrl,myv_Cyrl,olo_Latn,sma_Latn,sme_Latn,smj_Latn,smn_Latn,sms_Latn,udm_Cyrl,vep_Latn,vro_Latn"
# Run fairseq-interactive; it reads source sentences from stdin, one per line
fairseq-interactive ${model_path} \
-s ${src_lang} -t ${tgt_lang} \
--path ${checkpoint_path} --max-tokens 20000 --buffer-size 1 \
--beam 4 --lenpen 1.0 \
--bpe sentencepiece \
--remove-bpe \
--lang-tok-style multilingual \
--sentencepiece-model ${sp_path} \
--fixed-dictionary ${dictionary_path} \
--task translation_multi_simple_epoch \
--decoder-langtok --encoder-langtok src \
--lang-pairs ${src_lang}-${tgt_lang} \
--langs "${nllb_langs},${new_langs}" \
--cpu
```
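`fairseq-interactive` prints several tab-separated lines per input sentence (source, raw hypothesis, detokenized hypothesis, token scores) mixed with log messages. For batch translation it can help to keep only the detokenized hypotheses. The helper below is a minimal sketch that assumes the `D-<id><TAB><score><TAB><text>` line format printed by Fairseq v0.12.2; the function name `extract_hypotheses` and the file names `input.txt` and `output.txt` are hypothetical.

```bash
# Keep only the detokenized hypotheses from fairseq-interactive output.
# Assumes lines of the form "D-<id><TAB><score><TAB><text>".
extract_hypotheses() {
  grep '^D-' | cut -f3
}

# Hypothetical usage: append the following to the fairseq-interactive
# command above to translate a file with one sentence per line:
#   < input.txt | extract_hypotheses > output.txt
```

With `--buffer-size 1`, as in the script above, sentences are processed one at a time, so the output lines should follow the input order.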
### Scores
Average scores per target language:
| Target language | BLEU | chrF | chrF++ |
| ------- | ----- | ---- | ------ |
| ru | 24.82 | 51.81 | 49.08 |
| en | 28.24 | 55.91 | 53.73 |
| et | 18.66 | 51.72 | 47.69 |
| fi | 15.45 | 50.04 | 45.38 |
| hun | 16.73 | 47.38 | 44.19 |
| lv | 18.15 | 49.04 | 45.54 |
| nob | 14.43 | 45.64 | 42.29 |
| kpv | 10.73 | 42.34 | 38.50 |
| liv | 5.16 | 29.95 | 27.28 |
| mdf | 5.27 | 37.66 | 32.99 |
| mhr | 8.51 | 43.42 | 38.76 |
| mns | 2.45 | 27.75 | 24.03 |
| mrj | 7.30 | 40.81 | 36.40 |
| myv | 4.72 | 38.74 | 33.80 |
| olo | 4.63 | 34.43 | 30.00 |
| udm | 7.50 | 40.07 | 35.72 |
| krl | 9.39 | 42.74 | 38.24 |
| vro | 8.64 | 39.89 | 35.97 |
| vep | 6.73 | 38.15 | 33.91 |
| lud | 3.11 | 31.50 | 27.30 |
[All direction scores](https://docs.google.com/spreadsheets/d/1H-hLAvIxJ5TbMmECZqza6G5jfAjh90pmJdszwajwHiI/).
Evaluated on the [Smugri Flores testset](https://huggingface.co/datasets/tartuNLP/smugri-flores-testset).