---
license: cc-by-4.0
language:
- et
- fi
- kv
- hu
- lv
- 'no'
library_name: fairseq
metrics:
- bleu
- chrf
pipeline_tag: translation
---
# smugri3_14
The TartuNLP Multilingual Neural Machine Translation model for low-resource Finno-Ugric languages. It translates between 27 languages in all 702 directions (27 × 26 ordered language pairs).

### Languages Supported
- **High and Mid-Resource Languages:** Estonian, English, Finnish, Hungarian, Latvian, Norwegian, Russian
- **Low-Resource Finno-Ugric Languages:** Komi, Komi Permyak, Udmurt, Hill Mari, Meadow Mari, Erzya, Moksha, Proper Karelian, Livvi Karelian, Ludian, Võro, Veps, Livonian, Northern Sami, Southern Sami, Inari Sami, Lule Sami, Skolt Sami, Mansi, Khanty

### Usage
The model can be tested in our [web demo](https://translate.ut.ee/).


To use this model for translation, you will need [**Fairseq v0.12.2**](https://pypi.org/project/fairseq/0.12.2/).
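The pinned release can be installed from PyPI; a minimal setup sketch (the `sentencepiece` package is assumed to be needed by the `--bpe sentencepiece` option used below):

```bash
# Pin the exact Fairseq release this model card targets
pip install fairseq==0.12.2

# SentencePiece backs the --bpe sentencepiece / --sentencepiece-model options
pip install sentencepiece
```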

Bash script example:
```bash
# Define target and source languages
src_lang="eng_Latn"
tgt_lang="kpv_Cyrl"

# Directories and paths
model_path=./smugri3_14-finno-ugric-nmt
checkpoint_path=${model_path}/smugri3_14.pt
sp_path=${model_path}/flores200_sacrebleu_tokenizer_spm.ext.model
dictionary_path=${model_path}/nllb_model_dict.ext.txt

# Language settings for fairseq
nllb_langs="eng_Latn,est_Latn,fin_Latn,hun_Latn,lvs_Latn,nob_Latn,rus_Cyrl"
new_langs="kca_Cyrl,koi_Cyrl,kpv_Cyrl,krl_Latn,liv_Latn,lud_Latn,mdf_Cyrl,mhr_Cyrl,mns_Cyrl,mrj_Cyrl,myv_Cyrl,olo_Latn,sma_Latn,sme_Latn,smj_Latn,smn_Latn,sms_Latn,udm_Cyrl,vep_Latn,vro_Latn"

# Run fairseq-interactive; it reads source sentences from stdin
fairseq-interactive ${model_path} \
  -s ${src_lang} -t ${tgt_lang} \
  --path ${checkpoint_path} --max-tokens 20000 --buffer-size 1 \
  --beam 4 --lenpen 1.0 \
  --bpe sentencepiece \
  --remove-bpe \
  --lang-tok-style multilingual \
  --sentencepiece-model ${sp_path} \
  --fixed-dictionary ${dictionary_path} \
  --task translation_multi_simple_epoch \
  --decoder-langtok --encoder-langtok src \
  --lang-pairs ${src_lang}-${tgt_lang} \
  --langs "${nllb_langs},${new_langs}" \
  --cpu
```
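The script reads source sentences from standard input. For a quick non-interactive test, save it as `translate.sh` (a hypothetical name) and pipe a sentence through it; fairseq-interactive prints the source on `S-` lines and the detokenized translation on `D-` lines:

```bash
# Hypothetical one-off translation: pipe one English sentence through the
# script above; the D- line of the output holds the detokenized translation.
echo "Hello, how are you?" | bash translate.sh
```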

### Scores
Average scores over all directions into each target language:
| Target language | BLEU  | chrF  | chrF++  |
| --------------- | ----- | ----- | ------- |
| ru      | 24.82 | 51.81 | 49.08   |
| en      | 28.24 | 55.91 | 53.73   |
| et      | 18.66 | 51.72 | 47.69   |
| fi      | 15.45 | 50.04 | 45.38   |
| hun     | 16.73 | 47.38 | 44.19   |
| lv      | 18.15 | 49.04 | 45.54   |
| nob     | 14.43 | 45.64 | 42.29   |
| kpv     | 10.73 | 42.34 | 38.50   |
| liv     | 5.16  | 29.95 | 27.28   |
| mdf     | 5.27  | 37.66 | 32.99   |
| mhr     | 8.51  | 43.42 | 38.76   |
| mns     | 2.45  | 27.75 | 24.03   |
| mrj     | 7.30  | 40.81 | 36.40   |
| myv     | 4.72  | 38.74 | 33.80   |
| olo     | 4.63  | 34.43 | 30.00   |
| udm     | 7.50  | 40.07 | 35.72   |
| krl     | 9.39  | 42.74 | 38.24   |
| vro     | 8.64  | 39.89 | 35.97   |
| vep     | 6.73  | 38.15 | 33.91   |
| lud     | 3.11  | 31.50 | 27.30   |

Scores for all translation directions are listed in [this spreadsheet](https://docs.google.com/spreadsheets/d/1H-hLAvIxJ5TbMmECZqza6G5jfAjh90pmJdszwajwHiI/).

Evaluated on the [Smugri Flores testset](https://huggingface.co/datasets/tartuNLP/smugri-flores-testset).
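For reference, BLEU and chrF(++) numbers of this kind are typically computed with [sacreBLEU](https://github.com/mjpost/sacrebleu); a minimal sketch, assuming detokenized hypotheses in `hyp.txt` and references in `ref.txt` (both hypothetical file names):

```bash
# BLEU and chrF
sacrebleu ref.txt -i hyp.txt -m bleu chrf

# chrF++ is chrF with word n-grams up to order 2
sacrebleu ref.txt -i hyp.txt -m chrf --chrf-word-order 2
```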