AlexTheSun committed
Commit bed2078 · 1 parent: a237630

Update README.md with usage example

README.md CHANGED
@@ -1,12 +1,45 @@
 ---
 license: cc-by-4.0
 ---
-
+# smugri3_14
 The TartuNLP Multilingual Neural Machine Translation model for low-resource Finno-Ugric languages. The model can translate in 702 directions, between 27 languages.
 
-
+### Languages Supported
 - **High and Mid-Resource Languages:** Estonian, English, Finnish, Hungarian, Latvian, Norwegian, Russian
 - **Low-Resource Finno-Ugric Languages:** Komi, Komi Permyak, Udmurt, Hill Mari, Meadow Mari, Erzya, Moksha, Proper Karelian, Livvi Karelian, Ludian, Võro, Veps, Livonian, Northern Sami, Southern Sami, Inari Sami, Lule Sami, Skolt Sami, Mansi, Khanty
 
-
-To use this model for translation tasks, you will need to utilize the Fairseq v0.12.2.
+### Usage
+To use this model for translation tasks, you will need to utilize the [**Fairseq v0.12.2**](https://pypi.org/project/fairseq/0.12.2/).
+
+Bash script example:
+```
+# Define target and source languages
+src_lang="eng_Latn"
+tgt_lang="kpv_Cyrl"
+
+# Directories and paths
+model_path=./smugri3_14-finno-ugric-nmt
+checkpoint_path=${model_path}/smugri3_14.pt
+sp_path=${model_path}/flores200_sacrebleu_tokenizer_spm.ext.model
+dictionary_path=${model_path}/nllb_model_dict.ext.txt
+
+# Language settings for fairseq
+nllb_langs="eng_Latn,est_Latn,fin_Latn,hun_Latn,lvs_Latn,nob_Latn,rus_Cyrl"
+new_langs="kca_Cyrl,koi_Cyrl,kpv_Cyrl,krl_Latn,liv_Latn,lud_Latn,mdf_Cyrl,mhr_Cyrl,mns_Cyrl,mrj_Cyrl,myv_Cyrl,olo_Latn,sma_Latn,sme_Latn,smj_Latn,smn_Latn,sms_Latn,udm_Cyrl,vep_Latn,vro_Latn"
+
+# Start fairseq-interactive in interactive mode
+fairseq-interactive ${model_path} \
+-s ${src_lang} -t ${tgt_lang} \
+--path ${checkpoint_path} --max-tokens 20000 --buffer-size 1 \
+--beam 4 --lenpen 1.0 \
+--bpe sentencepiece \
+--remove-bpe \
+--lang-tok-style multilingual \
+--sentencepiece-model ${sp_path} \
+--fixed-dictionary ${dictionary_path} \
+--task translation_multi_simple_epoch \
+--decoder-langtok --encoder-langtok src \
+--lang-pairs ${src_lang}-${tgt_lang} \
+--langs "${nllb_langs},${new_langs}" \
+--cpu
+```
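The README script takes `src_lang` and `tgt_lang` as free-form strings, so a mistyped code would only surface once fairseq starts. A minimal sketch of a pre-flight check follows, assuming the two language lists are copied unchanged from the script in this commit. The helper names `is_supported_lang` and `count_directions` are hypothetical and not part of the repository or of fairseq. The second helper also cross-checks the "702 directions, between 27 languages" claim, since 27 languages give 27 × 26 ordered pairs.

```shell
#!/usr/bin/env bash
# Language lists copied from the README script in this commit.
nllb_langs="eng_Latn,est_Latn,fin_Latn,hun_Latn,lvs_Latn,nob_Latn,rus_Cyrl"
new_langs="kca_Cyrl,koi_Cyrl,kpv_Cyrl,krl_Latn,liv_Latn,lud_Latn,mdf_Cyrl,mhr_Cyrl,mns_Cyrl,mrj_Cyrl,myv_Cyrl,olo_Latn,sma_Latn,sme_Latn,smj_Latn,smn_Latn,sms_Latn,udm_Cyrl,vep_Latn,vro_Latn"
all_langs="${nllb_langs},${new_langs}"

# Hypothetical helper: succeeds (exit status 0) if $1 is one of the codes
# in the comma-separated list, fails (exit status 1) otherwise.
is_supported_lang() {
  case ",${all_langs}," in
    *",$1,"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: prints the number of ordered (source, target) pairs,
# i.e. n * (n - 1) for n language codes.
count_directions() {
  local IFS=','
  local -a codes
  read -r -a codes <<< "${all_langs}"
  local n=${#codes[@]}
  echo $(( n * (n - 1) ))
}

count_directions   # prints 702
```

One way to use it is to place `is_supported_lang "${src_lang}" || exit 1` (and the same for `tgt_lang`) just before the `fairseq-interactive` call, so the script stops early on an unknown code.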