Commit 2836b3f · readme update #4
Parent: fb7d5d2

README.md CHANGED
@@ -12,6 +12,8 @@ license: apache-2.0
 language:
 - ru
 - kk
+datasets:
+- issai/kazparc
 ---
 
 # kazRush-ru-kk
@@ -23,11 +25,7 @@ kazRush-ru-kk is a translation model for translating from Russian to Kazakh.
 
 ## Usage
 
-Using the model requires:
-
-```bash
-pip install numpy==1.26.4 torch~=2.2.2 transformers~=4.39.2 sentencepiece~=0.2.0
-```
+Using the model requires the `sentencepiece` library to be installed.
 
 After installing the necessary dependencies, the model can be run with the following code:
 
@@ -35,13 +33,14 @@ After installing the necessary dependencies, the model can be run with the following
 from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
 import torch
 
-model = AutoModelForSeq2SeqLM.from_pretrained('deepvk/KazRush-ru-kk')
+device = 'cuda'
+model = AutoModelForSeq2SeqLM.from_pretrained('deepvk/KazRush-ru-kk').to(device)
 tokenizer = AutoTokenizer.from_pretrained('deepvk/KazRush-ru-kk')
 
+@torch.inference_mode
 def generate(text, **kwargs):
-    inputs = tokenizer(text, return_tensors='pt').to(
-
-    hypotheses = model.generate(**inputs, num_beams=5, **kwargs)
+    inputs = tokenizer(text, return_tensors='pt').to(device)
+    hypotheses = model.generate(**inputs, num_beams=5, **kwargs)
     return tokenizer.decode(hypotheses[0], skip_special_tokens=True)
 
 print(generate("Как Кока-Кола может помочь автомобилисту?"))
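The next hunk's context picks up inside the README's `pipeline` example, of which only the output line is visible in this diff. For orientation, a minimal sketch of such a call is below; the task string `'translation'` and the Russian input sentence are assumptions, not taken from the README itself:

```python
# Hedged sketch of the pipeline wrapper the README mentions; the actual
# snippet sits outside the diff context. Task string and input are assumed.
from transformers import pipeline

translator = pipeline('translation', model='deepvk/KazRush-ru-kk')
print(translator('Мама мыла раму'))
# The diff's context line shows output of this shape:
# [{'translation_text': 'Анам жақтауды сабындады'}]
```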
@@ -56,32 +55,31 @@ You can also access the model via _pipeline_ wrapper:
 [{'translation_text': 'Анам жақтауды сабындады'}]
 ```
 
-## Training
-
-### Training Data
+## Data and Training
 
 This model was trained on the following data (Russian-Kazakh language pairs):
-[OPUS Corpora](<https://opus.nlpl.eu/results/ru&kk/corpus-result-table>)
-[kazparc](<https://huggingface.co/datasets/issai/kazparc>)
-[wmt19 dataset](<https://statmt.org/wmt19/translation-task.html#download>)
-
 
+| Dataset | Number of pairs |
+|---------|-----------------|
+| [OPUS Corpora](<https://opus.nlpl.eu/results/ru&kk/corpus-result-table>) | 718K |
+| [kazparc](<https://huggingface.co/datasets/issai/kazparc>) | 2,150K |
+| [wmt19 dataset](<https://statmt.org/wmt19/translation-task.html#download>) | 5,063K |
+| [TIL dataset](<https://github.com/turkic-interlingua/til-mt/tree/master/til_corpus>) | 4,403K |
 
 Preprocessing of the data included:
-
-
-
-
-
-
-#### Training
+1. deduplication;
+2. removing trash symbols, special tags, multiple whitespaces, etc. from the texts;
+3. removing texts that were not in Russian or Kazakh (language detection was done via [facebook/fasttext-language-identification](<https://huggingface.co/facebook/fasttext-language-identification>));
+4. removing pairs with a low alignment score (comparison was performed via [sentence-transformers/LaBSE](<https://huggingface.co/sentence-transformers/LaBSE>));
+5. filtering the data using the [opusfilter](<https://github.com/Helsinki-NLP/OpusFilter>) tools.
 
 The model was trained for 56 hours on 2 NVIDIA A100 80 GB GPUs.
 
 ## Evaluation
 
-Current model was compared to another open-source translation model, NLLB. We compared our model to all version of
-The metrics - BLEU, chrF and COMET - were calculated on `devtest` part of [FLORES+ evaluation benchmark](<https://github.com/openlanguagedata/flores>), most recent evaluation benchmark for multilingual machine translation.
+The current model was compared to another open-source translation model, [NLLB](<https://huggingface.co/docs/transformers/model_doc/nllb>). We compared it to all versions of NLLB, excluding nllb-moe-54b due to its size.
+The metrics - BLEU, chrF and COMET - were calculated on the `devtest` part of the [FLORES+ evaluation benchmark](<https://github.com/openlanguagedata/flores>), the most recent evaluation benchmark for multilingual machine translation.
+Calculation of BLEU and chrF follows the standard implementation from [sacreBLEU](<https://github.com/mjpost/sacrebleu>), and COMET is calculated using the default model described in the [COMET repository](<https://github.com/Unbabel/COMET>).
 
 | Model | Size | BLEU | chrF | COMET |
 |-------|------|------|------|-------|
@@ -89,7 +87,7 @@ The metrics - BLEU, chrF and COMET - were calculated on the `devtest` part of the [FLORE
 | [nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B) | 1.3B | 14.8 | 50.1 | 0.8819 |
 | [nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B) | 1.3B | 15.2 | 50.2 | 0.8843 |
 | [nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B) | 3.3B | 15.6 | 50.7 | **0.8891** |
-| [
+| [This model]() | 197M | **16.2** | **51.8** | 0.8836 |
 
 ## Examples of usage:
 
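The preprocessing list added in the hunk above names concrete tools for steps 3 and 4. As an illustration only - not the authors' actual scripts - a per-pair filter could look roughly like this; the `keep_pair` helper, the 0.5 similarity threshold, and the batching-free structure are all assumptions:

```python
# Illustrative sketch of preprocessing steps 3 and 4 described above; the
# authors' real pipeline, thresholds, and batching are not shown in the diff.
import fasttext
from huggingface_hub import hf_hub_download
from sentence_transformers import SentenceTransformer

# Step 3: language detection with facebook/fasttext-language-identification.
lid_path = hf_hub_download("facebook/fasttext-language-identification", "model.bin")
lid = fasttext.load_model(lid_path)

# Step 4: alignment scoring with LaBSE sentence embeddings.
labse = SentenceTransformer("sentence-transformers/LaBSE")

def keep_pair(ru: str, kk: str, min_sim: float = 0.5) -> bool:
    """Return True if the pair passes language and alignment checks.
    `min_sim` is an assumed threshold, not a documented value."""
    (ru_label,), _ = lid.predict(ru)
    (kk_label,), _ = lid.predict(kk)
    if not (ru_label.endswith("rus_Cyrl") and kk_label.endswith("kaz_Cyrl")):
        return False
    # Cosine similarity of normalized embeddings as the alignment score.
    emb = labse.encode([ru, kk], normalize_embeddings=True)
    return float(emb[0] @ emb[1]) >= min_sim

print(keep_pair("Мама мыла раму", "Анам жақтауды сабындады"))
```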
@@ -102,4 +100,16 @@ The metrics - BLEU, chrF and COMET - were calculated on the `devtest` part of the [FLORE
 
 >>> print(generate("Помогите мне удивить девушку"))
 Қызды таң қалдыруға көмектесіңіз
-```
+```
+
+## Citations
+
+```
+@misc{deepvk2024kazRushrukk,
+    title={kazRush-ru-kk: translation model from Russian to Kazakh},
+    author={Lebedeva, Anna and Sokolov, Andrey},
+    url={https://huggingface.co/deepvk/kazRush-ru-kk},
+    publisher={Hugging Face},
+    year={2024},
+}
+```
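The evaluation section above states that BLEU and chrF follow sacreBLEU and that COMET uses the default model from the COMET repository. A sketch of that scoring is below; `hyps`, `refs`, and `srcs` are placeholders, and `Unbabel/wmt22-comet-da` is assumed to be the default COMET checkpoint rather than confirmed by this page:

```python
# Hedged sketch of the metric computation the evaluation section describes.
import sacrebleu
from comet import download_model, load_from_checkpoint

hyps = ["Анам жақтауды сабындады"]   # model outputs (placeholder)
refs = ["Анам жақтауды сабындады"]   # FLORES+ devtest references (placeholder)
srcs = ["Мама мыла раму"]            # source sentences; COMET needs them

# BLEU and chrF via sacreBLEU's corpus-level convenience functions.
bleu = sacrebleu.corpus_bleu(hyps, [refs])
chrf = sacrebleu.corpus_chrf(hyps, [refs])
print(f"BLEU {bleu.score:.1f}  chrF {chrf.score:.1f}")

# COMET with an assumed default checkpoint; gpus=1 assumes a GPU is
# available (use gpus=0 to run on CPU).
comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(srcs, hyps, refs)]
print("COMET", comet_model.predict(data, batch_size=8, gpus=1).system_score)
```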