Kartoshkina committed
Commit 2836b3f · 1 Parent(s): fb7d5d2

readme update #4

Files changed (1): README.md (+36 −26)

README.md (updated):
license: apache-2.0
language:
- ru
- kk
datasets:
- issai/kazparc
---

# kazRush-ru-kk

kazRush-ru-kk is a translation model for translating from Russian to Kazakh.

## Usage

Using the model requires the `sentencepiece` library to be installed.
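
For example, a minimal unpinned install (the previous revision of this card pinned `numpy==1.26.4 torch~=2.2.2 transformers~=4.39.2 sentencepiece~=0.2.0`, so pin versions if you need exact reproducibility):

```bash
pip install transformers sentencepiece torch
```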

After installing the necessary dependencies, the model can be run with the following code:
 
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

device = 'cuda'  # set to 'cpu' if no GPU is available
model = AutoModelForSeq2SeqLM.from_pretrained('deepvk/KazRush-ru-kk').to(device)
tokenizer = AutoTokenizer.from_pretrained('deepvk/KazRush-ru-kk')

@torch.inference_mode()
def generate(text, **kwargs):
    inputs = tokenizer(text, return_tensors='pt').to(device)
    hypotheses = model.generate(**inputs, num_beams=5, **kwargs)
    return tokenizer.decode(hypotheses[0], skip_special_tokens=True)

print(generate("Как Кока-Кола может помочь автомобилисту?"))  # "How can Coca-Cola help a motorist?"
```
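
Since `generate` forwards extra keyword arguments to `model.generate`, decoding can be tuned per call. The argument below is illustrative, not a recommendation from the card:

```python
# Cap the output length; num_beams is already fixed to 5 inside generate().
print(generate("Как Кока-Кола может помочь автомобилисту?", max_new_tokens=64))
```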

You can also access the model via the _pipeline_ wrapper:
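
The wrapper snippet itself is not shown in this hunk; a minimal sketch, assuming the `translation` task string and the classic primer sentence "Мама мыла раму" ("Mom washed the frame") as input, both of which are assumptions:

```python
from transformers import pipeline

# Hypothetical reconstruction; the exact call is not part of this diff.
pipe = pipeline('translation', model='deepvk/KazRush-ru-kk')
print(pipe("Мама мыла раму"))
```

This should print the output shown in the card: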
```
[{'translation_text': 'Анам жақтауды сабындады'}]
```

## Data and Training

This model was trained on the following data (Russian-Kazakh language pairs):

| Dataset | Number of pairs |
|---------|-----------------|
| [OPUS Corpora](https://opus.nlpl.eu/results/ru&kk/corpus-result-table) | 718K |
| [kazparc](https://huggingface.co/datasets/issai/kazparc) | 2,150K |
| [wmt19 dataset](https://statmt.org/wmt19/translation-task.html#download) | 5,063K |
| [TIL dataset](https://github.com/turkic-interlingua/til-mt/tree/master/til_corpus) | 4,403K |

Preprocessing of the data included:
1. deduplication
2. removing junk symbols, special tags, multiple whitespaces, etc. from texts
3. removing texts that were not in Russian or Kazakh (language detection was done with [facebook/fasttext-language-identification](https://huggingface.co/facebook/fasttext-language-identification))
4. removing pairs that had a low alignment score (comparison was performed via [sentence-transformers/LaBSE](https://huggingface.co/sentence-transformers/LaBSE); steps 3 and 4 are sketched after this list)
5. filtering the data using [opusfilter](https://github.com/Helsinki-NLP/OpusFilter) tools
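
A minimal sketch of the language-detection and alignment filters from steps 3 and 4; the `keep_pair` helper and the 0.7 similarity threshold are illustrative assumptions, not the authors' published pipeline:

```python
import fasttext
from huggingface_hub import hf_hub_download
from sentence_transformers import SentenceTransformer

# fastText LID model; it predicts NLLB-style labels such as
# '__label__rus_Cyrl' and '__label__kaz_Cyrl'.
lid = fasttext.load_model(
    hf_hub_download('facebook/fasttext-language-identification', 'model.bin'))
labse = SentenceTransformer('sentence-transformers/LaBSE')

def keep_pair(ru: str, kk: str, threshold: float = 0.7) -> bool:
    """Keep a pair only if both sides are in the expected language (step 3)
    and their LaBSE embeddings are similar enough (step 4)."""
    (ru_label,), _ = lid.predict(ru)
    (kk_label,), _ = lid.predict(kk)
    if ru_label != '__label__rus_Cyrl' or kk_label != '__label__kaz_Cyrl':
        return False
    emb = labse.encode([ru, kk], normalize_embeddings=True)
    return float(emb[0] @ emb[1]) >= threshold  # cosine similarity

print(keep_pair("Мама мыла раму", "Анам жақтауды сабындады"))
```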
 
The model was trained for 56 hours on 2 NVIDIA A100 80 GB GPUs.
 
## Evaluation

The model was compared to another open-source translation model, [NLLB](https://huggingface.co/docs/transformers/model_doc/nllb); we evaluated against all versions of NLLB except nllb-moe-54b, which was excluded due to its size.
The metrics (BLEU, chrF and COMET) were calculated on the `devtest` part of the [FLORES+ evaluation benchmark](https://github.com/openlanguagedata/flores), the most recent evaluation benchmark for multilingual machine translation.
Calculation of BLEU and chrF follows the standard implementation from [sacreBLEU](https://github.com/mjpost/sacrebleu), and COMET is calculated using the default model described in the [COMET repository](https://github.com/Unbabel/COMET).
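
A minimal sketch of such an evaluation; the file names and the `Unbabel/wmt22-comet-da` checkpoint (COMET's usual default) are assumptions:

```python
import sacrebleu
from comet import download_model, load_from_checkpoint

# Assumed plain-text files: one segment per line, aligned across files.
srcs = open('flores.devtest.ru', encoding='utf-8').read().splitlines()
refs = open('flores.devtest.kk', encoding='utf-8').read().splitlines()
hyps = [generate(s) for s in srcs]  # generate() from the Usage section

print('BLEU:', sacrebleu.corpus_bleu(hyps, [refs]).score)
print('chrF:', sacrebleu.corpus_chrf(hyps, [refs]).score)

# COMET scores each (source, hypothesis, reference) triple.
comet = load_from_checkpoint(download_model('Unbabel/wmt22-comet-da'))
data = [{'src': s, 'mt': h, 'ref': r} for s, h, r in zip(srcs, hyps, refs)]
print('COMET:', comet.predict(data, batch_size=16).system_score)
```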
 
| Model | Size | BLEU | chrF | COMET |
|-------|------|------|------|-------|
| [nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B) | 1.3B | 14.8 | 50.1 | 0.8819 |
| [nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B) | 1.3B | 15.2 | 50.2 | 0.8843 |
| [nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B) | 3.3B | 15.6 | 50.7 | **0.8891** |
| This model | 197M | **16.2** | **51.8** | 0.8836 |
 
## Examples of usage:
 
```
>>> print(generate("Помогите мне удивить девушку"))  # "Help me surprise a girl"
Қызды таң қалдыруға көмектесіңіз
```
 
## Citations

```
@misc{deepvk2024kazRushrukk,
    title={kazRush-ru-kk: translation model from Russian to Kazakh},
    author={Lebedeva, Anna and Sokolov, Andrey},
    url={https://huggingface.co/deepvk/kazRush-ru-kk},
    publisher={Hugging Face},
    year={2024},
}
```