Commit 2836b3f · readme update #4
Parent: fb7d5d2

README.md CHANGED
@@ -12,6 +12,8 @@ license: apache-2.0
 language:
 - ru
 - kk
+datasets:
+- issai/kazparc
 ---
 
 # kazRush-ru-kk
@@ -23,11 +25,7 @@ kazRush-ru-kk is a translation model for translating from Russian to Kazakh.
 
 ## Usage
 
-Using the model requires:
-
-```bash
-pip install numpy==1.26.4 torch~=2.2.2 transformers~=4.39.2 sentencepiece~=0.2.0
-```
+Using the model requires the `sentencepiece` library to be installed.
 
 After installing the necessary dependencies, the model can be run with the following code:
 
@@ -35,13 +33,14 @@ After installing the necessary dependencies, the model can be run with the following
 from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
 import torch
 
-model = AutoModelForSeq2SeqLM.from_pretrained('deepvk/KazRush-ru-kk')
+device = 'cuda'
+model = AutoModelForSeq2SeqLM.from_pretrained('deepvk/KazRush-ru-kk').to(device)
 tokenizer = AutoTokenizer.from_pretrained('deepvk/KazRush-ru-kk')
 
+@torch.inference_mode
 def generate(text, **kwargs):
-    inputs = tokenizer(text, return_tensors='pt').to(
-
-    hypotheses = model.generate(**inputs, num_beams=5, **kwargs)
+    inputs = tokenizer(text, return_tensors='pt').to(device)
+    hypotheses = model.generate(**inputs, num_beams=5, **kwargs)
     return tokenizer.decode(hypotheses[0], skip_special_tokens=True)
 
 print(generate("Как Кока-Кола может помочь автомобилисту?"))
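The next hunk's context picks up inside the README's `pipeline` example, of which only the output line is visible in this diff. For orientation, a minimal sketch of such a call is below; the task string `'translation'` and the Russian input sentence are assumptions, not taken from the README itself:

```python
# Hedged sketch of the pipeline wrapper the README mentions; the actual
# snippet sits outside the diff context. Task string and input are assumed.
from transformers import pipeline

translator = pipeline('translation', model='deepvk/KazRush-ru-kk')
print(translator('Мама мыла раму'))
# The diff's context line shows output of this shape:
# [{'translation_text': 'Анам жақтауды сабындады'}]
```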
@@ -56,32 +55,31 @@ You can also access the model via _pipeline_ wrapper:
 [{'translation_text': 'Анам жақтауды сабындады'}]
 ```
 
-## Training
-
-### Training Data
+## Data and Training
 
 This model was trained on the following data (Russian-Kazakh language pairs):
-[OPUS Corpora](<https://opus.nlpl.eu/results/ru&kk/corpus-result-table>)
-[kazparc](<https://huggingface.co/datasets/issai/kazparc>)
-[wmt19 dataset](<https://statmt.org/wmt19/translation-task.html#download>)
-
 
+| Dataset | Number of pairs |
+|---------|-----------------|
+| [OPUS Corpora](<https://opus.nlpl.eu/results/ru&kk/corpus-result-table>) | 718K |
+| [kazparc](<https://huggingface.co/datasets/issai/kazparc>) | 2,150K |
+| [wmt19 dataset](<https://statmt.org/wmt19/translation-task.html#download>) | 5,063K |
+| [TIL dataset](<https://github.com/turkic-interlingua/til-mt/tree/master/til_corpus>) | 4,403K |
 
 Preprocessing of the data included:
-
-
-
-
-
-
-#### Training
+1. deduplication;
+2. removing trash symbols, special tags, multiple whitespaces, etc. from the texts;
+3. removing texts that were not in Russian or Kazakh (language detection was done via [facebook/fasttext-language-identification](<https://huggingface.co/facebook/fasttext-language-identification>));
+4. removing pairs with a low alignment score (comparison was performed via [sentence-transformers/LaBSE](<https://huggingface.co/sentence-transformers/LaBSE>));
+5. filtering the data using the [opusfilter](<https://github.com/Helsinki-NLP/OpusFilter>) tools.
 
 The model was trained for 56 hours on 2 NVIDIA A100 80 GB GPUs.
 
 ## Evaluation
 
-Current model was compared to another open-source translation model, NLLB. We compared our model to all version of
-The metrics - BLEU, chrF and COMET - were calculated on `devtest` part of [FLORES+ evaluation benchmark](<https://github.com/openlanguagedata/flores>), most recent evaluation benchmark for multilingual machine translation.
+The current model was compared to another open-source translation model, [NLLB](<https://huggingface.co/docs/transformers/model_doc/nllb>). We compared it to all versions of NLLB, excluding nllb-moe-54b due to its size.
+The metrics - BLEU, chrF and COMET - were calculated on the `devtest` part of the [FLORES+ evaluation benchmark](<https://github.com/openlanguagedata/flores>), the most recent evaluation benchmark for multilingual machine translation.
+Calculation of BLEU and chrF follows the standard implementation from [sacreBLEU](<https://github.com/mjpost/sacrebleu>), and COMET is calculated using the default model described in the [COMET repository](<https://github.com/Unbabel/COMET>).
 
 | Model | Size | BLEU | chrF | COMET |
 |-------|------|------|------|-------|
@@ -89,7 +87,7 @@ The metrics - BLEU, chrF and COMET - were calculated on the `devtest` part of the [FLORE
 | [nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B) | 1.3B | 14.8 | 50.1 | 0.8819 |
 | [nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B) | 1.3B | 15.2 | 50.2 | 0.8843 |
 | [nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B) | 3.3B | 15.6 | 50.7 | **0.8891** |
-| [
+| [This model]() | 197M | **16.2** | **51.8** | 0.8836 |
 
 ## Examples of usage:
 
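The preprocessing list added in the hunk above names concrete tools for steps 3 and 4. As an illustration only - not the authors' actual scripts - a per-pair filter could look roughly like this; the `keep_pair` helper, the 0.5 similarity threshold, and the batching-free structure are all assumptions:

```python
# Illustrative sketch of preprocessing steps 3 and 4 described above; the
# authors' real pipeline, thresholds, and batching are not shown in the diff.
import fasttext
from huggingface_hub import hf_hub_download
from sentence_transformers import SentenceTransformer

# Step 3: language detection with facebook/fasttext-language-identification.
lid_path = hf_hub_download("facebook/fasttext-language-identification", "model.bin")
lid = fasttext.load_model(lid_path)

# Step 4: alignment scoring with LaBSE sentence embeddings.
labse = SentenceTransformer("sentence-transformers/LaBSE")

def keep_pair(ru: str, kk: str, min_sim: float = 0.5) -> bool:
    """Return True if the pair passes language and alignment checks.
    `min_sim` is an assumed threshold, not a documented value."""
    (ru_label,), _ = lid.predict(ru)
    (kk_label,), _ = lid.predict(kk)
    if not (ru_label.endswith("rus_Cyrl") and kk_label.endswith("kaz_Cyrl")):
        return False
    # Cosine similarity of normalized embeddings as the alignment score.
    emb = labse.encode([ru, kk], normalize_embeddings=True)
    return float(emb[0] @ emb[1]) >= min_sim

print(keep_pair("Мама мыла раму", "Анам жақтауды сабындады"))
```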
@@ -102,4 +100,16 @@ The metrics - BLEU, chrF and COMET - were calculated on the `devtest` part of the [FLORE
 
 >>> print(generate("Помогите мне удивить девушку"))
 Қызды таң қалдыруға көмектесіңіз
-```
+```
+
+## Citations
+
+```
+@misc{deepvk2024kazRushrukk,
+    title={kazRush-ru-kk: translation model from Russian to Kazakh},
+    author={Lebedeva, Anna and Sokolov, Andrey},
+    url={https://huggingface.co/deepvk/kazRush-ru-kk},
+    publisher={Hugging Face},
+    year={2024},
+}
+```
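The evaluation section above states that BLEU and chrF follow sacreBLEU and that COMET uses the default model from the COMET repository. A sketch of that scoring is below; `hyps`, `refs`, and `srcs` are placeholders, and `Unbabel/wmt22-comet-da` is assumed to be the default COMET checkpoint rather than confirmed by this page:

```python
# Hedged sketch of the metric computation the evaluation section describes.
import sacrebleu
from comet import download_model, load_from_checkpoint

hyps = ["Анам жақтауды сабындады"]   # model outputs (placeholder)
refs = ["Анам жақтауды сабындады"]   # FLORES+ devtest references (placeholder)
srcs = ["Мама мыла раму"]            # source sentences; COMET needs them

# BLEU and chrF via sacreBLEU's corpus-level convenience functions.
bleu = sacrebleu.corpus_bleu(hyps, [refs])
chrf = sacrebleu.corpus_chrf(hyps, [refs])
print(f"BLEU {bleu.score:.1f}  chrF {chrf.score:.1f}")

# COMET with an assumed default checkpoint; gpus=1 assumes a GPU is
# available (use gpus=0 to run on CPU).
comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(srcs, hyps, refs)]
print("COMET", comet_model.predict(data, batch_size=8, gpus=1).system_score)
```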