---
library_name: transformers
pipeline_tag: translation
tags:
- transformers
- translation
- pytorch
- russian
- kazakh
license: apache-2.0
language:
- ru
- kk
datasets:
- issai/kazparc
---
# kazRush-ru-kk
kazRush-ru-kk is a model for translating from Russian to Kazakh. It uses the T5 configuration and was trained from scratch (randomly initialized weights) on available open-source parallel data.
## Usage
Using the model requires the `sentencepiece` library to be installed.
After installing the necessary dependencies, the model can be run with the following code:
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

# Fall back to CPU when no GPU is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = AutoModelForSeq2SeqLM.from_pretrained('deepvk/kazRush-ru-kk').to(device)
tokenizer = AutoTokenizer.from_pretrained('deepvk/kazRush-ru-kk')

@torch.inference_mode()
def generate(text, **kwargs):
    # Tokenize the Russian input and move the tensors to the model's device
    inputs = tokenizer(text, return_tensors='pt').to(device)
    # Beam search with 5 beams; extra generation arguments pass through kwargs
    hypotheses = model.generate(**inputs, num_beams=5, **kwargs)
    return tokenizer.decode(hypotheses[0], skip_special_tokens=True)

print(generate("Как Кока-Кола может помочь автомобилисту?"))
```
You can also access the model via the `pipeline` wrapper:
```python
>>> from transformers import pipeline
>>> pipe = pipeline(model="deepvk/kazRush-ru-kk")
>>> pipe("Мама мыла раму")
[{'translation_text': 'Анам жақтауды сабындады'}]
```
## Data and Training
This model was trained on the following data (Russian-Kazakh language pairs):
| Dataset | Number of pairs |
|-----------------------------------------|-------|
| [OPUS Corpora](<https://opus.nlpl.eu/results/ru&kk/corpus-result-table>) | 718K |
| [kazparc](<https://huggingface.co/datasets/issai/kazparc>) | 2,150K |
| [wmt19 dataset](<https://statmt.org/wmt19/translation-task.html#download>) | 5,063K |
| [TIL dataset](<https://github.com/turkic-interlingua/til-mt/tree/master/til_corpus>) | 4,403K |
Preprocessing of the data included:
1. deduplication
2. removing junk symbols, special tags, repeated whitespace, etc. from texts
3. removing texts that were not in Russian or Kazakh (language detection was performed with [facebook/fasttext-language-identification](<https://huggingface.co/facebook/fasttext-language-identification>))
4. removing pairs that had a low alignment score (the comparison was performed with [sentence-transformers/LaBSE](<https://huggingface.co/sentence-transformers/LaBSE>))
5. filtering the data using [opusfilter](<https://github.com/Helsinki-NLP/OpusFilter>) tools
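
As a rough illustration of steps 1, 2 and 4, the sketch below cleans and deduplicates sentence pairs and then drops pairs whose embeddings are dissimilar. It is not the authors' pipeline: the tag pattern, the `embed` callable (which would wrap a sentence encoder such as LaBSE in practice) and the `0.7` threshold are all assumptions made for the example.

```python
import math
import re
from typing import Callable, Iterable, List, Sequence, Tuple

Pair = Tuple[str, str]  # (Russian text, Kazakh text)

# Assumed definition of "special tags": anything that looks like <...>
TAG_RE = re.compile(r"<[^>]+>")
SPACE_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Replace tag-like markup with a space and collapse runs of whitespace."""
    text = TAG_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def clean_and_deduplicate(pairs: Iterable[Pair]) -> List[Pair]:
    """Clean both sides, drop pairs with an empty side, keep the first copy of each pair."""
    seen = set()
    result = []
    for src, tgt in pairs:
        pair = (clean_text(src), clean_text(tgt))
        if not pair[0] or not pair[1] or pair in seen:
            continue
        seen.add(pair)
        result.append(pair)
    return result


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_by_alignment(
    pairs: Iterable[Pair],
    embed: Callable[[str], Sequence[float]],  # hypothetical encoder wrapper
    threshold: float = 0.7,                   # assumed cut-off, not from the model card
) -> List[Pair]:
    """Keep pairs whose source and target embeddings are similar enough."""
    return [(s, t) for s, t in pairs if cosine(embed(s), embed(t)) >= threshold]
```

With a multilingual encoder behind `embed`, a translation pair should score close to 1 and an unrelated pair much lower, so the threshold controls how aggressively noisy pairs are removed.
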
The model was trained for 56 hours on 2 NVIDIA A100 80 GB GPUs.
## Evaluation
The model was compared with another open-source translation model, [NLLB](<https://huggingface.co/docs/transformers/model_doc/nllb>). We evaluated all versions of NLLB except nllb-moe-54b, which was excluded due to its size.
The metrics (BLEU, chrF, and COMET) were calculated on the `devtest` split of the [FLORES+ evaluation benchmark](<https://github.com/openlanguagedata/flores>), the most recent evaluation benchmark for multilingual machine translation.
BLEU and chrF follow the standard implementation from [sacreBLEU](<https://github.com/mjpost/sacrebleu>), and COMET is calculated using the default model described in the [COMET repository](<https://github.com/Unbabel/COMET>).
| Model | Size | BLEU | chrF | COMET |
|-----------------------------------------|-------|-----------------------------|------------------------|--------|
| [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) | 600M | 13.8 | 48.2 | 86.8 |
| [nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B) | 1.3B | 14.8 | 50.1 | 88.1 |
| [nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B) | 1.3B | 15.2 | 50.2 | 88.4 |
| [nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B) | 3.3B | 15.6 | 50.7 | **88.9** |
| This model | 197M | **16.2** | **51.8** | 88.3 |
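
For readers who want to compute comparable BLEU and chrF scores, a minimal sketch with sacreBLEU's default settings might look like the following. The `score_translations` helper is a name introduced here for illustration, not part of the model's repository. Loading the FLORES+ `devtest` sentences and computing COMET are not shown.

```python
from typing import Dict, List


def score_translations(hypotheses: List[str], references: List[str]) -> Dict[str, float]:
    """Corpus-level BLEU and chrF using sacreBLEU's default settings."""
    if len(hypotheses) != len(references):
        raise ValueError("hypotheses and references must have the same length")

    # Imported lazily so the optional dependency is only needed when scoring;
    # requires `pip install sacrebleu`.
    import sacrebleu

    # sacreBLEU expects a list of reference streams, hence the extra list.
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    chrf = sacrebleu.corpus_chrf(hypotheses, [references])
    return {"BLEU": bleu.score, "chrF": chrf.score}
```

Here `hypotheses` would hold the model's Kazakh translations of the Russian source sentences and `references` the corresponding Kazakh reference sentences, in the same order.
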
## Examples of usage
```python
>>> print(generate("Каждый охотник желает знать, где сидит фазан."))
Әрбір аңшы ғибадатхананың қайда отырғанын білгісі келеді.
>>> print(generate("Местным продуктом-специалитетом с защищённым географическим наименованием по происхождению считается люнебургский степной барашек."))
Шығу тегі бойынша қорғалған географиялық атауы бар жергілікті мамандандырылған өнім болып люнебургтік дала қошқар болып саналады.
>>> print(generate("Помогите мне удивить девушку"))
Қызды таң қалдыруға көмектесіңіз
```
## Citations
```
@misc{deepvk2024kazRushrukk,
title={kazRush-ru-kk: translation model from Russian to Kazakh},
author={Lebedeva, Anna and Sokolov, Andrey},
url={https://huggingface.co/deepvk/kazRush-ru-kk},
publisher={Hugging Face},
year={2024},
}
```