---
library_name: transformers
pipeline_tag: translation
tags:
- transformers
- translation
- pytorch
- russian
- kazakh
license: apache-2.0
language:
- ru
- kk
datasets:
- issai/kazparc
---

# kazRush-ru-kk

kazRush-ru-kk is a model for translating from Russian to Kazakh. It uses the T5 configuration and was trained from randomly initialized weights on openly available parallel data.

## Usage

The model requires the `sentencepiece` library. Once the dependencies are installed, the model can be run with the following code:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

# Fall back to CPU when no GPU is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = AutoModelForSeq2SeqLM.from_pretrained('deepvk/kazRush-ru-kk').to(device)
tokenizer = AutoTokenizer.from_pretrained('deepvk/kazRush-ru-kk')

# Called with parentheses so the decorator also works on older PyTorch versions
@torch.inference_mode()
def generate(text, **kwargs):
    # Tokenize the Russian input and move the tensors to the model's device
    inputs = tokenizer(text, return_tensors='pt').to(device)
    # Beam search with 5 beams; extra generation options can be passed via kwargs
    hypotheses = model.generate(**inputs, num_beams=5, **kwargs)
    return tokenizer.decode(hypotheses[0], skip_special_tokens=True)

print(generate("Как Кока-Кола может помочь автомобилисту?"))
```

You can also access the model via the _pipeline_ wrapper:

```python
>>> from transformers import pipeline

>>> pipe = pipeline(model="deepvk/kazRush-ru-kk")
>>> pipe("Мама мыла раму")
[{'translation_text': 'Анам жақтауды сабындады'}]
```

## Data and Training

The model was trained on the following data (Russian-Kazakh language pairs):

| Dataset            | Number of pairs |
|--------------------|-----------------|
| [OPUS Corpora]()   | 718K            |
| [kazparc]()        | 2,150K          |
| [wmt19 dataset]()  | 5,063K          |
| [TIL dataset]()    | 4,403K          |

Preprocessing of the data included:

1. deduplication
2. removing junk symbols, special tags, repeated whitespace, etc. from the texts
3. removing texts that were not in Russian or Kazakh (language detection was performed with [facebook/fasttext-language-identification]())
4. removing pairs that had a low alignment score (the comparison was performed with [sentence-transformers/LaBSE]())
5. filtering the data using [opusfilter]() tools

The model was trained for 56 hours on two NVIDIA A100 80 GB GPUs.

## Evaluation

The model was compared to another open-source translation model, [NLLB](). We compared it to all versions of NLLB except nllb-moe-54b, which was excluded because of its size. The metrics (BLEU, chrF and COMET) were calculated on the `devtest` part of the [FLORES+ evaluation benchmark](), the most recent evaluation benchmark for multilingual machine translation. BLEU and chrF follow the standard implementation from [sacreBLEU](), and COMET is calculated using the default model described in the [COMET repository]().

| Model | Size | BLEU | chrF | COMET |
|-------|------|------|------|-------|
| [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) | 600M | 13.8 | 48.2 | 86.8 |
| [nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B) | 1.3B | 14.8 | 50.1 | 88.1 |
| [nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B) | 1.3B | 15.2 | 50.2 | 88.4 |
| [nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B) | 3.3B | 15.6 | 50.7 | **88.9** |
| This model | 197M | **16.2** | **51.8** | 88.3 |

## Examples of usage

```python
>>> print(generate("Каждый охотник желает знать, где сидит фазан."))
Әрбір аңшы ғибадатхананың қайда отырғанын білгісі келеді.

>>> print(generate("Местным продуктом-специалитетом с защищённым географическим наименованием по происхождению считается люнебургский степной барашек."))
Шығу тегі бойынша қорғалған географиялық атауы бар жергілікті мамандандырылған өнім болып люнебургтік дала қошқар болып саналады.

>>> print(generate("Помогите мне удивить девушку"))
Қызды таң қалдыруға көмектесіңіз
```

## Citations

```
@misc{deepvk2024kazRushrukk,
    title={kazRush-ru-kk: translation model from Russian to Kazakh},
    author={Lebedeva, Anna and Sokolov, Andrey},
    url={https://huggingface.co/deepvk/kazRush-ru-kk},
    publisher={Hugging Face},
    year={2024},
}
```
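
## Preprocessing sketch

The preprocessing steps listed under Data and Training can be illustrated with a minimal, standard-library-only sketch. It is not the pipeline that was actually used to prepare the training data: the cleaning rules, the function names `clean_text` and `preprocess`, and the `score_fn` / `min_score` parameters are all hypothetical. The sketch covers text cleaning (step 2), deduplication (step 1) and threshold filtering on an alignment score (step 4). The score itself is assumed to come from a user-supplied callable, for example one built on LaBSE embeddings. Language identification and opusfilter are left out.

```python
import re
import unicodedata

# Hypothetical cleaning rules; the rules used for the real training data are not documented here.
_TAG_RE = re.compile(r"<[^>]+>")  # markup-like special tags such as <p> or <br>
_WS_RE = re.compile(r"\s+")       # runs of whitespace


def clean_text(text: str) -> str:
    """Remove tags and control/format characters, then collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = _TAG_RE.sub(" ", text)
    # Keep ordinary whitespace, drop other characters in Unicode "C" categories
    # (control and format characters such as zero-width spaces).
    text = "".join(
        ch for ch in text
        if ch.isspace() or not unicodedata.category(ch).startswith("C")
    )
    return _WS_RE.sub(" ", text).strip()


def preprocess(pairs, score_fn=None, min_score=0.0):
    """Clean, deduplicate and optionally filter (source, target) pairs.

    score_fn is an assumed callable (src, tgt) -> float, e.g. a cosine
    similarity between sentence embeddings; pairs scoring below min_score
    are dropped.
    """
    seen = set()
    result = []
    for src, tgt in pairs:
        src, tgt = clean_text(src), clean_text(tgt)
        if not src or not tgt:
            continue  # skip pairs with an empty side after cleaning
        key = (src, tgt)
        if key in seen:
            continue  # exact duplicate of an earlier cleaned pair
        seen.add(key)
        if score_fn is not None and score_fn(src, tgt) < min_score:
            continue  # alignment score too low
        result.append(key)
    return result
```

Deduplicating after cleaning means that pairs differing only in markup or whitespace are treated as the same pair. Whether the original pipeline ordered the steps this way is not stated in this card.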