metadata
library_name: transformers
pipeline_tag: translation
tags:
  - transformers
  - translation
  - pytorch
  - russian
  - kazakh
license: apache-2.0
language:
  - ru
  - kk
datasets:
  - issai/kazparc

kazRush-ru-kk

kazRush-ru-kk is a model for translating from Russian to Kazakh. It was trained from scratch: the weights were randomly initialized according to the T5 configuration and then trained on openly available parallel data.

Usage

Using the model requires the sentencepiece library to be installed.
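A quick way to confirm the environment is ready is to check that the needed packages are importable. This is a minimal sketch; the helper name `missing_packages` and the package list are assumptions based on the imports used in this README, not part of the model's API.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if find_spec(name) is None]


# Import names the usage snippet relies on (assumed from this README).
required = ["transformers", "torch", "sentencepiece"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```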

After installing the necessary dependencies, the model can be run with the following code:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

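# Use the GPU when one is available, otherwise fall back to the CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = AutoModelForSeq2SeqLM.from_pretrained('deepvk/kazRush-ru-kk').to(device)
tokenizer = AutoTokenizer.from_pretrained('deepvk/kazRush-ru-kk')

@torch.inference_mode()  # disable autograd tracking during generation
def generate(text, **kwargs):
    # Tokenize the Russian input and move the tensors to the model's device.
    inputs = tokenizer(text, return_tensors='pt').to(device)
    # Beam search with 5 beams; extra generation options pass through kwargs.
    hypotheses = model.generate(**inputs, num_beams=5, **kwargs)
    return tokenizer.decode(hypotheses[0], skip_special_tokens=True)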

print(generate("Как Кока-Кола может помочь автомобилисту?"))

You can also access the model via the pipeline wrapper:

>>> from transformers import pipeline

>>> pipe = pipeline(model="deepvk/kazRush-ru-kk")
>>> pipe("Мама мыла раму")
[{'translation_text': 'Анам жақтауды сабындады'}]

Data and Training

This model was trained on the following data (Russian-Kazakh language pairs):

| Dataset | Number of pairs |
|---|---|
| OPUS Corpora | 718K |
| kazparc | 2,150K |
| wmt19 dataset | 5,063K |
| TIL dataset | 4,403K |
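For a sense of the overall data mix, the counts listed above can be summed. The snippet only restates those numbers; the source does not say whether they were taken before or after preprocessing, so the total should be read as approximate.

```python
# Pair counts in thousands, copied from the dataset list above.
pairs_k = {
    "OPUS Corpora": 718,
    "kazparc": 2150,
    "wmt19 dataset": 5063,
    "TIL dataset": 4403,
}

total_k = sum(pairs_k.values())
# Share of each dataset in the combined corpus, in percent.
shares = {name: round(100 * count / total_k, 1) for name, count in pairs_k.items()}

print(f"total: {total_k:,}K pairs")
for name, share in shares.items():
    print(f"{name}: {share}%")
```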

Preprocessing of the data included:

  1. deduplication
  2. removing junk symbols, special tags, repeated whitespace, etc. from texts
  3. removing texts that were not in Russian or Kazakh (language detection was performed with facebook/fasttext-language-identification)
  4. removing pairs that had a low alignment score (the comparison was performed with sentence-transformers/LaBSE)
  5. filtering the data using opusfilter tools
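The steps above can be sketched roughly as follows. This is an illustrative approximation, not the authors' pipeline: the function names, the regular expressions, and the `0.7` threshold are all hypothetical, and the alignment scorer is passed in as `score_fn` (in practice it might be a cosine similarity between LaBSE embeddings). Language identification and opusfilter are left out. Cleaning runs before deduplication here because cleaning can itself create duplicates; that ordering is a choice of this sketch.

```python
import re
import unicodedata
from typing import Callable, Iterable

TAG_RE = re.compile(r"<[^>]+>")  # HTML/XML-like tags
SPACE_RE = re.compile(r"\s+")    # runs of whitespace

Pair = tuple[str, str]


def clean_text(text: str) -> str:
    """Replace tags and control/format characters with spaces, then collapse whitespace."""
    text = TAG_RE.sub(" ", text)
    # Unicode categories starting with "C" cover control, format and unassigned characters.
    text = "".join(" " if unicodedata.category(ch).startswith("C") else ch for ch in text)
    return SPACE_RE.sub(" ", text).strip()


def deduplicate(pairs: Iterable[Pair]) -> list[Pair]:
    """Drop exact duplicate pairs while keeping the first occurrence and the order."""
    return list(dict.fromkeys(pairs))


def preprocess(
    pairs: Iterable[Pair],
    score_fn: Callable[[str, str], float],
    threshold: float = 0.7,  # hypothetical cut-off, not taken from the source
) -> list[Pair]:
    """Clean both sides, drop empty and duplicate pairs, keep well-aligned pairs."""
    cleaned = [(clean_text(src), clean_text(tgt)) for src, tgt in pairs]
    non_empty = [(src, tgt) for src, tgt in cleaned if src and tgt]
    unique = deduplicate(non_empty)
    return [(src, tgt) for src, tgt in unique if score_fn(src, tgt) >= threshold]
```

Passing the scorer as a parameter keeps the sketch independent of any particular embedding model and makes the filter easy to test with a stub.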

The model was trained for 56 hours on two NVIDIA A100 80 GB GPUs.

Evaluation

This model was compared to NLLB, another open-source translation model. We evaluated all NLLB versions except nllb-moe-54b, which was excluded due to its size. The metrics (BLEU, chrF and COMET) were calculated on the devtest split of FLORES+, the most recent evaluation benchmark for multilingual machine translation.
BLEU and chrF follow the standard sacreBLEU implementation, and COMET is calculated using the default model described in the COMET repository.

| Model | Size | BLEU | chrF | COMET |
|---|---|---|---|---|
| nllb-200-distilled-600M | 600M | 13.8 | 48.2 | 86.8 |
| nllb-200-1.3B | 1.3B | 14.8 | 50.1 | 88.1 |
| nllb-200-distilled-1.3B | 1.3B | 15.2 | 50.2 | 88.4 |
| nllb-200-3.3B | 3.3B | 15.6 | 50.7 | 88.9 |
| This model | 197M | 16.2 | 51.8 | 88.3 |
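BLEU and chrF scores of this kind could be computed along the following lines with sacreBLEU's default settings. This is a hedged sketch: the helper name `corpus_scores` is hypothetical, the exact settings used for the reported numbers are not stated beyond "standard implementation", and loading the FLORES+ devtest sentences as well as COMET scoring are not shown.

```python
def corpus_scores(hypotheses: list[str], references: list[str]) -> dict[str, float]:
    """Corpus-level BLEU and chrF for one reference per hypothesis."""
    if len(hypotheses) != len(references):
        raise ValueError("hypotheses and references must have the same length")

    # Imported lazily so this module still loads where sacrebleu is not installed.
    import sacrebleu

    # sacreBLEU expects a list of reference streams; there is a single stream here.
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    chrf = sacrebleu.corpus_chrf(hypotheses, [references])
    return {"BLEU": round(bleu.score, 1), "chrF": round(chrf.score, 1)}
```

Given lists `sources` and `references` with the Russian and Kazakh sides of the benchmark (assumed to be loaded elsewhere), the hypotheses could be produced with the `generate` function from the Usage section, for example `corpus_scores([generate(s) for s in sources], references)`.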

Examples of usage:

>>> print(generate("Каждый охотник желает знать, где сидит фазан."))
Әрбір аңшы ғибадатхананың қайда отырғанын білгісі келеді.

>>> print(generate("Местным продуктом-специалитетом с защищённым географическим наименованием по происхождению считается люнебургский степной барашек."))
Шығу тегі бойынша қорғалған географиялық атауы бар жергілікті мамандандырылған өнім болып люнебургтік дала қошқар болып саналады.

>>> print(generate("Помогите мне удивить девушку"))
Қызды таң қалдыруға көмектесіңіз

Citations

@misc{deepvk2024kazRushrukk,
    title={kazRush-ru-kk: translation model from Russian to Kazakh},
    author={Lebedeva, Anna and Sokolov, Andrey},
    url={https://huggingface.co/deepvk/kazRush-ru-kk},
    publisher={Hugging Face},
    year={2024},
}