|
--- |
|
library_name: transformers |
|
pipeline_tag: translation |
|
tags: |
|
- transformers |
|
- translation |
|
- pytorch |
|
- russian |
|
- kazakh |
|
|
|
license: apache-2.0 |
|
language: |
|
- ru |
|
- kk |
|
datasets: |
|
- issai/kazparc |
|
--- |
|
|
|
# kazRush-ru-kk |
|
|
|
kazRush-ru-kk is a translation model for translating from Russian to Kazakh. The model was trained from scratch (randomly initialized weights) with a T5 configuration on available open-source parallel data.
|
|
|
## Usage |
|
|
|
Using the model requires the `sentencepiece` library to be installed (e.g. `pip install sentencepiece`). After installing the necessary dependencies, the model can be run with the following code:
|
|
|
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

# Fall back to CPU if no GPU is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = AutoModelForSeq2SeqLM.from_pretrained('deepvk/kazRush-ru-kk').to(device)
tokenizer = AutoTokenizer.from_pretrained('deepvk/kazRush-ru-kk')

@torch.inference_mode()
def generate(text, **kwargs):
    # Translate a single sentence with beam search
    inputs = tokenizer(text, return_tensors='pt').to(device)
    hypotheses = model.generate(**inputs, num_beams=5, **kwargs)
    return tokenizer.decode(hypotheses[0], skip_special_tokens=True)

print(generate("Как Кока-Кола может помочь автомобилисту?"))
```
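Note that `generate` forwards its keyword arguments to `model.generate`, so standard generation parameters (for example, `max_new_tokens`) can be passed through it.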
|
|
|
You can also access the model via the `pipeline` wrapper:
|
```python
>>> from transformers import pipeline

>>> pipe = pipeline(model="deepvk/kazRush-ru-kk")
>>> pipe("Мама мыла раму")
[{'translation_text': 'Анам жақтауды сабындады'}]
```
|
|
|
## Data and Training |
|
|
|
This model was trained on the following data (Russian-Kazakh language pairs): |
|
|
|
| Dataset | Number of pairs |
|---------|-----------------|
| [OPUS Corpora](<https://opus.nlpl.eu/results/ru&kk/corpus-result-table>) | 718K |
| [kazparc](<https://huggingface.co/datasets/issai/kazparc>) | 2,150K |
| [wmt19 dataset](<https://statmt.org/wmt19/translation-task.html#download>) | 5,063K |
| [TIL dataset](<https://github.com/turkic-interlingua/til-mt/tree/master/til_corpus>) | 4,403K |
|
|
|
Preprocessing of the data included:

1. deduplication;
2. removing garbage symbols, special tags, multiple whitespaces, etc. from texts;
3. removing texts that were not in Russian or Kazakh (language detection was performed with [facebook/fasttext-language-identification](<https://huggingface.co/facebook/fasttext-language-identification>));
4. removing pairs with a low alignment score (similarity was computed with [sentence-transformers/LaBSE](<https://huggingface.co/sentence-transformers/LaBSE>); see the sketch after this list);
5. filtering the data with [opusfilter](<https://github.com/Helsinki-NLP/OpusFilter>) tools.
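
The exact filtering code is not published here, but a minimal sketch of the alignment-score step (item 4) could look as follows, assuming `sentence-transformers` is installed. The `threshold` value of 0.7 is an illustrative assumption, not the value used for this model:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

labse = SentenceTransformer("sentence-transformers/LaBSE")

def filter_by_alignment(pairs, threshold=0.7):
    """Keep only (ru, kk) pairs whose LaBSE embeddings are similar enough."""
    ru_emb = labse.encode([ru for ru, _ in pairs], normalize_embeddings=True)
    kk_emb = labse.encode([kk for _, kk in pairs], normalize_embeddings=True)
    # Embeddings are L2-normalized, so the row-wise dot product is cosine similarity
    scores = np.einsum("ij,ij->i", ru_emb, kk_emb)
    return [pair for pair, score in zip(pairs, scores) if score >= threshold]
```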
|
|
|
The model was trained for 56 hours on 2 NVIDIA A100 80 GB GPUs.
|
|
|
## Evaluation |
|
|
|
The model was compared to another open-source translation model, [NLLB](<https://huggingface.co/docs/transformers/model_doc/nllb>). We compared our model to all versions of NLLB except nllb-moe-54b, which was excluded due to its size.
The metrics (BLEU, chrF, and COMET) were calculated on the `devtest` part of the [FLORES+ evaluation benchmark](<https://github.com/openlanguagedata/flores>), the most recent evaluation benchmark for multilingual machine translation.
The calculation of BLEU and chrF follows the standard implementation from [sacreBLEU](<https://github.com/mjpost/sacrebleu>), and COMET is calculated using the default model described in the [COMET repository](<https://github.com/Unbabel/COMET>).
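
For reference, a minimal sketch of computing BLEU and chrF with `sacrebleu`; the single sentence pair below is purely illustrative, while the reported scores were computed over the full `devtest` split:

```python
import sacrebleu

# Purely illustrative pair; real evaluation uses the full FLORES+ devtest split
hypotheses = ["Анам жақтауды сабындады"]      # model outputs
references = [["Анам жақтауды сабындады"]]    # one list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}, chrF = {chrf.score:.1f}")
```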
|
|
|
| Model | Size | BLEU | chrF | COMET |
|-------|------|------|------|-------|
| [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) | 600M | 13.8 | 48.2 | 86.8 |
| [nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B) | 1.3B | 14.8 | 50.1 | 88.1 |
| [nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B) | 1.3B | 15.2 | 50.2 | 88.4 |
| [nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B) | 3.3B | 15.6 | 50.7 | **88.9** |
| This model | 197M | **16.2** | **51.8** | 88.3 |
|
|
|
## Examples of usage
|
|
|
```python
>>> print(generate("Каждый охотник желает знать, где сидит фазан."))
Әрбір аңшы ғибадатхананың қайда отырғанын білгісі келеді.

>>> print(generate("Местным продуктом-специалитетом с защищённым географическим наименованием по происхождению считается люнебургский степной барашек."))
Шығу тегі бойынша қорғалған географиялық атауы бар жергілікті мамандандырылған өнім болып люнебургтік дала қошқар болып саналады.

>>> print(generate("Помогите мне удивить девушку"))
Қызды таң қалдыруға көмектесіңіз
```
|
|
|
## Citations |
|
|
|
```
@misc{deepvk2024kazRushrukk,
    title={kazRush-ru-kk: translation model from Russian to Kazakh},
    author={Lebedeva, Anna and Sokolov, Andrey},
    url={https://huggingface.co/deepvk/kazRush-ru-kk},
    publisher={Hugging Face},
    year={2024},
}
```