---
language:
- kk
- tr
- ru
- en
language_details: eng_Latn, kaz_Cyrl, rus_Cyrl, tur_Latn
metrics:
- bleu
- chrf
pipeline_tag: translation
inference: false
datasets:
- facebook/flores
- issai/kazparc
---

# Tilmash

<p align = "justify">
Tilmash is a machine translation model fine-tuned from Facebook’s <a href = "https://huggingface.co/facebook/nllb-200-distilled-1.3B">NLLB</a> model. It translates between four languages: Kazakh, Russian, English, and Turkish.
The table below reports the results of evaluating Tilmash on the <a href = "https://huggingface.co/datasets/facebook/flores">FLoRes</a> and <a href = "https://huggingface.co/datasets/issai/kazparc">KazParC</a> test sets. Each cell lists <a href = "https://huggingface.co/spaces/evaluate-metric/bleu">BLEU</a> | <a href = "https://huggingface.co/spaces/evaluate-metric/chrf">chrF</a>, in that order.
</p>

<table align = "center">
<thead align = "center">
<tr>
<th>Pair</th>
<th>FLoRes</th>
<th>KazParC</th>
</tr>
</thead>
<tbody align = "center">
<tr>
<td>EN→KK</td>
<td>0.20 | 0.60</td>
<td>0.21 | 0.60</td>
</tr>
<tr>
<td>EN→RU</td>
<td>0.28 | 0.60</td>
<td>0.38 | 0.68</td>
</tr>
<tr>
<td>EN→TR</td>
<td>0.27 | 0.65</td>
<td>0.25 | 0.64</td>
</tr>
<tr>
<td>KK→EN</td>
<td>0.32 | 0.63</td>
<td>0.32 | 0.62</td>
</tr>
<tr>
<td>KK→RU</td>
<td>0.18 | 0.52</td>
<td>0.29 | 0.63</td>
</tr>
<tr>
<td>KK→TR</td>
<td>0.14 | 0.54</td>
<td>0.16 | 0.55</td>
</tr>
<tr>
<td>RU→EN</td>
<td>0.32 | 0.63</td>
<td>0.42 | 0.70</td>
</tr>
<tr>
<td>RU→KK</td>
<td>0.13 | 0.54</td>
<td>0.22 | 0.62</td>
</tr>
<tr>
<td>RU→TR</td>
<td>0.14 | 0.54</td>
<td>0.18 | 0.57</td>
</tr>
<tr>
<td>TR→EN</td>
<td>0.36 | 0.66</td>
<td>0.38 | 0.66</td>
</tr>
<tr>
<td>TR→KK</td>
<td>0.13 | 0.54</td>
<td>0.16 | 0.55</td>
</tr>
<tr>
<td>TR→RU</td>
<td>0.19 | 0.53</td>
<td>0.24 | 0.57</td>
</tr>
</tbody>
</table>

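For readers unfamiliar with chrF, the idea behind the metric can be sketched in a few lines of plain Python. This is a simplified illustration only, not the code used to produce the table: it averages character n-gram precision and recall over orders 1 to 6 and combines them into an F-score with β = 2. The function names `char_ngrams` and `simple_chrf` are hypothetical, and removing spaces before counting is an assumption, so the numbers may differ from those of a reference implementation.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of order n (spaces removed; an assumption)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF on a 0-1 scale (illustrative sketch)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            # One of the strings is shorter than n characters: skip this order.
            continue
        # Clipped overlap: each n-gram counts at most as often as in both texts.
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta score; beta = 2 weights recall more heavily than precision.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

Identical strings score 1.0 and strings with no characters in common score 0.0; partially matching translations fall in between, which is why chrF tends to be more forgiving than BLEU for morphologically rich languages such as Kazakh and Turkish.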
## Model Sources

- **Repository:** <a href = "https://github.com/IS2AI/KazParC">https://github.com/IS2AI/KazParC</a>
- **Paper:** <a href = "there_will_be_a_link_soon">KazParC: Kazakh Parallel Corpus for Machine Translation</a>
- **Demo:** <a href = "https://issai.nu.edu.kz/tilmash/">Tilmash Demo</a>

## How to Get Started with the Model

<p align = "justify">You can use this model with the Transformers translation pipeline.</p>

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, TranslationPipeline

model = AutoModelForSeq2SeqLM.from_pretrained("issai/tilmash")
tokenizer = AutoTokenizer.from_pretrained("issai/tilmash")

# For src_lang and tgt_lang, choose from kaz_Cyrl (Kazakh), rus_Cyrl (Russian),
# eng_Latn (English), and tur_Latn (Turkish).
tilmash = TranslationPipeline(
    model=model,
    tokenizer=tokenizer,
    src_lang="kaz_Cyrl",
    tgt_lang="eng_Latn",
    max_length=1000,
)

print(tilmash("Қазақстан — Шығыс Еуропа мен Орталық Азияда орналасқан мемлекет."))
# [{'translation_text': 'Kazakhstan is a country located in Eastern Europe and Central Asia.'}]
```
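
To switch translation direction without retyping the long language codes, a small wrapper can help. The sketch below is one possible approach, not part of the released model: `LANG_CODES`, `resolve_pair`, and `build_translator` are hypothetical names, and the mapping from the short codes (`kk`, `ru`, `en`, `tr`) to the NLLB-style codes is taken from the metadata of this card.

```python
from typing import Tuple

# Short language codes from this card mapped to the codes the pipeline expects.
LANG_CODES = {
    "kk": "kaz_Cyrl",  # Kazakh
    "ru": "rus_Cyrl",  # Russian
    "en": "eng_Latn",  # English
    "tr": "tur_Latn",  # Turkish
}


def resolve_pair(src: str, tgt: str) -> Tuple[str, str]:
    """Validate a (source, target) pair and return the pipeline language codes."""
    src, tgt = src.lower(), tgt.lower()
    for code in (src, tgt):
        if code not in LANG_CODES:
            raise ValueError(
                f"Unsupported language {code!r}; choose from {sorted(LANG_CODES)}"
            )
    if src == tgt:
        raise ValueError("Source and target languages must differ")
    return LANG_CODES[src], LANG_CODES[tgt]


def build_translator(src: str, tgt: str,
                     model_name: str = "issai/tilmash",
                     max_length: int = 1000):
    """Build a TranslationPipeline for one direction (hypothetical helper)."""
    # Imported here so the code validation above works without transformers.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, TranslationPipeline

    src_lang, tgt_lang = resolve_pair(src, tgt)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return TranslationPipeline(
        model=model,
        tokenizer=tokenizer,
        src_lang=src_lang,
        tgt_lang=tgt_lang,
        max_length=max_length,
    )
```

With this in place, `build_translator("ru", "kk")` would return a pipeline for Russian to Kazakh, which can be called on a single string or a list of strings. Note that each call loads the model again, so for several directions it may be preferable to load the model and tokenizer once and reuse them.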