---
library_name: transformers
license: cc-by-nc-4.0
datasets:
- tahrirchi/dilmash
tags:
- nllb
- karakalpak
language:
- en
- ru
- uz
- kaa
base_model: facebook/nllb-200-distilled-600M
pipeline_tag: translation
---

# Dilmash: Karakalpak Machine Translation Models

This repository contains a collection of machine translation models for the Karakalpak language, developed as part of the research paper "Open Language Data Initiative: Advancing Low-Resource Machine Translation for Karakalpak".

## Model variations

We provide three variants of our Karakalpak translation model:

| Model | Tokenizer Length | Parameter Count | Unique Features |
|-------|------------------|-----------------|-----------------|
| [`dilmash-raw`](https://huggingface.co/tahrirchi/dilmash-raw) | 256,204 | 615M | Original NLLB tokenizer |
| [`dilmash`](https://huggingface.co/tahrirchi/dilmash) | 269,399 | 629M | Expanded tokenizer |
| [**`dilmash-TIL`**](https://huggingface.co/tahrirchi/dilmash-TIL) | **269,399** | **629M** | **Additional TIL corpus** |

**Common attributes:**
- **Base Model:** [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M)
- **Primary Dataset:** [Dilmash corpus](https://huggingface.co/datasets/tahrirchi/dilmash)
- **Languages:** Karakalpak, Uzbek, Russian, English

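As a rough sanity check, the gap between the parameter counts in the table can be related to the tokenizer sizes. The sketch below assumes that the extra parameters come only from new rows in a shared (tied) embedding matrix and that the base model's embedding width is 1024; neither assumption is stated on this card.

```python
# Figures taken from the model table above.
RAW_VOCAB = 256_204       # tokenizer length of dilmash-raw
EXPANDED_VOCAB = 269_399  # tokenizer length of dilmash and dilmash-TIL

# Assumption: embedding width of nllb-200-distilled-600M, with input and
# output embeddings tied so each new token adds exactly one row.
D_MODEL = 1024

added_tokens = EXPANDED_VOCAB - RAW_VOCAB
extra_params = added_tokens * D_MODEL

print(f"added tokens: {added_tokens:,}")
print(f"extra embedding parameters: {extra_params / 1e6:.1f}M")
```

Under these assumptions the expanded tokenizer accounts for about 13.5M parameters, which is in line with the difference between the rounded 615M and 629M figures.
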
## Intended uses & limitations

These models are designed for machine translation tasks involving the Karakalpak language. They can translate between Karakalpak, Uzbek, Russian, and English.

### How to use

You can use these models with the Transformers library. Here's a quick example:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_ckpt = "tahrirchi/dilmash-til"

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt)

# Example translation
input_text = "Here is dilmash translation model."

tokenizer.src_lang = "eng_Latn"
tokenizer.tgt_lang = "kaa_Latn"

inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs)
translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translated_text) # Dilmash awdarması modeli.
```

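The Transformers documentation for NLLB models selects the output language by passing `forced_bos_token_id` to `generate`. The helper below is a minimal sketch of that pattern for switching between translation directions. The `LANG_CODES` mapping and the function names are illustrative: `eng_Latn` and `kaa_Latn` appear in the example above, while `rus_Cyrl` and `uzn_Latn` are the standard NLLB codes and are assumed, not confirmed, to be the ones these checkpoints expect.

```python
# eng_Latn and kaa_Latn come from the example above; rus_Cyrl and uzn_Latn
# are the standard NLLB codes and are an assumption for these checkpoints.
LANG_CODES = {
    "en": "eng_Latn",
    "ru": "rus_Cyrl",
    "uz": "uzn_Latn",
    "kaa": "kaa_Latn",
}


def load_model(model_ckpt="tahrirchi/dilmash-til"):
    """Load the model and tokenizer (downloads the weights on first use)."""
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt)
    return model, tokenizer


def translate(text, src, tgt, model, tokenizer, max_new_tokens=128):
    """Translate `text` from `src` to `tgt`, both keys of LANG_CODES."""
    tokenizer.src_lang = LANG_CODES[src]

    # Look up the target language tag and fail loudly if it is unknown.
    tgt_id = tokenizer.convert_tokens_to_ids(LANG_CODES[tgt])
    if tgt_id == tokenizer.unk_token_id:
        raise ValueError(f"{LANG_CODES[tgt]} is not a token in this tokenizer")

    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        forced_bos_token_id=tgt_id,  # force the target language tag first
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
```

With this in place, `model, tokenizer = load_model()` followed by `translate("Here is dilmash translation model.", "en", "kaa", model, tokenizer)` covers the direction shown above, and swapping the language keys gives the other directions.
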
## Training data

The models were trained on a parallel corpus of 300,000 sentence pairs, consisting of:
- Uzbek-Karakalpak (100,000 pairs)
- Russian-Karakalpak (100,000 pairs)
- English-Karakalpak (100,000 pairs)

The dataset is available [here](https://huggingface.co/datasets/tahrirchi/dilmash).

## Training procedure

For full details of the training procedure, please refer to [our paper](https://arxiv.org/abs/2409.04269).

## Citation

If you use these models in your research, please cite our paper:

```bibtex
@misc{mamasaidov2024openlanguagedatainitiative,
      title={Open Language Data Initiative: Advancing Low-Resource Machine Translation for Karakalpak},
      author={Mukhammadsaid Mamasaidov and Abror Shopulatov},
      year={2024},
      eprint={2409.04269},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.04269},
}
```

## Gratitude

We are grateful to the following people and organizations for helping make this work happen:

- [David Dalé](https://daviddale.ru): for advice throughout the process
- Perizad Najimova: for expertise and assistance with the Karakalpak language
- [Nurlan Pirjanov](https://www.linkedin.com/in/nurlan-pirjanov/): for expertise and assistance with the Karakalpak language
- [Atabek Murtazaev](https://www.linkedin.com/in/atabek/): for advice throughout the process
- Ajiniyaz Nurniyazov: for advice throughout the process

We would also like to express our sincere appreciation to [Google for Startups](https://cloud.google.com/startup) for generously sponsoring the compute resources necessary for our experiments. Their support has been instrumental in advancing our research in low-resource language machine translation.

## Contacts

We believe that this work will enable and inspire enthusiasts around the world to uncover the hidden beauty of low-resource languages, in particular Karakalpak.

For further development or issues concerning the dataset, please contact us at [email protected] or [email protected].