---
library_name: transformers
license: cc-by-nc-4.0
datasets:
- tahrirchi/dilmash
tags:
- nllb
- karakalpak
language:
- en
- ru
- uz
- kaa
base_model: facebook/nllb-200-distilled-600M
pipeline_tag: translation
---
# Dilmash: Karakalpak Machine Translation Models
This repository contains a collection of machine translation models for the Karakalpak language, developed as part of the research paper "Open Language Data Initiative: Advancing Low-Resource Machine Translation for Karakalpak".
## Model variations
We provide three variants of our Karakalpak translation model:
| Model | Vocabulary Size | Parameter Count | Unique Features |
|-------|------------|-------------------|-----------------|
| [`dilmash-raw`](https://huggingface.co/tahrirchi/dilmash-raw) | 256,204 | 615M | Original NLLB tokenizer |
| [`dilmash`](https://huggingface.co/tahrirchi/dilmash) | 269,399 | 629M | Expanded tokenizer |
| [**`dilmash-TIL`**](https://huggingface.co/tahrirchi/dilmash-TIL) | **269,399** | **629M** | **Additional TIL corpus** |
**Common attributes:**
- **Base Model:** [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M)
- **Primary Dataset:** [Dilmash corpus](https://huggingface.co/datasets/tahrirchi/dilmash)
- **Languages:** Karakalpak, Uzbek, Russian, English
## Intended uses & limitations
These models are designed for machine translation tasks involving the Karakalpak language. They support translation among Karakalpak, Uzbek, Russian, and English.
### How to use
You can use these models with the Transformers library. Here's a quick example:
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model_ckpt = "tahrirchi/dilmash-til"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt)
# Example translation: English -> Karakalpak
input_text = "Here is dilmash translation model."

# Set the source language before tokenizing
tokenizer.src_lang = "eng_Latn"
inputs = tokenizer(input_text, return_tensors="pt")

# NLLB-style models select the target language via the forced BOS token
outputs = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("kaa_Latn"),
)
translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translated_text)  # Dilmash awdarması modeli.
```
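The same checkpoint should also work with the high-level `pipeline` API, which handles the source and target language tokens internally. Here is a minimal sketch of the reverse (Karakalpak-to-English) direction; the example sentence and its output are illustrative:

```python
from transformers import pipeline

# The translation pipeline passes src_lang/tgt_lang through to the tokenizer
translator = pipeline("translation", model="tahrirchi/dilmash-til")
result = translator(
    "Dilmash awdarması modeli.",
    src_lang="kaa_Latn",
    tgt_lang="eng_Latn",
)
print(result[0]["translation_text"])
```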
## Training data
The models were trained on a parallel corpus of 300,000 sentence pairs, comprising:
- Uzbek-Karakalpak (100,000 pairs)
- Russian-Karakalpak (100,000 pairs)
- English-Karakalpak (100,000 pairs)
The dataset is available [here](https://huggingface.co/datasets/tahrirchi/dilmash).
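To explore the corpus itself, it can be loaded with the `datasets` library. This is a minimal sketch; the configuration, split, and column names are assumptions, so check the dataset card for the actual schema:

```python
from datasets import load_dataset

# Load the Dilmash parallel corpus from the Hugging Face Hub.
# If the dataset defines multiple configurations (e.g. one per language
# pair), pass the configuration name as the second argument.
dataset = load_dataset("tahrirchi/dilmash")

print(dataset)  # available splits and their columns
```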
## Training procedure
For full details of the training procedure, please refer to [our paper](https://arxiv.org/abs/2409.04269).
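For orientation, below is a minimal `Seq2SeqTrainer` sketch of what fine-tuning on this corpus could look like, starting from the released checkpoint so that the `kaa_Latn` code is already in the tokenizer. The column names (`src_text`, `tgt_text`), split name, single translation direction, and hyperparameters are illustrative assumptions, not the paper's settings, and the tokenizer-expansion step from the paper is omitted:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

ckpt = "tahrirchi/dilmash-til"  # expanded tokenizer already includes kaa_Latn
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(ckpt)

def preprocess(batch):
    # Column names are hypothetical; the real corpus schema may differ
    tokenizer.src_lang = "eng_Latn"
    tokenizer.tgt_lang = "kaa_Latn"
    return tokenizer(
        batch["src_text"],
        text_target=batch["tgt_text"],
        truncation=True,
        max_length=128,
    )

raw = load_dataset("tahrirchi/dilmash", split="train")  # split name assumed
train_dataset = raw.map(preprocess, batched=True, remove_columns=raw.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="dilmash-finetune",
    per_device_train_batch_size=16,
    learning_rate=1e-4,
    num_train_epochs=3,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```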
## Citation
If you use these models in your research, please cite our paper:
```bibtex
@misc{mamasaidov2024openlanguagedatainitiative,
title={Open Language Data Initiative: Advancing Low-Resource Machine Translation for Karakalpak},
author={Mukhammadsaid Mamasaidov and Abror Shopulatov},
year={2024},
eprint={2409.04269},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.04269},
}
```
## Gratitude
We are thankful to these awesome organizations and people for helping to make it happen:
- [David Dalé](https://daviddale.ru): for advice throughout the process
- Perizad Najimova: for expertise and assistance with the Karakalpak language
- [Nurlan Pirjanov](https://www.linkedin.com/in/nurlan-pirjanov/): for expertise and assistance with the Karakalpak language
- [Atabek Murtazaev](https://www.linkedin.com/in/atabek/): for advice throughout the process
- Ajiniyaz Nurniyazov: for advice throughout the process
We would also like to express our sincere appreciation to [Google for Startups](https://cloud.google.com/startup) for generously sponsoring the compute resources necessary for our experiments. Their support has been instrumental in advancing our research in low-resource language machine translation.
## Contacts
We believe this work will enable and inspire enthusiasts around the world to uncover the hidden beauty of low-resource languages, in particular Karakalpak.
For questions about further development or issues with the dataset, please contact [email protected] or [email protected].