library_name: transformers
license: cc-by-nc-4.0
datasets:
- tahrirchi/dilmash
tags:
- nllb
- karakalpak
language:
- en
- ru
- uz
- kaa
base_model: facebook/nllb-200-distilled-600M
pipeline_tag: translation
Dilmash: Karakalpak Machine Translation Models
This repository contains a collection of machine translation models for the Karakalpak language, developed as part of the research paper "Open Language Data Initiative: Advancing Low-Resource Machine Translation for Karakalpak".
Model variations
We provide three variants of our Karakalpak translation model:
Model | Tokenizer Length | Parameter Count | Unique Features |
---|---|---|---|
dilmash-raw | 256,204 | 615M | Original NLLB tokenizer |
dilmash | 269,399 | 629M | Expanded tokenizer |
dilmash-TIL | 269,399 | 629M | Additional TIL corpus |
Common attributes:
- Base Model: facebook/nllb-200-distilled-600M
- Primary Dataset: Dilmash corpus
- Languages: Karakalpak, Uzbek, Russian, English
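The 14M gap between the 615M and 629M parameter counts in the table is roughly what the larger vocabulary alone would add. The sketch below checks that arithmetic. It assumes a hidden size of 1024 for the base model and a single shared embedding matrix; neither figure is stated in this card, so treat the result as a plausibility check rather than an exact accounting.

```python
# Rough check: does the tokenizer expansion explain the extra parameters?
RAW_VOCAB = 256_204       # dilmash-raw tokenizer length (from the table)
EXPANDED_VOCAB = 269_399  # dilmash / dilmash-TIL tokenizer length (from the table)
D_MODEL = 1024            # ASSUMPTION: hidden size of the base model, not stated in this card


def extra_embedding_params(old_vocab: int, new_vocab: int, d_model: int) -> int:
    """Parameters added by the new rows of one shared embedding matrix."""
    return (new_vocab - old_vocab) * d_model


added_tokens = EXPANDED_VOCAB - RAW_VOCAB
added_params = extra_embedding_params(RAW_VOCAB, EXPANDED_VOCAB, D_MODEL)

print(f"new tokens: {added_tokens:,}")                                 # new tokens: 13,195
print(f"extra embedding parameters: {added_params / 1e6:.1f}M")        # extra embedding parameters: 13.5M
```

Under these assumptions the expansion adds about 13.5M parameters, which is consistent with the rounded figures in the table.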
Intended uses & limitations
These models are designed for machine translation tasks involving the Karakalpak language. They can be used to translate between Karakalpak and Uzbek, Russian, or English.
How to use
You can use these models with the Transformers library. Here's a quick example:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model_ckpt = "tahrirchi/dilmash-til"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt)
# Example translation
input_text = "Here is dilmash translation model."
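tokenizer.src_lang = "eng_Latn"  # source language code
tokenizer.tgt_lang = "kaa_Latn"  # target language code
inputs = tokenizer(input_text, return_tensors="pt")
# generate() does not read tgt_lang, so force the first generated token to be the target-language code
outputs = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids(tokenizer.tgt_lang),
)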
translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translated_text) # Dilmash awdarması modeli.
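For translating several sentences at once, or switching between language pairs, a small helper can be convenient. The following is a minimal sketch, not part of the released code: the names `load_dilmash` and `translate` are hypothetical, and the language codes other than `eng_Latn` and `kaa_Latn` (for example `uzn_Latn` and `rus_Cyrl`) are assumed to follow the NLLB naming scheme. The helper refuses to run if the tokenizer does not know the requested target code.

```python
from typing import List


def load_dilmash(model_ckpt: str = "tahrirchi/dilmash-til"):
    """Load tokenizer and model (hypothetical helper; downloads from the Hub)."""
    # Imported here so the translate() helper below can be used and tested on its own.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt)
    return tokenizer, model


def translate(
    texts: List[str],
    src_lang: str,
    tgt_lang: str,
    tokenizer,
    model,
    max_new_tokens: int = 128,
) -> List[str]:
    """Translate a batch of sentences from src_lang to tgt_lang."""
    tgt_id = tokenizer.convert_tokens_to_ids(tgt_lang)
    if tgt_id == tokenizer.unk_token_id:
        # The code is not in the vocabulary, so forcing it would be meaningless.
        raise ValueError(f"Unknown target language code: {tgt_lang}")

    tokenizer.src_lang = src_lang
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(
        **inputs,
        forced_bos_token_id=tgt_id,  # first generated token is the target-language code
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

A typical call would be `tokenizer, model = load_dilmash()` followed by `translate(["Here is dilmash translation model."], "eng_Latn", "kaa_Latn", tokenizer, model)`. Passing the tokenizer and model explicitly keeps the helper independent of which of the three variants is loaded.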
Training data
The models were trained on a parallel corpus of 300,000 sentence pairs, including:
- Uzbek-Karakalpak (100,000 pairs)
- Russian-Karakalpak (100,000 pairs)
- English-Karakalpak (100,000 pairs)
The dataset is available on the Hugging Face Hub as tahrirchi/dilmash.
Training procedure
For full details of the training procedure, please refer to our paper (arXiv:2409.04269).
Citation
If you use these models in your research, please cite our paper:
@misc{mamasaidov2024openlanguagedatainitiative,
title={Open Language Data Initiative: Advancing Low-Resource Machine Translation for Karakalpak},
author={Mukhammadsaid Mamasaidov and Abror Shopulatov},
year={2024},
eprint={2409.04269},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.04269},
}
Gratitude
We are thankful to these awesome organizations and people for helping to make it happen:
- David Dalé: for advice throughout the process
- Perizad Najimova: for expertise and assistance with the Karakalpak language
- Nurlan Pirjanov: for expertise and assistance with the Karakalpak language
- Atabek Murtazaev: for advice throughout the process
- Ajiniyaz Nurniyazov: for advice throughout the process
We would also like to express our sincere appreciation to Google for Startups for generously sponsoring the compute resources necessary for our experiments. Their support has been instrumental in advancing our research in low-resource language machine translation.
Contacts
We believe that this work will enable and inspire enthusiasts around the world to uncover the hidden beauty of low-resource languages, Karakalpak in particular.
For questions about further development or issues with the dataset, please contact us at [email protected] or [email protected].