metadata

library_name: transformers
license: cc-by-nc-4.0
datasets:
  - tahrirchi/dilmash
tags:
  - nllb
  - karakalpak
language:
  - en
  - ru
  - uz
  - kaa
base_model: facebook/nllb-200-distilled-600M
pipeline_tag: translation

Dilmash: Karakalpak Machine Translation Models

This repository contains a collection of machine translation models for the Karakalpak language, developed as part of the research paper "Open Language Data Initiative: Advancing Low-Resource Machine Translation for Karakalpak".

Model variations

We provide three variants of our Karakalpak translation model:

| Model | Tokenizer Length | Parameter Count | Unique Features |
|---|---|---|---|
| `dilmash-raw` | 256,204 | 615M | Original NLLB tokenizer |
| `dilmash` | 269,399 | 629M | Expanded tokenizer |
| `dilmash-TIL` | 269,399 | 629M | Additional TIL corpus |

Common attributes:

Base Model: facebook/nllb-200-distilled-600M
Primary Dataset: Dilmash corpus
Languages: Karakalpak, Uzbek, Russian, English
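
The parameter gap between `dilmash-raw` and the expanded-tokenizer variants is roughly what the extra embedding rows alone would account for. The following back-of-the-envelope check is a sketch under two assumptions that are not stated in this card: that the base model's hidden size is 1024, and that the input and output embeddings share a single matrix.

```python
# Tokenizer lengths as reported in the table above.
ORIGINAL_VOCAB = 256_204
EXPANDED_VOCAB = 269_399

# Assumption: hidden size of the 600M distilled NLLB base model.
D_MODEL = 1024

# Each new token adds one embedding row of D_MODEL parameters
# (assuming the embedding matrix is shared with the output layer).
added_tokens = EXPANDED_VOCAB - ORIGINAL_VOCAB
added_params = added_tokens * D_MODEL

print(added_tokens)                   # 13195
print(f"{added_params / 1e6:.1f}M")   # 13.5M
```

Under these assumptions the estimate of about 13.5M extra parameters is consistent with the reported difference between 615M and 629M after rounding.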

Intended uses & limitations

These models are designed for machine translation involving the Karakalpak language. They can be used to translate between Karakalpak and Uzbek, Russian, or English.
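
As an illustration, the translation directions can be enumerated from a small table of language codes. This is only a sketch: `eng_Latn` and `kaa_Latn` appear in the usage example in this card, while `rus_Cyrl` and `uzn_Latn` are assumed here to be the codes for Russian and Uzbek, following the usual NLLB naming. The helper name `karakalpak_directions` is hypothetical, and the sketch assumes both directions of each pair are supported.

```python
from typing import Dict, List, Tuple

LANG_CODES: Dict[str, str] = {
    "English": "eng_Latn",     # used in this card's usage example
    "Karakalpak": "kaa_Latn",  # used in this card's usage example
    "Russian": "rus_Cyrl",     # assumption: standard NLLB code
    "Uzbek": "uzn_Latn",       # assumption: standard NLLB code
}


def karakalpak_directions() -> List[Tuple[str, str]]:
    """Return (source, target) code pairs in which one side is Karakalpak."""
    kaa = LANG_CODES["Karakalpak"]
    pairs: List[Tuple[str, str]] = []
    for name, code in LANG_CODES.items():
        if name == "Karakalpak":
            continue
        pairs.append((code, kaa))  # into Karakalpak
        pairs.append((kaa, code))  # out of Karakalpak
    return pairs


for src, tgt in karakalpak_directions():
    print(f"{src} -> {tgt}")
```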

How to use

You can use these models with the Transformers library. Here's a quick example:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_ckpt = "tahrirchi/dilmash"

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt)

# Example translation
input_text = "Here is dilmash translation model."

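# Language codes for the source text and the desired output
tokenizer.src_lang = "eng_Latn"
tokenizer.tgt_lang = "kaa_Latn"

inputs = tokenizer(input_text, return_tensors="pt")
# NLLB-style models pick the output language from the first generated token,
# so the target language code is passed to generate() explicitly
outputs = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids(tokenizer.tgt_lang),
)
translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translated_text) # Dilmash awdarması modeli.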
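
For translating several sentences at once or switching directions, the steps above can be wrapped in a small helper. The function below is a minimal sketch, not part of the released code: the name `translate` and the default values for `max_new_tokens` and `num_beams` are illustrative assumptions. It takes an already loaded tokenizer and model as arguments, so it makes no assumptions about which checkpoint is used.

```python
from typing import Any, List


def translate(
    texts: List[str],
    tokenizer: Any,
    model: Any,
    src_lang: str,
    tgt_lang: str,
    max_new_tokens: int = 128,  # illustrative default
    num_beams: int = 4,         # illustrative default
) -> List[str]:
    """Translate a batch of sentences from src_lang to tgt_lang."""
    # Tell the tokenizer which language code marks the inputs.
    tokenizer.src_lang = src_lang
    # Pad so that sentences of different lengths fit in one batch.
    inputs = tokenizer(texts, return_tensors="pt", padding=True)
    outputs = model.generate(
        **inputs,
        # Force the first generated token to be the target language code.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=max_new_tokens,
        num_beams=num_beams,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

With `tokenizer` and `model` loaded as in the example above, `translate(["Here is dilmash translation model."], tokenizer, model, "eng_Latn", "kaa_Latn")` would return a list with one translated string.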

Training data

The models were trained on a parallel corpus of 300,000 sentence pairs, including:

Uzbek-Karakalpak (100,000 pairs)
Russian-Karakalpak (100,000 pairs)
English-Karakalpak (100,000 pairs)

The dataset is available on the Hugging Face Hub as tahrirchi/dilmash.

Training procedure

For full details of the training procedure, please refer to our paper (arXiv:2409.04269).

Citation

If you use these models in your research, please cite our paper:

@misc{mamasaidov2024openlanguagedatainitiative,
      title={Open Language Data Initiative: Advancing Low-Resource Machine Translation for Karakalpak}, 
      author={Mukhammadsaid Mamasaidov and Abror Shopulatov},
      year={2024},
      eprint={2409.04269},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.04269}, 
}

Gratitude

We are thankful to these awesome organizations and people for helping to make this work happen:

David Dalé: for advice throughout the process
Perizad Najimova: for expertise and assistance with the Karakalpak language
Nurlan Pirjanov: for expertise and assistance with the Karakalpak language
Atabek Murtazaev: for advice throughout the process
Ajiniyaz Nurniyazov: for advice throughout the process

We would also like to express our sincere appreciation to Google for Startups for generously sponsoring the compute resources necessary for our experiments. Their support has been instrumental in advancing our research in low-resource language machine translation.

Contacts

We believe that this work will enable and inspire enthusiasts around the world to uncover the hidden beauty of low-resource languages, Karakalpak in particular.

For questions about further development or issues with the dataset, please contact us at [email protected] or [email protected].