library_name: transformers
license: cc-by-nc-4.0
datasets:
  - tahrirchi/dilmash
tags:
  - nllb
  - karakalpak
language:
  - en
  - ru
  - uz
  - kaa
base_model: facebook/nllb-200-distilled-600M
pipeline_tag: translation

Dilmash: Karakalpak Machine Translation Models

This repository contains a collection of machine translation models for the Karakalpak language, developed as part of the research paper "Open Language Data Initiative: Advancing Low-Resource Machine Translation for Karakalpak".

Model variations

We provide three variants of our Karakalpak translation model:

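| Model | Tokenizer Length | Parameter Count | Unique Features |
|-------|------------------|-----------------|-----------------|
| dilmash-raw | 256,204 | 615M | Original NLLB tokenizer |
| dilmash | 269,399 | 629M | Expanded tokenizer |
| dilmash-TIL | 269,399 | 629M | Additional TIL corpus |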
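
The gap in parameter count between dilmash-raw and the two expanded-tokenizer variants is consistent with the extra embedding rows alone. The short check below is a rough sketch, not taken from the paper: it assumes the base model's hidden size is 1024 and that a single embedding matrix is shared across the encoder, decoder, and output layer, so each new token adds one row of 1024 parameters.

```python
# Figures reported in the model table.
RAW_VOCAB = 256_204
EXPANDED_VOCAB = 269_399
RAW_PARAMS_M = 615  # millions of parameters, as reported for dilmash-raw

# Assumption: hidden size of facebook/nllb-200-distilled-600M (d_model = 1024),
# with one embedding matrix shared by encoder, decoder, and output layer.
D_MODEL = 1024

added_tokens = EXPANDED_VOCAB - RAW_VOCAB
added_params = added_tokens * D_MODEL
estimated_params_m = RAW_PARAMS_M + added_params / 1e6

print(f"added tokens: {added_tokens:,}")                 # added tokens: 13,195
print(f"added embedding parameters: {added_params:,}")   # added embedding parameters: 13,511,680
print(f"estimated total: {estimated_params_m:.1f}M")     # estimated total: 628.5M
```

Under these assumptions the estimate rounds to the 629M listed for dilmash and dilmash-TIL.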

Common attributes:

Intended uses & limitations

These models are designed for machine translation tasks involving the Karakalpak language. They can be used to translate between Karakalpak and Uzbek, Russian, or English.

How to use

You can use these models with the Transformers library. Here's a quick example:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_ckpt = "tahrirchi/dilmash"

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt)

# Example translation: English -> Karakalpak
input_text = "Here is dilmash translation model."

tokenizer.src_lang = "eng_Latn"
tokenizer.tgt_lang = "kaa_Latn"

inputs = tokenizer(input_text, return_tensors="pt")
# NLLB-style models pick the output language from the first generated token,
# so force it to the target code (assumes "kaa_Latn" is in the vocabulary).
outputs = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids(tokenizer.tgt_lang),
)
translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translated_text) # Dilmash awdarması modeli.
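
For translating several sentences at once, or in other directions, the same steps can be wrapped in a small helper. This is a minimal sketch rather than part of the released code: `LANG_CODES` and `translate` are hypothetical names, `kaa_Latn` and `eng_Latn` come from the example above, and `rus_Cyrl` and `uzn_Latn` are the standard NLLB-200 codes, assumed here to be the ones the models were trained with. The `max_new_tokens` and `num_beams` defaults are illustrative values, not tuned settings.

```python
# Map the card's language tags to NLLB-style codes.
# Assumption: Russian and Uzbek use the standard NLLB-200 codes.
LANG_CODES = {
    "en": "eng_Latn",
    "ru": "rus_Cyrl",
    "uz": "uzn_Latn",
    "kaa": "kaa_Latn",
}


def translate(texts, src, tgt, tokenizer, model, max_new_tokens=128, num_beams=4):
    """Translate a list of sentences from `src` to `tgt` (keys of LANG_CODES)."""
    # The source code is prepended to the input by the tokenizer.
    tokenizer.src_lang = LANG_CODES[src]
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(
        **inputs,
        # Force the first generated token to be the target language code.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(LANG_CODES[tgt]),
        max_new_tokens=max_new_tokens,
        num_beams=num_beams,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


# Usage, with `tokenizer` and `model` loaded as in the example above:
# translate(["Good morning.", "Thank you very much."], "en", "kaa", tokenizer, model)
```

Passing the tokenizer and model in as arguments keeps the helper independent of which of the three checkpoints is loaded.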

Training data

The models were trained on a parallel corpus of 300,000 sentence pairs, including:

  • Uzbek-Karakalpak (100,000 pairs)
  • Russian-Karakalpak (100,000 pairs)
  • English-Karakalpak (100,000 pairs)

The dataset is available on the Hugging Face Hub as tahrirchi/dilmash.

Training procedure

For full details of the training procedure, please refer to our paper (arXiv:2409.04269).

Citation

If you use these models in your research, please cite our paper:

@misc{mamasaidov2024openlanguagedatainitiative,
      title={Open Language Data Initiative: Advancing Low-Resource Machine Translation for Karakalpak}, 
      author={Mukhammadsaid Mamasaidov and Abror Shopulatov},
      year={2024},
      eprint={2409.04269},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.04269}, 
}

Gratitude

We are thankful to these awesome organizations and people for helping to make it happen:

  • David Dalé: for advice throughout the process
  • Perizad Najimova: for expertise and assistance with the Karakalpak language
  • Nurlan Pirjanov: for expertise and assistance with the Karakalpak language
  • Atabek Murtazaev: for advice throughout the process
  • Ajiniyaz Nurniyazov: for advice throughout the process

We would also like to express our sincere appreciation to Google for Startups for generously sponsoring the compute resources necessary for our experiments. Their support has been instrumental in advancing our research in low-resource language machine translation.

Contacts

We believe that this work will enable and inspire enthusiasts around the world to uncover the hidden beauty of low-resource languages, Karakalpak in particular.

For questions about further development or issues with the dataset, please contact [email protected] or [email protected].