murodbek committed
Commit
07e7719
1 Parent(s): 72cf81f

changing citation and some minor changes

Files changed (1)
README.md +22 -11
README.md CHANGED
@@ -22,11 +22,16 @@ This repository contains a collection of machine translation models for the Kara
 
 We provide three variants of our Karakalpak translation model:
 
-| Model | Base Model | Parameters | Tokenizer Length | Datasets | Languages |
-|-------|------------|------------|------------------|----------|-----------|
-| [`dilmash-raw`](https://huggingface.co/tahrirchi/dilmash-raw) | [nllb-200-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) | 615M | 256,204 | [Dilmash corpus](https://huggingface.co/datasets/tahrirchi/dilmash) | Karakalpak, Uzbek, Russian, English |
-| [`dilmash`](https://huggingface.co/tahrirchi/dilmash) | [nllb-200-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) | 629M | 269,399 | [Dilmash corpus](https://huggingface.co/datasets/tahrirchi/dilmash) | Karakalpak, Uzbek, Russian, English |
-| **[`dilmash-TIL`](https://huggingface.co/tahrirchi/dilmash-TIL)** | **[nllb-200-600M](https://huggingface.co/facebook/nllb-200-distilled-600M)** | **629M** | **269,399** | **[Dilmash corpus](https://huggingface.co/datasets/tahrirchi/dilmash), TIL corpus** | **Karakalpak, Uzbek, Russian, English** |
+| Model | Tokenizer Length | Parameter Count | Unique Features |
+|-------|------------------|-----------------|-----------------|
+| [`dilmash-raw`](https://huggingface.co/tahrirchi/dilmash-raw) | 256,204 | 615M | Original NLLB tokenizer |
+| [`dilmash`](https://huggingface.co/tahrirchi/dilmash) | 269,399 | 629M | Expanded tokenizer |
+| [**`dilmash-TIL`**](https://huggingface.co/tahrirchi/dilmash-TIL) | **269,399** | **629M** | **Additional TIL corpus** |
+
+**Common attributes:**
+- **Base Model:** [nllb-200-600M](https://huggingface.co/facebook/nllb-200-distilled-600M)
+- **Primary Dataset:** [Dilmash corpus](https://huggingface.co/datasets/tahrirchi/dilmash)
+- **Languages:** Karakalpak, Uzbek, Russian, English
 
 ## Intended uses & limitations
 
@@ -67,18 +72,21 @@ The dataset is available [here](https://huggingface.co/datasets/tahrirchi/dilmas
 
 ## Training procedure
 
-For full details of the training procedure, please refer to our paper (coming soon!).
+For full details of the training procedure, please refer to [our paper](https://arxiv.org/abs/2409.04269).
 
 ## Citation
 
 If you use these models in your research, please cite our paper:
 
 ```bibtex
-@inproceedings{mamasaidov2024advancing,
-  title={Open Language Data Initiative: Advancing Low-Resource Machine Translation for Karakalpak},
-  author={Mamasaidov, Mukhammadsaid and Shopulatov, Abror},
-  booktitle={Proceedings of the OLDI Workshop},
-  year={2024}
+@misc{mamasaidov2024openlanguagedatainitiative,
+  title={Open Language Data Initiative: Advancing Low-Resource Machine Translation for Karakalpak},
+  author={Mukhammadsaid Mamasaidov and Abror Shopulatov},
+  year={2024},
+  eprint={2409.04269},
+  archivePrefix={arXiv},
+  primaryClass={cs.CL},
+  url={https://arxiv.org/abs/2409.04269},
 }
 ```
 
@@ -92,6 +100,9 @@ We are thankful to these awesome organizations and people for helping to make it
 - [Atabek Murtazaev](https://www.linkedin.com/in/atabek/): for advice throughout the process
 - Ajiniyaz Nurniyazov: for advice throughout the process
 
+We would also like to express our sincere appreciation to [Google for Startups](https://cloud.google.com/startup) for generously sponsoring the compute resources necessary for our experiments. Their support has been instrumental in advancing our research in low-resource language machine translation.
+
+
 ## Contacts
 
 We believe that this work will enable and inspire all enthusiasts around the world to open the hidden beauty of low-resource languages, in particular Karakalpak.
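As a side note, the parameter counts in the reworked table line up with the tokenizer expansion. A minimal arithmetic sketch, assuming nllb-200-600M's hidden size of 1024 and tied input/output embeddings (neither is stated in the diff), so each added vocabulary entry contributes one embedding row:

```python
# Sanity check of the model-card table: does growing the tokenizer from
# 256,204 to 269,399 entries explain the jump from 615M to 629M parameters?
# Assumptions (not stated in the README): hidden size 1024, tied embeddings.
old_vocab, new_vocab = 256_204, 269_399
hidden_size = 1024

added_params = (new_vocab - old_vocab) * hidden_size
print(f"added embedding parameters: {added_params / 1e6:.1f}M")
```

The ~13.5M added embedding parameters take `dilmash-raw`'s 615M to roughly 629M, matching the figure reported for `dilmash` and `dilmash-TIL`.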