Change citation and make some minor edits
README.md
CHANGED
@@ -22,11 +22,16 @@ This repository contains a collection of machine translation models for the Kara
 
 We provide three variants of our Karakalpak translation model:
 
-| Model |
-
-| [`dilmash-raw`](https://huggingface.co/tahrirchi/dilmash-raw) |
-| [`dilmash`](https://huggingface.co/tahrirchi/dilmash) |
-
+| Model | Tokenizer Length | Parameter Count | Unique Features |
+|-------|------------------|-----------------|-----------------|
+| [`dilmash-raw`](https://huggingface.co/tahrirchi/dilmash-raw) | 256,204 | 615M | Original NLLB tokenizer |
+| [`dilmash`](https://huggingface.co/tahrirchi/dilmash) | 269,399 | 629M | Expanded tokenizer |
+| [**`dilmash-TIL`**](https://huggingface.co/tahrirchi/dilmash-TIL) | **269,399** | **629M** | **Additional TIL corpus** |
+
+**Common attributes:**
+- **Base Model:** [nllb-200-600M](https://huggingface.co/facebook/nllb-200-distilled-600M)
+- **Primary Dataset:** [Dilmash corpus](https://huggingface.co/datasets/tahrirchi/dilmash)
+- **Languages:** Karakalpak, Uzbek, Russian, English
 
 ## Intended uses & limitations
 
@@ -67,18 +72,21 @@ The dataset is available [here](https://huggingface.co/datasets/tahrirchi/dilmas
 
 ## Training procedure
 
-For full details of the training procedure, please refer to our paper
+For full details of the training procedure, please refer to [our paper](https://arxiv.org/abs/2409.04269).
 
 ## Citation
 
 If you use these models in your research, please cite our paper:
 
 ```bibtex
-@
-
-
-
-
+@misc{mamasaidov2024openlanguagedatainitiative,
+      title={Open Language Data Initiative: Advancing Low-Resource Machine Translation for Karakalpak},
+      author={Mukhammadsaid Mamasaidov and Abror Shopulatov},
+      year={2024},
+      eprint={2409.04269},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2409.04269},
 }
 ```
 
@@ -92,6 +100,9 @@ We are thankful to these awesome organizations and people for helping to make it
 - [Atabek Murtazaev](https://www.linkedin.com/in/atabek/): for advise throughout the process
 - Ajiniyaz Nurniyazov: for advise throughout the process
 
+We would also like to express our sincere appreciation to [Google for Startups](https://cloud.google.com/startup) for generously sponsoring the compute resources necessary for our experiments. Their support has been instrumental in advancing our research in low-resource language machine translation.
+
+
 ## Contacts
 
 We believe that this work will enable and inspire all enthusiasts around the world to open the hidden beauty of low-resource languages, in particular Karakalpak.
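Since the updated card lists three ready-to-use NLLB-based checkpoints, a minimal usage sketch may help readers try them. This example is not taken from the repository: the language codes (`eng_Latn`, `kaa_Latn`) and generation settings are assumptions, so check the model card for the exact codes the expanded tokenizer expects.

```python
# Hedged sketch: load one of the dilmash checkpoints with Hugging Face
# transformers and translate in the usual NLLB-200 style. The language codes
# "eng_Latn" and "kaa_Latn" are assumptions, not confirmed by this diff.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "tahrirchi/dilmash"  # or "tahrirchi/dilmash-raw" / "tahrirchi/dilmash-TIL"
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

text = "Machine translation helps low-resource languages reach a wider audience."
inputs = tokenizer(text, return_tensors="pt")

# NLLB-style models select the target language by forcing its language-code
# token as the first decoder token.
outputs = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("kaa_Latn"),
    max_new_tokens=64,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```

The `forced_bos_token_id` argument is how the NLLB family steers decoding toward a chosen target language; everything else follows the standard seq2seq generation flow.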