Commit 83798db by ibraheemmoosa
1 Parent(s): 2725cd9
Update README.md
README.md
CHANGED
@@ -42,8 +42,8 @@ co2_eq_emissions:
|
|
42 |
|
43 |
# XLMIndic Base Uniscript
|
44 |
|
45 |
-
This model is pretrained on a subset of the [OSCAR](https://huggingface.co/datasets/oscar) corpus spanning 14 Indo-Aryan languages. Before pretraining this model we transliterate the text to [ISO-15919](https://en.wikipedia.org/wiki/ISO_15919) format using the [Aksharamukha](https://pypi.org/project/aksharamukha/)
|
46 |
-
library
|
47 |
where you can transliterate your text and use it on our model on the inference widget.
|
48 |
|
49 |
## Model description
|
@@ -55,6 +55,7 @@ This model has the same configuration as the [ALBERT Base v2 model](https://hugg
|
|
55 |
- 768 hidden dimension
|
56 |
- 12 attention heads
|
57 |
- 11M parameters
|
58 |
|
59 |
## Training data
|
60 |
|
@@ -80,17 +81,24 @@ These are the 14 languages we pretrain this model on:
|
|
80 |
|
81 |
## Transliteration
|
82 |
|
83 |
-
The unique component of this model is that it takes in ISO-15919 transliterated text
|
84 |
|
85 |
The motivation is as follows. When two languages share vocabulary, a machine learning model can exploit that to learn good cross-lingual representations. However, if the two languages use different writing scripts, it is difficult for a model to make the connection. Thus, if we can write the two languages in a single script, it is easier for the model to learn good cross-lingual representations.
|
86 |
|
87 |
For many of the scripts currently in use, there are standard transliteration schemes to convert to the Latin script. In particular, for the Indic scripts the ISO-15919 transliteration scheme is designed to consistently transliterate texts written in different Indic scripts to the Latin script.
|
88 |
|
89 |
-
An example of ISO-15919 transliteration for a piece of Bangla text is the following:
|
90 |
|
91 |
**Original:** "রবীন্দ্রনাথ ঠাকুর এফআরএএস (৭ মে ১৮৬১ - ৭ আগস্ট ১৯৪১; ২৫ বৈশাখ ১২৬৮ - ২২ শ্রাবণ ১৩৪৮ বঙ্গাব্দ) ছিলেন অগ্রণী বাঙালি কবি, ঔপন্যাসিক, সংগীতস্রষ্টা, নাট্যকার, চিত্রকর, ছোটগল্পকার, প্রাবন্ধিক, অভিনেতা, কণ্ঠশিল্পী ও দার্শনিক।"
|
92 |
**Transliterated:** 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli kabi, aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika.'
|
93 |
|
94 |
|
95 |
## Training procedure
|
96 |
|
@@ -140,6 +148,15 @@ To use this model you will need to first install the [Aksharamukha](https://pypi
|
|
140 |
pip install aksharamukha
|
141 |
```
|
142 |
|
143 |
Then you can use this model directly with a pipeline for masked language modeling:
|
144 |
|
145 |
```python
|
|
|
42 |
|
43 |
# XLMIndic Base Uniscript
|
44 |
|
45 |
+
This model is pretrained on a subset of the [OSCAR](https://huggingface.co/datasets/oscar) corpus spanning 14 Indo-Aryan languages. **Before pretraining this model we transliterate the text to [ISO-15919](https://en.wikipedia.org/wiki/ISO_15919) format using the [Aksharamukha](https://pypi.org/project/aksharamukha/)
|
46 |
+
library.** A demo of the Aksharamukha library is hosted [here](https://aksharamukha.appspot.com/converter)
|
47 |
where you can transliterate your text and use it on our model on the inference widget.
|
48 |
|
49 |
## Model description
|
|
|
55 |
- 768 hidden dimension
|
56 |
- 12 attention heads
|
57 |
- 11M parameters
|
58 |
+
- 512 sequence length
|
59 |
|
60 |
## Training data
|
61 |
|
|
|
81 |
|
82 |
## Transliteration
|
83 |
|
84 |
+
*The unique component of this model is that it takes in ISO-15919 transliterated text.*
|
85 |
|
86 |
The motivation is as follows. When two languages share vocabulary, a machine learning model can exploit that to learn good cross-lingual representations. However, if the two languages use different writing scripts, it is difficult for a model to make the connection. Thus, if we can write the two languages in a single script, it is easier for the model to learn good cross-lingual representations.
|
87 |
|
88 |
For many of the scripts currently in use, there are standard transliteration schemes to convert to the Latin script. In particular, for the Indic scripts the ISO-15919 transliteration scheme is designed to consistently transliterate texts written in different Indic scripts to the Latin script.
|
89 |
|
90 |
+
An example of ISO-15919 transliteration for a piece of **Bangla** text is the following:
|
91 |
|
92 |
**Original:** "রবীন্দ্রনাথ ঠাকুর এফআরএএস (৭ মে ১৮৬১ - ৭ আগস্ট ১৯৪১; ২৫ বৈশাখ ১২৬৮ - ২২ শ্রাবণ ১৩৪৮ বঙ্গাব্দ) ছিলেন অগ্রণী বাঙালি কবি, ঔপন্যাসিক, সংগীতস্রষ্টা, নাট্যকার, চিত্রকর, ছোটগল্পকার, প্রাবন্ধিক, অভিনেতা, কণ্ঠশিল্পী ও দার্শনিক।"
|
93 |
+
|
94 |
**Transliterated:** 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli kabi, aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika.'
|
95 |
|
96 |
+
Another example for a piece of **Hindi** text is the following:
|
97 |
+
|
98 |
+
**Original:** "चूंकि मानव परिवार के सभी सदस्यों के जन्मजात गौरव और समान तथा अविच्छिन्न अधिकार की स्वीकृति ही विश्व-शान्ति, न्याय और स्वतन्त्रता की बुनियाद है"
|
99 |
+
|
100 |
+
**Transliterated:** "cūṁki mānava parivāra kē sabhī sadasyōṁ kē janmajāta gaurava aura samāna tathā avicchinna adhikāra kī svīkr̥ti hī viśva-śānti, nyāya aura svatantratā kī buniyāda hai"
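
The single-script motivation can be illustrated with a rough sketch. It takes short fragments from the Bangla and Hindi examples in this section and compares the letters the two languages share before and after transliteration. The helper names `letters` and `jaccard` are illustrative only and are not part of the model or of Aksharamukha, and character-level Jaccard overlap is a stand-in measure chosen for simplicity, not the metric used to train or evaluate the model.

```python
def letters(text: str) -> set[str]:
    """Return the alphabetic characters in text.

    Combining marks, digits, punctuation and spaces are dropped,
    because str.isalpha() is False for them.
    """
    return {ch for ch in text if ch.isalpha()}


def jaccard(a: str, b: str) -> float:
    """Jaccard overlap between the letter sets of two strings."""
    set_a, set_b = letters(a), letters(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Fragments copied from the Bangla and Hindi examples in this section.
bangla_original = "অগ্রণী বাঙালি কবি"
hindi_original = "मानव परिवार के सभी"
bangla_iso = "agraṇī bāṅāli kabi"
hindi_iso = "mānava parivāra kē sabhī"

original_overlap = jaccard(bangla_original, hindi_original)
iso_overlap = jaccard(bangla_iso, hindi_iso)

print(f"original scripts: {original_overlap:.2f}")
print(f"ISO-15919:        {iso_overlap:.2f}")
```

In their original scripts the two fragments share no letters at all, since Bengali and Devanagari letters occupy separate Unicode blocks. After transliteration both are written with the same Latin letters, so the overlap becomes non-zero and a shared subword vocabulary becomes possible.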
|
101 |
+
|
102 |
|
103 |
## Training procedure
|
104 |
|
|
|
148 |
pip install aksharamukha
|
149 |
```
|
150 |
|
151 |
+
Using this library you can transliterate any text written in Indic scripts in the following way:
|
152 |
+
```python
|
153 |
+
>>> from aksharamukha import transliterate
|
154 |
+
>>> text = "चूंकि मानव परिवार के सभी सदस्यों के जन्मजात गौरव और समान तथा अविच्छिन्न अधिकार की स्वीकृति ही विश्व-शान्ति, न्याय और स्वतन्त्रता की बुनियाद है"
|
155 |
+
>>> transliterated_text = transliterate.process('autodetect', 'ISO', text)
|
156 |
+
>>> transliterated_text
|
157 |
+
"cūṁki mānava parivāra kē sabhī sadasyōṁ kē janmajāta gaurava aura samāna tathā avicchinna adhikāra kī svīkr̥ti hī viśva-śānti, nyāya aura svatantratā kī buniyāda hai"
|
158 |
+
```
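
When many sentences need to be prepared for the model, the same call can be wrapped in a small helper. The following is a minimal sketch, not part of this repository: the function name `to_iso15919` and its `transliterate_fn` parameter are hypothetical, and the only Aksharamukha call it relies on is the `transliterate.process('autodetect', 'ISO', text)` call shown above.

```python
from typing import Callable, Iterable, List, Optional


def to_iso15919(
    texts: Iterable[str],
    transliterate_fn: Optional[Callable[[str], str]] = None,
) -> List[str]:
    """Transliterate each non-empty text, skipping blank entries.

    transliterate_fn is a hypothetical hook: when it is None, the
    Aksharamukha call from the example above is used. Passing a custom
    function allows the helper to be tried without the library installed.
    """
    if transliterate_fn is None:
        # Imported lazily so that importing this helper does not require
        # Aksharamukha unless it is actually used.
        from aksharamukha import transliterate

        def default_fn(text: str) -> str:
            return transliterate.process('autodetect', 'ISO', text)

        transliterate_fn = default_fn

    results = []
    for text in texts:
        stripped = text.strip()
        if stripped:  # skip empty or whitespace-only entries
            results.append(transliterate_fn(stripped))
    return results
```

Calling `to_iso15919(lines)` on a list of strings in Indic scripts would then return the transliterated strings in the same order, with blank entries dropped.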
|
159 |
+
|
160 |
Then you can use this model directly with a pipeline for masked language modeling:
|
161 |
|
162 |
```python
|