ibraheemmoosa committed on
Commit
83798db
·
1 Parent(s): 2725cd9

Update README.md

  1. README.md +21 -4
README.md CHANGED
@@ -42,8 +42,8 @@ co2_eq_emissions:
42
 
43
  # XLMIndic Base Uniscript
44
 
45
- This model is pretrained on a subset of the [OSCAR](https://huggingface.co/datasets/oscar) corpus spanning 14 Indo-Aryan languages. Before pretraining this model we transliterate the text to [ISO-15919](https://en.wikipedia.org/wiki/ISO_15919) format using the [Aksharamukha](https://pypi.org/project/aksharamukha/)
46
- library. A demo of Aksharamukha library is hosted [here](https://aksharamukha.appspot.com/converter)
47
  where you can transliterate your text and use it on our model on the inference widget.
48
 
49
  ## Model description
@@ -55,6 +55,7 @@ This model has the same configuration as the [ALBERT Base v2 model](https://hugg
55
  - 768 hidden dimension
56
  - 12 attention heads
57
  - 11M parameters
 
58
 
59
  ## Training data
60
 
@@ -80,17 +81,24 @@ These are the 14 languages we pretrain this model on:
80
 
81
  ## Transliteration
82
 
83
- The unique component of this model is that it takes in ISO-15919 transliterated text.
84
 
85
  The motivation is as follows. When two languages share vocabulary, a machine learning model can exploit that to learn good cross-lingual representations. However, if the two languages use different writing scripts, it is difficult for the model to make the connection. Thus, if we can write the two languages in a single script, it is easier for the model to learn good cross-lingual representations.
86
 
87
  For many of the scripts currently in use, there are standard transliteration schemes to convert to the Latin script. In particular, for the Indic scripts the ISO-15919 transliteration scheme is designed to consistently transliterate texts written in different Indic scripts to the Latin script.
88
 
89
- An example of ISO-15919 transliteration for a piece of Bangla text is the following:
90
 
91
  **Original:** "রবীন্দ্রনাথ ঠাকুর এফআরএএস (৭ মে ১৮৬১ - ৭ আগস্ট ১৯৪১; ২৫ বৈশাখ ১২৬৮ - ২২ শ্রাবণ ১৩৪৮ বঙ্গাব্দ) ছিলেন অগ্রণী বাঙালি কবি, ঔপন্যাসিক, সংগীতস্রষ্টা, নাট্যকার, চিত্রকর, ছোটগল্পকার, প্রাবন্ধিক, অভিনেতা, কণ্ঠশিল্পী ও দার্শনিক।"
 
92
  **Transliterated:** 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli kabi, aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika.'
93
 
94
 
95
  ## Training procedure
96
 
@@ -140,6 +148,15 @@ To use this model you will need to first install the [Aksharamukha](https://pypi
140
  pip install aksharamukha
141
  ```
142
 
143
  Then you can use this model directly with a pipeline for masked language modeling:
144
 
145
  ```python
 
42
 
43
  # XLMIndic Base Uniscript
44
 
45
+ This model is pretrained on a subset of the [OSCAR](https://huggingface.co/datasets/oscar) corpus spanning 14 Indo-Aryan languages. **Before pretraining this model we transliterate the text to [ISO-15919](https://en.wikipedia.org/wiki/ISO_15919) format using the [Aksharamukha](https://pypi.org/project/aksharamukha/)
46
+ library.** A demo of the Aksharamukha library is hosted [here](https://aksharamukha.appspot.com/converter)
47
  where you can transliterate your text and use it on our model on the inference widget.
48
 
49
  ## Model description
 
55
  - 768 hidden dimension
56
  - 12 attention heads
57
  - 11M parameters
58
+ - 512 sequence length
59
 
60
  ## Training data
61
 
 
81
 
82
  ## Transliteration
83
 
84
+ *The unique component of this model is that it takes in ISO-15919 transliterated text.*
85
 
86
  The motivation is as follows. When two languages share vocabulary, a machine learning model can exploit that to learn good cross-lingual representations. However, if the two languages use different writing scripts, it is difficult for the model to make the connection. Thus, if we can write the two languages in a single script, it is easier for the model to learn good cross-lingual representations.
87
 
88
  For many of the scripts currently in use, there are standard transliteration schemes to convert to the Latin script. In particular, for the Indic scripts the ISO-15919 transliteration scheme is designed to consistently transliterate texts written in different Indic scripts to the Latin script.
89
 
90
+ An example of ISO-15919 transliteration for a piece of **Bangla** text is the following:
91
 
92
  **Original:** "রবীন্দ্রনাথ ঠাকুর এফআরএএস (৭ মে ১৮৬১ - ৭ আগস্ট ১৯৪১; ২৫ বৈশাখ ১২৬৮ - ২২ শ্রাবণ ১৩৪৮ বঙ্গাব্দ) ছিলেন অগ্রণী বাঙালি কবি, ঔপন্যাসিক, সংগীতস্রষ্টা, নাট্যকার, চিত্রকর, ছোটগল্পকার, প্রাবন্ধিক, অভিনেতা, কণ্ঠশিল্পী ও দার্শনিক।"
93
+
94
  **Transliterated:** 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli kabi, aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika.'
95
 
96
+ Another example for a piece of **Hindi** text is the following:
97
+
98
+ **Original:** "चूंकि मानव परिवार के सभी सदस्यों के जन्मजात गौरव और समान तथा अविच्छिन्न अधिकार की स्वीकृति ही विश्व-शान्ति, न्याय और स्वतन्त्रता की बुनियाद है"
99
+
100
+ **Transliterated:** "cūṁki mānava parivāra kē sabhī sadasyōṁ kē janmajāta gaurava aura samāna tathā avicchinna adhikāra kī svīkr̥ti hī viśva-śānti, nyāya aura svatantratā kī buniyāda hai"
101
+
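The single-script property illustrated by the two examples above can be checked mechanically. The sketch below uses only Python's standard `unicodedata` module. The helper name `scripts_used` is hypothetical and is not part of this model or of Aksharamukha, and treating the first word of a character's Unicode name as its script is a simplifying assumption that happens to hold for the scripts shown here. The sample words are taken from the Bangla and Hindi examples.

```python
import unicodedata


def scripts_used(text):
    """Return the set of script names used by the letters in `text`.

    Assumption: the first word of a character's Unicode name
    (e.g. "DEVANAGARI LETTER MA") identifies its script.
    """
    scripts = set()
    for ch in text:
        # Only letters are inspected; digits, punctuation, spaces,
        # vowel signs and combining marks are skipped.
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


# Words taken from the Bangla and Hindi examples above.
originals = ["কবি", "मानव"]
transliterated = ["kabi", "mānava"]

# The originals are written in two different scripts ...
print(scripts_used(" ".join(originals)))
# ... while their ISO-15919 forms share a single script.
print(scripts_used(" ".join(transliterated)))
```

After transliteration, a tokenizer sees both languages through one alphabet, which is the condition the motivation above relies on.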
102
 
103
  ## Training procedure
104
 
 
148
  pip install aksharamukha
149
  ```
150
 
151
+ Using this library you can transliterate any text written in Indic scripts in the following way:
152
+ ```python
153
+ >>> from aksharamukha import transliterate
154
+ >>> text = "चूंकि मानव परिवार के सभी सदस्यों के जन्मजात गौरव और समान तथा अविच्छिन्न अधिकार की स्वीकृति ही विश्व-शान्ति, न्याय और स्वतन्त्रता की बुनियाद है"
155
+ >>> transliterated_text = transliterate.process('autodetect', 'ISO', text)
156
+ >>> transliterated_text
157
+ "cūṁki mānava parivāra kē sabhī sadasyōṁ kē janmajāta gaurava aura samāna tathā avicchinna adhikāra kī svīkr̥ti hī viśva-śānti, nyāya aura svatantratā kī buniyāda hai"
158
+ ```
159
+
160
  Then you can use this model directly with a pipeline for masked language modeling:
161
 
162
  ```python