ibraheemmoosa
/

xlmindic-base-uniscript

@@ -42,8 +42,8 @@ co2_eq_emissions:
 # XLMIndic Base Uniscript
-This model is pretrained on a subset of the [OSCAR](https://huggingface.co/datasets/oscar) corpus spanning 14 Indo-Aryan languages. Before pretraining this model we transliterate the text to [ISO-15919](https://en.wikipedia.org/wiki/ISO_15919) format using the [Aksharamukha](https://pypi.org/project/aksharamukha/)
-library. A demo of Aksharamukha library is hosted [here](https://aksharamukha.appspot.com/converter)
 where you can transliterate your text and use it with our model in the inference widget.
 ## Model description
@@ -55,6 +55,7 @@ This model has the same configuration as the [ALBERT Base v2 model](https://hugg
 - 768 hidden dimension
 - 12 attention heads
 - 11M parameters
 ## Training data
@@ -80,17 +81,24 @@ These are the 14 languages we pretrain this model on:
 ## Transliteration
-The unique component of this model is that it takes in ISO-15919 transliterated text.
 The motivation is as follows. When two languages share vocabulary, a machine learning model can exploit that to learn good cross-lingual representations. However, if the two languages use different writing scripts, it is difficult for the model to make the connection. Thus, if we write the two languages in a single script, it is easier for the model to learn good cross-lingual representations.
 For many of the scripts currently in use, there are standard transliteration schemes to convert to the Latin script. In particular, for the Indic scripts the ISO-15919 transliteration scheme is designed to consistently transliterate texts written in different Indic scripts to the Latin script.
-An example of ISO-15919 transliteration for a piece of Bangla text is the following:
 **Original:** "রবীন্দ্রনাথ ঠাকুর এফআরএএস (৭ মে ১৮৬১ - ৭ আগস্ট ১৯৪১; ২৫ বৈশাখ ১২৬৮ - ২২ শ্রাবণ ১৩৪৮ বঙ্গাব্দ) ছিলেন অগ্রণী বাঙালি কবি, ঔপন্যাসিক, সংগীতস্রষ্টা, নাট্যকার, চিত্রকর, ছোটগল্পকার, প্রাবন্ধিক, অভিনেতা, কণ্ঠশিল্পী ও দার্শনিক।"
 **Transliterated:** 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli kabi, aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika.'
 ## Training procedure
@@ -140,6 +148,15 @@ To use this model you will need to first install the [Aksharamukha](https://pypi
 pip install aksharamukha
 ```
 Then you can use this model directly with a pipeline for masked language modeling:
 ```python

 # XLMIndic Base Uniscript
+This model is pretrained on a subset of the [OSCAR](https://huggingface.co/datasets/oscar) corpus spanning 14 Indo-Aryan languages. **Before pretraining this model we transliterate the text to [ISO-15919](https://en.wikipedia.org/wiki/ISO_15919) format using the [Aksharamukha](https://pypi.org/project/aksharamukha/)
+library.** A demo of the Aksharamukha library is hosted [here](https://aksharamukha.appspot.com/converter)
 where you can transliterate your text and use it with our model in the inference widget.
 ## Model description
 - 768 hidden dimension
 - 12 attention heads
 - 11M parameters
+- 512 sequence length
 ## Training data
 ## Transliteration
+*The unique component of this model is that it takes in ISO-15919 transliterated text.*
 The motivation is as follows. When two languages share vocabulary, a machine learning model can exploit that to learn good cross-lingual representations. However, if the two languages use different writing scripts, it is difficult for the model to make the connection. Thus, if we write the two languages in a single script, it is easier for the model to learn good cross-lingual representations.
 For many of the scripts currently in use, there are standard transliteration schemes to convert to the Latin script. In particular, for the Indic scripts the ISO-15919 transliteration scheme is designed to consistently transliterate texts written in different Indic scripts to the Latin script.
+An example of ISO-15919 transliteration for a piece of **Bangla** text is the following:
 **Original:** "রবীন্দ্রনাথ ঠাকুর এফআরএএস (৭ মে ১৮৬১ - ৭ আগস্ট ১৯৪১; ২৫ বৈশাখ ১২৬৮ - ২২ শ্রাবণ ১৩৪৮ বঙ্গাব্দ) ছিলেন অগ্রণী বাঙালি কবি, ঔপন্যাসিক, সংগীতস্রষ্টা, নাট্যকার, চিত্রকর, ছোটগল্পকার, প্রাবন্ধিক, অভিনেতা, কণ্ঠশিল্পী ও দার্শনিক।"
 **Transliterated:** 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli kabi, aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika.'
+Another example for a piece of **Hindi** text is the following:
+**Original:** "चूंकि मानव परिवार के सभी सदस्यों के जन्मजात गौरव और समान तथा अविच्छिन्न अधिकार की स्वीकृति ही विश्व-शान्ति, न्याय और स्वतन्त्रता की बुनियाद है"
+**Transliterated:** "cūṁki mānava parivāra kē sabhī sadasyōṁ kē janmajāta gaurava aura samāna tathā avicchinna adhikāra kī svīkr̥ti hī viśva-śānti, nyāya aura svatantratā kī buniyāda hai"
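To make the shared-script motivation concrete, here is a minimal sketch that compares how many characters two cognate words share before and after transliteration. The Bangla word and its transliteration come from the example above. The Hindi cognate `कवि` and its ISO-15919 form `kavi` are an illustrative assumption written by hand, not output produced by Aksharamukha. The helper `char_overlap` is hypothetical and is not part of this model or of any library.

```python
def char_overlap(a: str, b: str) -> float:
    """Jaccard similarity between the sets of characters in two strings."""
    set_a, set_b = set(a), set(b)
    return len(set_a & set_b) / len(set_a | set_b)


# "poet" in Bangla (taken from the example above) and in Hindi
# (hand-picked cognate, an assumption for illustration).
bangla_native, hindi_native = "কবি", "कवि"

# Hand-written ISO-15919 forms; not generated by running Aksharamukha.
bangla_iso, hindi_iso = "kabi", "kavi"

native_overlap = char_overlap(bangla_native, hindi_native)
iso_overlap = char_overlap(bangla_iso, hindi_iso)

print(native_overlap)  # 0.0: the two scripts share no code points
print(iso_overlap)     # 0.6: 'k', 'a', 'i' are shared out of 5 distinct characters
```

In native scripts the two words have nothing in common at the character level, while in a single script most characters match. This is the kind of surface similarity a subword tokenizer and the model can pick up on.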
 ## Training procedure
 pip install aksharamukha
 ```
+Using this library you can transliterate any text written in Indic scripts in the following way:
+```python
+>>> from aksharamukha import transliterate
+>>> text = "चूंकि मानव परिवार के सभी सदस्यों के जन्मजात गौरव और समान तथा अविच्छिन्न अधिकार की स्वीकृति ही विश्व-शान्ति, न्याय और स्वतन्त्रता की बुनियाद है"
+>>> transliterated_text = transliterate.process('autodetect', 'ISO', text)
+>>> transliterated_text
+"cūṁki mānava parivāra kē sabhī sadasyōṁ kē janmajāta gaurava aura samāna tathā avicchinna adhikāra kī svīkr̥ti hī viśva-śānti, nyāya aura svatantratā kī buniyāda hai"
+```
 Then you can use this model directly with a pipeline for masked language modeling:
 ```python