|
--- |
|
language: |
|
- as |
|
- bn |
|
- gu |
|
- hi |
|
- mr |
|
- ne |
|
- or |
|
- pa |
|
- si |
|
- sa |
|
- bpy |
|
- mai |
|
- bh |
|
- gom |
|
license: apache-2.0 |
|
datasets: |
|
- oscar |
|
tags: |
|
- multilingual |
|
- albert |
|
- masked-language-modeling |
|
- sentence-order-prediction |
|
- fill-mask |
|
- xlmindic |
|
- exbert |
|
- nlp |
|
widget: |
|
- text : 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli [MASK], aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika. 1913 sālē gītāñjali kābyagranthēra iṁrēji anubādēra janya tini ēśīẏadēra madhyē sāhityē prathama [MASK] puraskāra lābha karēna.' |
|
|
|
co2_eq_emissions: |
|
emissions: "28.53 grams of CO2"
|
source: "calculated using this website https://mlco2.github.io/impact/#compute"
|
training_type: "pretraining" |
|
geographical_location: "NA" |
|
hardware_used: "TPUv3-8 for about 180 hours or 7.5 days" |
|
--- |
|
|
|
# XLMIndic Base Uniscript |
|
|
|
Pretrained [ALBERT](https://arxiv.org/abs/1909.11942) model on the [OSCAR](https://huggingface.co/datasets/oscar) corpus, covering 14 Indo-Aryan languages. The model was pretrained after transliterating the text to [ISO-15919](https://en.wikipedia.org/wiki/ISO_15919) format using the [Aksharamukha](https://pypi.org/project/aksharamukha/) library. A demo of the Aksharamukha library is hosted [here](https://aksharamukha.appspot.com/converter), where you can transliterate your text and then use it with this model in the inference widget.
|
|
|
## Model description |
|
|
|
This model has the same configuration as the [ALBERT Base v2 model](https://huggingface.co/albert-base-v2/). Specifically, this model has the following configuration: |
|
|
|
- 12 repeating layers |
|
- 128 embedding dimension |
|
- 768 hidden dimension |
|
- 12 attention heads |
|
- 11M parameters |
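
For reference, a minimal sketch of building an equivalent configuration with the `transformers` library (illustrative only; the released checkpoint already contains these settings, and the 50,000-token vocabulary is the one described under Training procedure below):

```python
from transformers import AlbertConfig, AlbertModel

# ALBERT Base v2 style configuration matching the numbers above.
config = AlbertConfig(
    vocab_size=50000,        # SentencePiece vocabulary size used for this model
    embedding_size=128,      # embedding dimension
    hidden_size=768,         # hidden dimension
    num_hidden_layers=12,    # repeating layers (parameters are shared across them)
    num_attention_heads=12,  # attention heads
)

# Randomly initialized model with this configuration; use
# AlbertModel.from_pretrained('ibraheemmoosa/xlmindic-base-uniscript')
# to load the pretrained weights instead.
model = AlbertModel(config)
```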
|
|
|
## Training data |
|
|
|
This model was pretrained on the [OSCAR](https://huggingface.co/datasets/oscar) dataset, a medium-sized multilingual corpus containing text from 163 languages. We selected a subset of 14 languages based on the following criteria:
|
- Belongs to the [Indo-Aryan language family](https://en.wikipedia.org/wiki/Indo-Aryan_languages). |
|
- Uses a [Brahmic script](https://en.wikipedia.org/wiki/Brahmic_scripts). |
|
|
|
These are the 14 languages we pretrain this model on: |
|
- Assamese |
|
- Bangla |
|
- Bihari |
|
- Bishnupriya Manipuri |
|
- Goan Konkani |
|
- Gujarati |
|
- Hindi |
|
- Maithili |
|
- Marathi |
|
- Nepali |
|
- Oriya |
|
- Panjabi |
|
- Sanskrit |
|
- Sinhala
|
|
|
## Training procedure |
|
|
|
### Preprocessing |
|
|
|
The texts are transliterated to ISO-15919 format using the Aksharamukha library and then tokenized using SentencePiece with a vocabulary size of 50,000. The inputs of the model are then of the form:
|
``` |
|
[CLS] Sentence A [SEP] Sentence B [SEP] |
|
``` |
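
As an illustration of this preprocessing flow (the Bengali example sentences below are arbitrary; the actual pretraining pipeline runs over the full OSCAR corpus):

```python
from aksharamukha import transliterate
from transformers import AutoTokenizer

# Load the model's SentencePiece tokenizer.
tokenizer = AutoTokenizer.from_pretrained('ibraheemmoosa/xlmindic-base-uniscript')

# Transliterate two example Bengali sentences to ISO-15919.
sentence_a = transliterate.process('Bengali', 'ISO', 'আমি বাংলায় গান গাই।')
sentence_b = transliterate.process('Bengali', 'ISO', 'আমার সোনার বাংলা।')

# Passing a sentence pair produces the [CLS] Sentence A [SEP] Sentence B [SEP] format.
encoding = tokenizer(sentence_a, sentence_b)
print(tokenizer.decode(encoding['input_ids']))
```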
|
|
|
### Training |
|
|
|
The training objectives are the same as for the original ALBERT: masked language modeling (MLM) and sentence order prediction (SOP).
|
The details of the masking procedure for each sentence are the following: |
|
- 15% of the tokens are masked. |
|
- In 80% of the cases, the masked tokens are replaced by `[MASK]`. |
|
- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
|
- In the 10% remaining cases, the masked tokens are left as is. |
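
A minimal sketch of this masking scheme (illustrative only; it shows the 80/10/10 split over the 15% selected tokens rather than the exact pretraining code):

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15):
    """Apply the 15% selection with the 80/10/10 replacement scheme."""
    masked = list(token_ids)
    for i, original in enumerate(masked):
        if random.random() < mask_prob:      # select ~15% of the tokens
            r = random.random()
            if r < 0.8:                      # 80%: replace with [MASK]
                masked[i] = mask_id
            elif r < 0.9:                    # 10%: replace with a different random token
                replacement = random.randrange(vocab_size)
                while replacement == original:
                    replacement = random.randrange(vocab_size)
                masked[i] = replacement
            # remaining 10%: keep the original token unchanged
    return masked
```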
|
|
|
The details of the sentence order prediction example generation procedure for each sentence are the following: |
|
- Split the sentence into two parts A and B at a random index. |
|
- With 50% probability swap the two parts. |
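
A sketch of how such an example can be generated (illustrative only; `tokens` stands for the token sequence of a sentence and must contain at least two tokens):

```python
import random

def make_sop_example(tokens):
    """Split a sentence into segments A and B and swap them with 50% probability."""
    split = random.randint(1, len(tokens) - 1)   # random split index
    segment_a, segment_b = tokens[:split], tokens[split:]
    if random.random() < 0.5:                    # 50%: swap the two parts
        return segment_b, segment_a, 1           # label 1: swapped order
    return segment_a, segment_b, 0               # label 0: original order
```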
|
|
|
The model was pretrained on a TPUv3-8 for 1M steps. Checkpoints are available every 10k steps; we will upload these in the future.
|
|
|
## Evaluation results |
|
We evaluated this model on the [IndicGLUE](https://huggingface.co/datasets/indic_glue) benchmark dataset. |
|
|
|
## Intended uses & limitations |
|
|
|
You can use the raw model for either masked language modeling or sentence order prediction, but it's mostly intended to
|
be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=xlmindic) to look for |
|
fine-tuned versions on a task that interests you. |
|
Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) |
|
to make decisions, such as sequence classification, token classification or question answering. For tasks such as text |
|
generation, you should look at models like GPT-2.
|
|
|
### How to use |
|
|
|
To use this model, you first need to install the [Aksharamukha](https://pypi.org/project/aksharamukha/) library.
|
|
|
```bash |
|
pip install aksharamukha |
|
``` |
|
|
|
Then you can use this model directly with a pipeline for masked language modeling: |
|
|
|
```python |
|
>>> from transformers import pipeline |
|
>>> from aksharamukha import transliterate |
|
>>> unmasker = pipeline('fill-mask', model='ibraheemmoosa/xlmindic-base-uniscript') |
|
>>> text = "রবীন্দ্রনাথ ঠাকুর এফআরএএস (৭ মে ১৮৬১ - ৭ আগস্ট ১৯৪১; ২৫ বৈশাখ ১২৬৮ - ২২ শ্রাবণ ১৩৪৮ বঙ্গাব্দ) ছিলেন অগ্রণী বাঙালি [MASK], ঔপন্যাসিক, সংগীতস্রষ্টা, নাট্যকার, চিত্রকর, ছোটগল্পকার, প্রাবন্ধিক, অভিনেতা, কণ্ঠশিল্পী ও দার্শনিক। ১৯১৩ সালে গীতাঞ্জলি কাব্যগ্রন্থের ইংরেজি অনুবাদের জন্য তিনি এশীয়দের মধ্যে সাহিত্যে প্রথম নোবেল পুরস্কার লাভ করেন।" |
|
>>> transliterated_text = transliterate.process('Bengali', 'ISO', text) |
|
>>> transliterated_text |
|
'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli [MASK], aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika. 1913 sālē gītāñjali kābyagranthēra iṁrēji anubādēra janya tini ēśīẏadēra madhyē sāhityē prathama [MASK] puraskāra lābha karēna.' |
|
>>> unmasker(transliterated_text) |
|
[{'score': 0.39705055952072144, |
|
'token': 1500, |
|
'token_str': 'abhinētā', |
|
'sequence': 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli abhinētā, aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika. 1913 sālē gītāñjali kābyagranthēra iṁrēji anubādēra janya tini ēśīẏadēra madhyē sāhityē prathama nōbēla puraskāra lābha karēna.'}, |
|
{'score': 0.20499080419540405, |
|
'token': 3585, |
|
'token_str': 'kabi', |
|
'sequence': 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli kabi, aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika. 1913 sālē gītāñjali kābyagranthēra iṁrēji anubādēra janya tini ēśīẏadēra madhyē sāhityē prathama nōbēla puraskāra lābha karēna.'}, |
|
{'score': 0.1314290314912796, |
|
'token': 15402, |
|
'token_str': 'rājanētā', |
|
'sequence': 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli rājanētā, aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika. 1913 sālē gītāñjali kābyagranthēra iṁrēji anubādēra janya tini ēśīẏadēra madhyē sāhityē prathama nōbēla puraskāra lābha karēna.'}, |
|
{'score': 0.060830358415842056, |
|
'token': 3212, |
|
'token_str': 'kalākāra', |
|
'sequence': 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli kalākāra, aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika. 1913 sālē gītāñjali kābyagranthēra iṁrēji anubādēra janya tini ēśīẏadēra madhyē sāhityē prathama nōbēla puraskāra lābha karēna.'}, |
|
{'score': 0.035522934049367905, |
|
'token': 11586, |
|
'token_str': 'sāhityakāra', |
|
'sequence': 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli sāhityakāra, aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika. 1913 sālē gītāñjali kābyagranthēra iṁrēji anubādēra janya tini ēśīẏadēra madhyē sāhityē prathama nōbēla puraskāra lābha karēna.'}] |
|
``` |
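
The model can also be loaded for fine-tuning on a downstream task. A minimal sketch, assuming a sequence classification task (the number of labels below is a placeholder):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('ibraheemmoosa/xlmindic-base-uniscript')
model = AutoModelForSequenceClassification.from_pretrained(
    'ibraheemmoosa/xlmindic-base-uniscript',
    num_labels=3,  # placeholder: set this to the number of classes in your task
)

# Remember to transliterate your inputs to ISO-15919 with Aksharamukha
# before tokenizing, exactly as in the fill-mask example above.
```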
|
|
|
### Limitations and bias |
|
|
|
Even though the model was trained on a comparatively large multilingual corpus, it may exhibit harmful gender, ethnic, and political bias. If you fine-tune this model on a task where these issues are important, you should take special care when relying on the model.
|
|
|
### BibTeX entry and citation info |
|
|
|
Coming soon! |
|
|
|
|
|
|