tags:
- masked-language-modeling
- sentence-order-prediction
- fill-mask
- xlmindic
- exbert
- nlp
widget:
- text: 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli [MASK], aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika. 1913 sālē gītāñjali kābyagranthēra iṁrēji anubādēra janya tini ēśīẏadēra madhyē sāhityē prathama [MASK] puraskāra lābha karēna.'
co2_eq_emissions:
  emissions: "28.53 grams of CO2"
  source: "Calculated using https://mlco2.github.io/impact/#compute"
  training_type: "pretraining"
  geographical_location: "NA"
  hardware_used: "TPUv3-8 for about 180 hours (7.5 days)"
---

# XLMIndic Base Uniscript

Pretrained [ALBERT](https://arxiv.org/abs/1909.11942) model on the [OSCAR](https://huggingface.co/datasets/oscar) corpus, covering 14 Indo-Aryan languages. Like ALBERT, it was pretrained using masked language modeling (MLM) and sentence order prediction (SOP) objectives. The model was pretrained after transliterating the text to [ISO-15919](https://en.wikipedia.org/wiki/ISO_15919) format using the [Aksharamukha](https://pypi.org/project/aksharamukha/) library. A demo of the Aksharamukha library is hosted [here](https://aksharamukha.appspot.com/converter), where you can transliterate your text and try it on our model in the inference widget.

## Model description

This model has the same configuration as the [ALBERT Base v2 model](https://huggingface.co/albert-base-v2/). Specifically, this model has the following configuration:

- 12 repeating layers
- 128 embedding dimension
- 768 hidden dimension
- 12 attention heads
- 11M parameters

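For reference, this configuration can be sketched with the `transformers` `AlbertConfig` class. This is only an illustrative sketch assembled from the numbers above (plus the 50,000 SentencePiece vocabulary mentioned under Preprocessing below), not the exact config file shipped with the checkpoint:

```python
from transformers import AlbertConfig, AlbertModel

# Illustrative configuration built from the figures listed above; these are
# assumed values, not the config.json distributed with this repository.
config = AlbertConfig(
    vocab_size=50000,        # SentencePiece vocabulary (see Preprocessing below)
    embedding_size=128,      # embedding dimension
    hidden_size=768,         # hidden dimension
    num_hidden_layers=12,    # repeating layers
    num_attention_heads=12,  # attention heads
)

# Instantiate an ALBERT encoder with this shape (weights randomly initialized).
model = AlbertModel(config)
```
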
## Training data

This model was pretrained on the [OSCAR](https://huggingface.co/datasets/oscar) dataset, a medium-sized multilingual corpus containing text from 163 languages. We selected a subset of 14 languages based on the following criteria:

- Belongs to the [Indo-Aryan language family](https://en.wikipedia.org/wiki/Indo-Aryan_languages).
- Uses a [Brahmic script](https://en.wikipedia.org/wiki/Brahmic_scripts).

These are the 14 languages we pretrained this model on:

- Assamese
- Bangla
- Bihari
- Bishnupriya Manipuri
- Goan Konkani
- Gujarati
- Hindi
- Maithili
- Marathi
- Nepali
- Oriya
- Panjabi
- Sanskrit
- Sinhala

## Training procedure

### Preprocessing

The texts are transliterated to ISO-15919 format using the Aksharamukha library and then tokenized using SentencePiece with a vocabulary size of 50,000. The inputs of the model are then of the form:

```
[CLS] Sentence A [SEP] Sentence B [SEP]
```

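As a quick illustration of this input format, the tokenizer that ships with this checkpoint can encode a (transliterated) sentence pair and insert the special tokens itself; the two sentences below are arbitrary placeholders:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('ibraheemmoosa/xlmindic-base-uniscript')

# Encode a sentence pair; the tokenizer adds [CLS] and [SEP] automatically,
# so the decoded input has the form: [CLS] sentence a [SEP] sentence b [SEP]
encoded = tokenizer('sentence a', 'sentence b')
print(tokenizer.decode(encoded['input_ids']))
```
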
### Training

The training objective is the same as for the original ALBERT. The details of the masking procedure for each sentence are the following (see the sketch after this list):

- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `[MASK]`.
- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
- In the remaining 10% of cases, the masked tokens are left as is.

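A rough sketch of that 15%/80%/10%/10% procedure, written as standalone Python purely for illustration (not the actual pretraining code; special tokens and whole-word handling are ignored):

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mlm_prob=0.15):
    """Illustrative 80/10/10 masking; not the exact pretraining implementation."""
    masked = list(token_ids)
    labels = [-100] * len(token_ids)      # -100 = position ignored by the MLM loss
    for i, tok in enumerate(token_ids):
        if random.random() < mlm_prob:    # 15% of tokens are selected for masking
            labels[i] = tok               # the model must predict the original token here
            r = random.random()
            if r < 0.8:                   # 80% of those: replace with [MASK]
                masked[i] = mask_id
            elif r < 0.9:                 # 10%: replace with a random token
                masked[i] = random.randrange(vocab_size)
            # remaining 10%: keep the original token unchanged
    return masked, labels
```
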
The details of the sentence order prediction example generation procedure for each sentence are the following (see the sketch after this list):

- Split the sentence into two parts A and B at a random index.
- With 50% probability, swap the two parts.

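And a matching sketch of the SOP example generation, under the assumption that a sentence is represented as a list of tokens:

```python
import random

def make_sop_example(tokens):
    """Illustrative SOP pair generation; not the exact pretraining implementation."""
    split = random.randrange(1, len(tokens))   # random split index
    part_a, part_b = tokens[:split], tokens[split:]
    if random.random() < 0.5:                  # with 50% probability, swap the parts
        part_a, part_b = part_b, part_a
        label = 1                              # 1 = segments are out of order
    else:
        label = 0                              # 0 = segments are in the original order
    return part_a, part_b, label
```
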
The model was pretrained on a TPUv3-8 for 1M steps. We have checkpoints available every 10k steps and will upload these in the future.

## Evaluation results

We evaluated this model on the [IndicGLUE](https://huggingface.co/datasets/indic_glue) benchmark dataset.

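For anyone who wants to reproduce such an evaluation, the benchmark is available through the `datasets` library. The config name below is just one illustrative task (check the dataset card for the full list), and the text still has to be transliterated to ISO-15919 before it reaches the model:

```python
from datasets import load_dataset

# One illustrative IndicGLUE task config; see the dataset card for all configs.
dataset = load_dataset('indic_glue', 'wnli.hi')
print(dataset['train'][0])
# Note: the text fields must be transliterated to ISO-15919 (e.g. with
# Aksharamukha) before tokenization, exactly as in the usage example below.
```
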
## Intended uses & limitations

You can use the raw model for either masked language modeling or sentence order prediction, but it is mostly intended to be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=xlmindic) to look for fine-tuned versions on a task that interests you.

Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation you should look at a model like GPT-2.

### How to use

To use this model you will need to first install the [Aksharamukha](https://pypi.org/project/aksharamukha/) library.

```bash
pip install aksharamukha
```

Then you can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> from aksharamukha import transliterate
>>> unmasker = pipeline('fill-mask', model='ibraheemmoosa/xlmindic-base-uniscript')
>>> text = "রবীন্দ্রনাথ ঠাকুর এফআরএএস (৭ মে ১৮৬১ - ৭ আগস্ট ১৯৪১; ২৫ বৈশাখ ১২৬৮ - ২২ শ্রাবণ ১৩৪৮ বঙ্গাব্দ) ছিলেন অগ্রণী বাঙালি [MASK], ঔপন্যাসিক, সংগীতস্রষ্টা, নাট্যকার, চিত্রকর, ছোটগল্পকার, প্রাবন্ধিক, অভিনেতা, কণ্ঠশিল্পী ও দার্শনিক। ১৯১৩ সালে গীতাঞ্জলি কাব্যগ্রন্থের ইংরেজি অনুবাদের জন্য তিনি এশীয়দের মধ্যে সাহিত্যে প্রথম নোবেল পুরস্কার লাভ করেন।"
>>> transliterated_text = transliterate.process('Bengali', 'ISO', text)
>>> transliterated_text
'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli [MASK], aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika. 1913 sālē gītāñjali kābyagranthēra iṁrēji anubādēra janya tini ēśīẏadēra madhyē sāhityē prathama nōbēla puraskāra lābha karēna.'
>>> unmasker(transliterated_text)
[{'score': 0.39705055952072144,
  'token': 1500,
  'token_str': 'abhinētā',
  'sequence': 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli abhinētā, aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika. 1913 sālē gītāñjali kābyagranthēra iṁrēji anubādēra janya tini ēśīẏadēra madhyē sāhityē prathama nōbēla puraskāra lābha karēna.'},
 {'score': 0.20499080419540405,
  'token': 3585,
  'token_str': 'kabi',
  'sequence': 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli kabi, aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika. 1913 sālē gītāñjali kābyagranthēra iṁrēji anubādēra janya tini ēśīẏadēra madhyē sāhityē prathama nōbēla puraskāra lābha karēna.'},
 {'score': 0.1314290314912796,
  'token': 15402,
  'token_str': 'rājanētā',
  'sequence': 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli rājanētā, aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika. 1913 sālē gītāñjali kābyagranthēra iṁrēji anubādēra janya tini ēśīẏadēra madhyē sāhityē prathama nōbēla puraskāra lābha karēna.'},
 {'score': 0.060830358415842056,
  'token': 3212,
  'token_str': 'kalākāra',
  'sequence': 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli kalākāra, aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika. 1913 sālē gītāñjali kābyagranthēra iṁrēji anubādēra janya tini ēśīẏadēra madhyē sāhityē prathama nōbēla puraskāra lābha karēna.'},
 {'score': 0.035522934049367905,
  'token': 11586,
  'token_str': 'sāhityakāra',
  'sequence': 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli sāhityakāra, aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika. 1913 sālē gītāñjali kābyagranthēra iṁrēji anubādēra janya tini ēśīẏadēra madhyē sāhityē prathama nōbēla puraskāra lābha karēna.'}]
```
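
Beyond the fill-mask pipeline, the checkpoint can also be loaded with the generic Auto classes for fine-tuning or feature extraction. The following is only a minimal sketch; the two-label classification head and the example sentence are placeholders, not part of this repository:

```python
from aksharamukha import transliterate
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('ibraheemmoosa/xlmindic-base-uniscript')
# num_labels=2 is an arbitrary placeholder for some downstream classification
# task; the classification head is newly initialized and needs fine-tuning.
model = AutoModelForSequenceClassification.from_pretrained(
    'ibraheemmoosa/xlmindic-base-uniscript', num_labels=2)

# Remember to transliterate inputs to ISO-15919 before tokenizing.
text = transliterate.process('Bengali', 'ISO', 'এটি একটি উদাহরণ বাক্য।')
inputs = tokenizer(text, return_tensors='pt')
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 2])
```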

### Limitations and bias

Even though it is trained on a comparatively large multilingual corpus, the model may exhibit harmful gender, ethnic and political biases. If you fine-tune this model on a task where these issues are important, you should take special care when relying on this model.

### BibTeX entry and citation info

Coming soon!