ibraheemmoosa committed on
Commit fff120b · 1 Parent(s): 51c76d2

Update README.md

  1. README.md +14 -3
README.md CHANGED
@@ -31,7 +31,6 @@ tags:
31
  - transliteration
32
  widget:
33
  - text : 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli [MASK], aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika. 1913 sālē gītāñjali kābyagranthēra iṁrēji anubādēra janya tini ēśīẏadēra madhyē sāhityē prathama [MASK] puraskāra lābha karēna.'
34
- - example_title : 'Rabindranath Tagore'
35
 
36
  co2_eq_emissions:
37
  emissions: "28.53 in grams of CO2"
@@ -77,7 +76,17 @@ These are the 14 languages we pretrain this model on:
77
  - Oriya
78
  - Panjabi
79
  - Sanskrit
80
- - Sinhala.
 
 
 
 
 
 
 
 
 
 
81
 
82
  ## Training procedure
83
 
@@ -110,6 +119,8 @@ We evaluated this model on the [IndicGLUE](https://huggingface.co/datasets/indic
110
 
111
  ## Intended uses & limitations
112
 
 
 
113
  You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to
114
  be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=xlmindic) to look for
115
  fine-tuned versions on a task that interests you.
@@ -160,7 +171,7 @@ Then you can use this model directly with a pipeline for masked language modelin
160
 
161
  ### Limitations and bias
162
 
163
- Even though train on a comparatively large multilingual corpus the model may exhibit harmful Gender, Ethnic and Political bias. If you fine-tune this model on a task where these issues are important you should take special care when relying on this model.
164
 
165
  ### BibTeX entry and citation info
166
 
 
31
  - transliteration
32
  widget:
33
  - text : 'rabīndranātha ṭhākura ēphaāraēēsa (7 mē 1861 - 7 āgasṭa 1941; 25 baiśākha 1268 - 22 śrābaṇa 1348 baṅgābda) chilēna agraṇī bāṅāli [MASK], aupanyāsika, saṁgītasraṣṭā, nāṭyakāra, citrakara, chōṭagalpakāra, prābandhika, abhinētā, kaṇṭhaśilpī ō dārśanika. 1913 sālē gītāñjali kābyagranthēra iṁrēji anubādēra janya tini ēśīẏadēra madhyē sāhityē prathama [MASK] puraskāra lābha karēna.'
 
34
 
35
  co2_eq_emissions:
36
  emissions: "28.53 in grams of CO2"
 
76
  - Oriya
77
  - Panjabi
78
  - Sanskrit
79
+ - Sinhala
80
+
81
+ ## Transliteration
82
+
83
+ The unique component of this model is that it takes in ISO-15919 transliterated text.
84
+
85
+ The motivation is as follows. When two languages share vocabulary, a machine learning model can exploit that to learn good cross-lingual representations. However, if the two languages use different writing scripts, it is difficult for the model to make the connection. Thus, if we can write the two languages in a single script, it is easier for the model to learn good cross-lingual representations.
86
+
87
+ For many of the scripts currently in use, there are standard transliteration schemes to convert to the Latin script. In particular, for the Indic scripts the ISO-15919 transliteration scheme is designed to consistently transliterate texts written in different Indic scripts to the Latin script.
88
+
89
+ This model has been trained on ISO-15919 transliterated text of various Indo-Aryan languages.
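
As a rough illustration of the motivation above, the sketch below maps the Hindi (Devanagari) and Bengali spellings of one shared word to the same Latin string. It is a toy example under stated assumptions: the function name `toy_transliterate` and the three-character mapping tables are hypothetical and cover only a tiny hand-written subset of ISO-15919. It is not the transliteration tool used to prepare this model's training data.

```python
# Toy illustration only: a hand-written subset of ISO-15919 covering two
# consonants and one vowel sign in Devanagari and Bengali. All names here
# are hypothetical and not part of the model's actual pipeline.

# Consonants carry an inherent vowel "a" unless a vowel sign follows.
CONSONANTS = {
    "\u0928": "n",  # Devanagari NA
    "\u092e": "m",  # Devanagari MA
    "\u09a8": "n",  # Bengali NA
    "\u09ae": "m",  # Bengali MA
}

# Dependent vowel signs replace the inherent "a" of the preceding consonant.
VOWEL_SIGNS = {
    "\u093e": "\u0101",  # Devanagari vowel sign AA -> "ā"
    "\u09be": "\u0101",  # Bengali vowel sign AA -> "ā"
}


def toy_transliterate(text: str) -> str:
    """Transliterate text using only the tiny tables above."""
    out = []
    for i, ch in enumerate(text):
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            following = text[i + 1] if i + 1 < len(text) else ""
            # Keep the inherent vowel unless a vowel sign overrides it.
            if following not in VOWEL_SIGNS:
                out.append("a")
        elif ch in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[ch])
        else:
            # Characters outside the toy tables pass through unchanged.
            out.append(ch)
    return "".join(out)


hindi_word = "\u0928\u093e\u092e"    # "name" written in Devanagari
bengali_word = "\u09a8\u09be\u09ae"  # "name" written in Bengali script

print(toy_transliterate(hindi_word))
print(toy_transliterate(bengali_word))
```

The two inputs share no characters, yet both outputs are the same Latin string, which is the kind of overlap a shared subword vocabulary can pick up. A real pipeline would rely on a complete ISO-15919 transliterator rather than tables like these.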
90
 
91
  ## Training procedure
92
 
 
119
 
120
  ## Intended uses & limitations
121
 
122
+ This model is pretrained on Indo-Aryan languages, so it is intended for downstream tasks in these languages. However, since Dravidian languages such as Malayalam, Telugu, and Kannada share a lot of vocabulary with the Indo-Aryan languages, this model can potentially be used on those languages too (after transliterating the text to ISO-15919).
123
+
124
  You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to
125
  be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=xlmindic) to look for
126
  fine-tuned versions on a task that interests you.
 
171
 
172
  ### Limitations and bias
173
 
174
+ Even though we pretrain on a comparatively large multilingual corpus, the model may exhibit harmful gender, ethnic, and political bias. If you fine-tune this model on a task where these issues are important, you should take special care when relying on the model to make decisions.
175
 
176
  ### BibTeX entry and citation info
177