rasyosef/bert-mini-amharic

@@ -9,18 +9,20 @@ metrics:
 - perplexity
 pipeline_tag: fill-mask
 widget:
-- text: ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።
-  example_title: Example
 ---
 # bert-mini-amharic
 This model has the same architecture as [bert-mini](https://huggingface.co/prajjwal1/bert-mini) and was pretrained from scratch using the Amharic subsets of the [oscar](https://huggingface.co/datasets/oscar) and [mc4](https://huggingface.co/datasets/mc4) datasets, on a total of `137 Million` tokens. The tokenizer was trained from scratch on the same text corpus and has a vocabulary size of 24k.
 It achieves the following results on the evaluation set:
-- `Loss: 3.57`
-- `Perplexity: 35.52`
-Even though this model only has `9.7 Million` parameters, its performance is only slightly behind the 28x larger `279 Million` parameter [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) model on the same Amharic evaluation set.
 # How to use
 You can use this model directly with a pipeline for masked language modeling:
@@ -30,26 +32,26 @@ You can use this model directly with a pipeline for masked language modeling:
 >>> unmasker = pipeline('fill-mask', model='rasyosef/bert-mini-amharic')
 >>> unmasker("ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።")
-[{'score': 0.4713546335697174,
-  'token': 9308,
-  'token_str': 'ዓመት',
-  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመት ተቆጥሯል ።'},
- {'score': 0.25726795196533203,
-  'token': 9540,
   'token_str': 'ዓመታት',
   'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመታት ተቆጥሯል ።'},
- {'score': 0.07067586481571198,
-  'token': 10354,
-  'token_str': 'አመት',
-  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመት ተቆጥሯል ።'},
- {'score': 0.07064681500196457,
-  'token': 11212,
   'token_str': 'አመታት',
   'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመታት ተቆጥሯል ።'},
- {'score': 0.012558948248624802,
-  'token': 10588,
-  'token_str': 'ወራት',
-  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ወራት ተቆጥሯል ።'}]
 ```
 # Fine-tuning

 - perplexity
 pipeline_tag: fill-mask
 widget:
+  - text: ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።
+    example_title: Example 1
+  - text: ባለፉት አምስት ዓመታት የአውሮጳ ሀገራት የጦር [MASK] ግዢ በእጅጉ ጨምሯል።
+    example_title: Example 2
 ---
 # bert-mini-amharic
 This model has the same architecture as [bert-mini](https://huggingface.co/prajjwal1/bert-mini) and was pretrained from scratch using the Amharic subsets of the [oscar](https://huggingface.co/datasets/oscar) and [mc4](https://huggingface.co/datasets/mc4) datasets, on a total of `137 Million` tokens. The tokenizer was trained from scratch on the same text corpus and has a vocabulary size of 24k.
 It achieves the following results on the evaluation set:
+- `Loss: 3.11`
+- `Perplexity: 22.42`
+Even though this model only has `10.7 Million` parameters, its performance is only slightly behind the 26x larger `279 Million` parameter [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) model on the same Amharic evaluation set.
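The two reported metrics are linked. Assuming the loss is the mean cross-entropy in nats (the usual convention for masked language modeling evaluation, though the card does not state it), perplexity is its exponential. A minimal sketch that checks the figures quoted above; the variable names are illustrative only:

```python
import math

# Figures quoted in the model card.
eval_loss = 3.11          # assumed to be mean cross-entropy in nats
model_params = 10.7e6     # bert-mini-amharic
baseline_params = 279e6   # xlm-roberta-base

# Perplexity is the exponential of the mean cross-entropy loss.
perplexity = math.exp(eval_loss)

# Size ratio behind the "26x larger" comparison.
size_ratio = baseline_params / model_params

print(f"perplexity = {perplexity:.2f}")
print(f"size ratio = {size_ratio:.1f}x")
```

Because the loss is reported to two decimals, the recomputed perplexity can differ from the reported one in the last digit.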
 # How to use
 You can use this model directly with a pipeline for masked language modeling:
 >>> unmasker = pipeline('fill-mask', model='rasyosef/bert-mini-amharic')
 >>> unmasker("ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።")
+[{'score': 0.6525624394416809,
+  'token': 9617,
   'token_str': 'ዓመታት',
   'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመታት ተቆጥሯል ።'},
+ {'score': 0.22671808302402496,
+  'token': 9345,
+  'token_str': 'ዓመት',
+  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመት ተቆጥሯል ።'},
+ {'score': 0.07071439921855927,
+  'token': 10898,
   'token_str': 'አመታት',
   'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመታት ተቆጥሯል ።'},
+ {'score': 0.02838180586695671,
+  'token': 9913,
+  'token_str': 'አመት',
+  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመት ተቆጥሯል ።'},
+ {'score': 0.006343209184706211,
+  'token': 22459,
+  'token_str': 'ዓመታትን',
+  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመታትን ተቆጥሯል ።'}]
 ```
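The pipeline returns a list of dictionaries with `score`, `token`, `token_str`, and `sequence` keys, as in the output above. As a sketch, the helper below (`top_tokens` is a hypothetical name, not part of `transformers`) keeps only the candidates above a probability threshold. It is exercised on values copied from the output above, so it runs without downloading the model:

```python
from typing import Dict, List, Tuple


def top_tokens(predictions: List[Dict], min_score: float = 0.05) -> List[Tuple[str, float]]:
    """Return (token_str, score) pairs with score >= min_score, highest first."""
    kept = [(p["token_str"], p["score"]) for p in predictions if p["score"] >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Scores and tokens copied from the pipeline output shown above.
sample_output = [
    {"score": 0.6525624394416809, "token_str": "ዓመታት"},
    {"score": 0.22671808302402496, "token_str": "ዓመት"},
    {"score": 0.07071439921855927, "token_str": "አመታት"},
    {"score": 0.02838180586695671, "token_str": "አመት"},
    {"score": 0.006343209184706211, "token_str": "ዓመታትን"},
]

candidates = top_tokens(sample_output)
for token, score in candidates:
    print(f"{token}\t{score:.3f}")
```

With the model loaded, the same helper can be applied to live output, for example `top_tokens(unmasker(text, top_k=10))`, where `top_k` is the fill-mask pipeline argument that controls how many candidates are returned.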
 # Fine-tuning