metadata
language:
- am
library_name: transformers
datasets:
- oscar
- mc4
metrics:
- perplexity
pipeline_tag: fill-mask
widget:
- text: ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።
  example_title: Example 1
- text: ባለፉት አምስት ዓመታት የአውሮጳ ሀገራት የጦር [MASK] ግዢ በእጅጉ ጨምሯል።
  example_title: Example 2
- text: >-
    ኬንያውያን ከዳር እስከዳር በአንድ ቆመው የተቃውሞ ድምጻቸውን ማሰማታቸውን ተከትሎ የዜጎችን ቁጣ የቀሰቀሰው የቀረጥ
    ጭማሪ ሕግ ትናንት በፕሬዝደንት ዊልያም ሩቶ [MASK] ቢደረግም ዛሬም ግን የተቃውሞው እንቅስቃሴ መቀጠሉ እየተነገረ
    ነው።
  example_title: Example 3
- text: >-
    ተማሪዎቹ በውድድሩ ካሸነፉበት የፈጠራ ስራ መካከል [MASK] እና ቅዝቃዜን እንደአየር ሁኔታው የሚያስተካክል ጃኬት
    አንዱ ነው።
  example_title: Example 4
bert-mini-amharic
This model has the same architecture as bert-mini and was pretrained from scratch on the Amharic subsets of the oscar and mc4 datasets, a total of 137 million tokens. The tokenizer was trained from scratch on the same text corpus and has a vocabulary size of 24k.
It achieves the following results on the evaluation set:
Loss: 3.11
Perplexity: 22.42
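The two figures above are consistent with each other. Assuming the reported loss is the mean masked-token cross-entropy in nats (the usual convention, though the card does not state it), perplexity is simply its exponential. A quick check, with `eval_loss` as an illustrative variable name:

```python
import math

# Evaluation loss reported above; assumed to be mean cross-entropy in nats.
eval_loss = 3.11

# Perplexity is the exponential of the mean cross-entropy loss.
perplexity = math.exp(eval_loss)

print(f"{perplexity:.2f}")  # 22.42
```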
Even though this model has only 10.7 million parameters, its performance is only slightly behind that of the 26x larger, 279-million-parameter xlm-roberta-base model on the same Amharic evaluation set.
How to use
You can use this model directly with a pipeline for masked language modeling:
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='rasyosef/bert-mini-amharic')
>>> unmasker("ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።")
[{'score': 0.6525624394416809,
'token': 9617,
'token_str': 'ዓመታት',
'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመታት ተቆጥሯል ።'},
{'score': 0.22671808302402496,
'token': 9345,
'token_str': 'ዓመት',
'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመት ተቆጥሯል ።'},
{'score': 0.07071439921855927,
'token': 10898,
'token_str': 'አመታት',
'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመታት ተቆጥሯል ።'},
{'score': 0.02838180586695671,
'token': 9913,
'token_str': 'አመት',
'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመት ተቆጥሯል ።'},
{'score': 0.006343209184706211,
'token': 22459,
'token_str': 'ዓመታትን',
'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመታትን ተቆጥሯል ።'}]
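The pipeline returns a list of dicts with the keys `score`, `token`, `token_str` and `sequence`, as in the output above. A minimal sketch of post-processing that list is shown here; the helper name `confident_tokens` and its `min_score` threshold are hypothetical and not part of transformers, and the sample data is copied from the output above so the snippet runs without downloading the model.

```python
from typing import Dict, List


def confident_tokens(predictions: List[Dict], min_score: float = 0.05) -> List[str]:
    """Return predicted token strings scoring at least min_score, best first."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [p["token_str"] for p in ranked if p["score"] >= min_score]


# Scores and tokens copied from the pipeline output above
# (the 'token' and 'sequence' keys are omitted for brevity).
sample = [
    {"score": 0.6525624394416809, "token_str": "ዓመታት"},
    {"score": 0.22671808302402496, "token_str": "ዓመት"},
    {"score": 0.07071439921855927, "token_str": "አመታት"},
    {"score": 0.02838180586695671, "token_str": "አመት"},
    {"score": 0.006343209184706211, "token_str": "ዓመታትን"},
]

print(confident_tokens(sample))  # ['ዓመታት', 'ዓመት', 'አመታት']
```

With a live pipeline, the same helper could be applied as `confident_tokens(unmasker(text))`. If only the number of candidates matters, the fill-mask pipeline's `top_k` argument is likely the simpler option.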
Finetuning
This model was finetuned and evaluated on the following Amharic NLP tasks:
- Sentiment Classification
- Named Entity Recognition
- News Category Classification
Finetuned Model Performance
The reported F1 scores are macro averages.
Model | Size (# params) | Perplexity | Sentiment (F1) | Named Entity Recognition (F1) |
---|---|---|---|---|
bert-medium-amharic | 40.5M | 13.74 | 0.83 | 0.68 |
bert-small-amharic | 27.8M | 15.96 | 0.83 | 0.68 |
bert-mini-amharic | 10.7M | 22.42 | 0.81 | 0.64 |
bert-tiny-amharic | 4.18M | 71.52 | 0.79 | 0.54 |
xlm-roberta-base | 279M | | 0.83 | 0.73 |
am-roberta | 443M | | 0.82 | 0.69 |
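The F1 scores in the table above are macro averages: F1 is computed separately for each class and the per-class values are averaged with equal weight, so rare classes count as much as frequent ones. The sketch below illustrates the computation in plain Python. It is not the evaluation code used for these results, and the labels are toy values made up for illustration.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over every class seen in y_true or y_pred."""
    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels (hypothetical, not taken from the evaluation data).
y_true = ["pos", "pos", "pos", "neg", "neg", "neu"]
y_pred = ["pos", "pos", "neg", "neg", "neg", "pos"]

print(round(macro_f1(y_true, y_pred), 3))  # 0.489
```

In this toy case the per-class F1 values are about 0.667 (pos), 0.8 (neg) and 0 (neu), so the single missed "neu" example pulls the macro average well below the plain accuracy of 4/6.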
Amharic News Category Classification
Model | Size (# params) | Accuracy | Precision | Recall | F1 |
---|---|---|---|---|---|
bert-small-amharic | 25.7M | 0.89 | 0.86 | 0.87 | 0.86 |
bert-mini-amharic | 9.67M | 0.87 | 0.83 | 0.83 | 0.83 |
xlm-roberta-base | 279M | 0.9 | 0.88 | 0.88 | 0.88 |