Update README.md

metrics:
- perplexity
pipeline_tag: fill-mask
widget:
- text: ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።
  example_title: Example 1
- text: ባለፉት አምስት ዓመታት የአውሮጳ ሀገራት የጦር [MASK] ግዢ በእጅጉ ጨምሯል።
  example_title: Example 2
---

# bert-mini-amharic

This model has the same architecture as [bert-mini](https://huggingface.co/prajjwal1/bert-mini) and was pretrained from scratch on the Amharic subsets of the [oscar](https://huggingface.co/datasets/oscar) and [mc4](https://huggingface.co/datasets/mc4) datasets, a total of `137 Million` tokens. The tokenizer was trained from scratch on the same text corpus and has a vocabulary size of 24k.
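
A quick way to verify the vocabulary size is to load the tokenizer with the standard `transformers` auto classes:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rasyosef/bert-mini-amharic")
# The tokenizer was trained from scratch, so expect a vocabulary of roughly 24k entries
print(len(tokenizer))
```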

It achieves the following results on the evaluation set:

- `Loss: 3.11`
- `Perplexity: 22.42`
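
These two numbers are consistent: perplexity is the exponential of the cross-entropy loss (in nats), so it can be recomputed directly from the reported loss:

```python
import math

# Perplexity = exp(cross-entropy loss)
print(math.exp(3.11))  # ≈ 22.42
```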

Even though this model has just `10.7 Million` parameters, its performance is only slightly behind that of the 26x larger, `279 Million` parameter [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) model on the same Amharic evaluation set.

# How to use
You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='rasyosef/bert-mini-amharic')
>>> unmasker("ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።")

[{'score': 0.6525624394416809,
  'token': 9617,
  'token_str': 'ዓመታት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመታት ተቆጥሯል ።'},
 {'score': 0.22671808302402496,
  'token': 9345,
  'token_str': 'ዓመት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመት ተቆጥሯል ።'},
 {'score': 0.07071439921855927,
  'token': 10898,
  'token_str': 'አመታት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመታት ተቆጥሯል ።'},
 {'score': 0.02838180586695671,
  'token': 9913,
  'token_str': 'አመት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመት ተቆጥሯል ።'},
 {'score': 0.006343209184706211,
  'token': 22459,
  'token_str': 'ዓመታትን',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመታትን ተቆጥሯል ።'}]
```
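
If you need the raw predictions rather than the pipeline output, here is a minimal sketch of the same top-5 lookup using the generic `AutoTokenizer`/`AutoModelForMaskedLM` classes (standard `transformers` API, shown as one possible approach):

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("rasyosef/bert-mini-amharic")
model = AutoModelForMaskedLM.from_pretrained("rasyosef/bert-mini-amharic")

text = "ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] token and rank the vocabulary for that position
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
probs = logits[0, mask_pos].softmax(dim=-1)
top = torch.topk(probs, k=5)
for score, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id)}\t{score:.4f}")
```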

# Fine-tuning
|