Commit 1dac90a (verified), committed by rasyosef
1 Parent(s): fe4cbba

Update README.md

Files changed (1)
  1. README.md +23 -21
README.md CHANGED
@@ -9,18 +9,20 @@ metrics:
  - perplexity
  pipeline_tag: fill-mask
  widget:
- - text: ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።
- example_title: Example
  ---

  # bert-mini-amharic

  This model has the same architecture as [bert-mini](https://huggingface.co/prajjwal1/bert-mini) and was pretrained from scratch using the Amharic subsets of the [oscar](https://huggingface.co/datasets/oscar) and [mc4](https://huggingface.co/datasets/mc4) datasets, on a total of `137 Million` tokens. The tokenizer was trained from scratch on the same text corpus, and had a vocabulary size of 24k.
  It achieves the following results on the evaluation set:
- - `Loss: 3.57`
- - `Perplexity: 35.52`

- Even though this model only has `9.7 Million` parameters, its performance is only slightly behind the 28x larger `279 Million` parameter [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) model on the same Amharic evaluation set.
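The loss and perplexity figures on both sides of this hunk can be cross-checked, assuming the reported perplexity is the exponential of the evaluation loss (the usual convention for a mean cross-entropy loss in nats; the README does not state this explicitly). The helper names below are illustrative and not part of the repository.

```python
import math


def perplexity_from_loss(loss: float) -> float:
    """Perplexity implied by a mean cross-entropy loss in nats (assumed convention)."""
    return math.exp(loss)


def size_ratio(large_params_m: float, small_params_m: float) -> float:
    """How many times larger one model is than another (parameter counts in millions)."""
    return large_params_m / small_params_m


# (loss, reported perplexity) pairs copied from the removed and added README lines.
reported = {"before": (3.57, 35.52), "after": (3.11, 22.42)}
for label, (loss, ppl) in reported.items():
    print(f"{label}: exp({loss}) = {perplexity_from_loss(loss):.2f}, README reports {ppl}")

# Parameter counts in millions, copied from the README text.
for label, params in {"before": 9.7, "after": 10.7}.items():
    print(f"{label}: xlm-roberta-base is about {size_ratio(279, params):.1f}x larger")
```

Under that assumption both loss/perplexity pairs agree to two decimal places, and 279 / 10.7 comes out at roughly 26, in line with the updated sentence.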
  
  # How to use
  You can use this model directly with a pipeline for masked language modeling:
@@ -30,26 +32,26 @@ You can use this model directly with a pipeline for masked language modeling:
  >>> unmasker = pipeline('fill-mask', model='rasyosef/bert-mini-amharic')
  >>> unmasker("ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።")

- [{'score': 0.4713546335697174,
- 'token': 9308,
- 'token_str': 'ዓመት',
- 'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመት ተቆጥሯል ።'},
- {'score': 0.25726795196533203,
- 'token': 9540,
  'token_str': 'ዓመታት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመታት ተቆጥሯል ።'},
- {'score': 0.07067586481571198,
- 'token': 10354,
- 'token_str': 'አመት',
- 'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመት ተቆጥሯል ።'},
- {'score': 0.07064681500196457,
- 'token': 11212,
  'token_str': 'አመታት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመታት ተቆጥሯል ።'},
- {'score': 0.012558948248624802,
- 'token': 10588,
- 'token_str': 'ወራት',
- 'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ወራት ተቆጥሯል ።'}]
  ```

  # Fine-tuning
 
@@ -9,18 +9,20 @@ metrics:
  - perplexity
  pipeline_tag: fill-mask
  widget:
+ - text: ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።
+ example_title: Example 1
+ - text: ባለፉት አምስት ዓመታት የአውሮጳ ሀገራት የጦር [MASK] ግዢ በእጅጉ ጨምሯል።
+ example_title: Example 2
  ---

  # bert-mini-amharic

  This model has the same architecture as [bert-mini](https://huggingface.co/prajjwal1/bert-mini) and was pretrained from scratch using the Amharic subsets of the [oscar](https://huggingface.co/datasets/oscar) and [mc4](https://huggingface.co/datasets/mc4) datasets, on a total of `137 Million` tokens. The tokenizer was trained from scratch on the same text corpus, and had a vocabulary size of 24k.
  It achieves the following results on the evaluation set:
+ - `Loss: 3.11`
+ - `Perplexity: 22.42`

+ Even though this model only has `10.7 Million` parameters, its performance is only slightly behind the 26x larger `279 Million` parameter [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) model on the same Amharic evaluation set.

  # How to use
  You can use this model directly with a pipeline for masked language modeling:
 
@@ -30,26 +32,26 @@ You can use this model directly with a pipeline for masked language modeling:
  >>> unmasker = pipeline('fill-mask', model='rasyosef/bert-mini-amharic')
  >>> unmasker("ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።")

+ [{'score': 0.6525624394416809,
+ 'token': 9617,
  'token_str': 'ዓመታት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመታት ተቆጥሯል ።'},
+ {'score': 0.22671808302402496,
+ 'token': 9345,
+ 'token_str': 'ዓመት',
+ 'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመት ተቆጥሯል ።'},
+ {'score': 0.07071439921855927,
+ 'token': 10898,
  'token_str': 'አመታት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመታት ተቆጥሯል ።'},
+ {'score': 0.02838180586695671,
+ 'token': 9913,
+ 'token_str': 'አመት',
+ 'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመት ተቆጥሯል ።'},
+ {'score': 0.006343209184706211,
+ 'token': 22459,
+ 'token_str': 'ዓመታትን',
+ 'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመታትን ተቆጥሯል ።'}]
  ```
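To work with the list that `unmasker(...)` returns, a small helper can pick the top candidate and report how much probability the returned candidates cover. This is only a sketch: the helper names are hypothetical, `predictions` is copied from the updated output above with the `sequence` field left out, and `run_live` assumes the `transformers` package is installed and the model can be downloaded.

```python
from typing import Dict, List

# Top-5 candidates copied from the updated README output ('sequence' omitted for brevity).
predictions: List[Dict] = [
    {"score": 0.6525624394416809, "token": 9617, "token_str": "ዓመታት"},
    {"score": 0.22671808302402496, "token": 9345, "token_str": "ዓመት"},
    {"score": 0.07071439921855927, "token": 10898, "token_str": "አመታት"},
    {"score": 0.02838180586695671, "token": 9913, "token_str": "አመት"},
    {"score": 0.006343209184706211, "token": 22459, "token_str": "ዓመታትን"},
]


def best_candidate(preds: List[Dict]) -> Dict:
    """Return the prediction dict with the highest score."""
    return max(preds, key=lambda p: p["score"])


def covered_probability(preds: List[Dict]) -> float:
    """Sum of the scores of the returned candidates (cannot exceed 1.0)."""
    return sum(p["score"] for p in preds)


def run_live(text: str, top_k: int = 5) -> List[Dict]:
    """Query the real model; needs `transformers` and network access (not called here)."""
    from transformers import pipeline  # imported lazily so the helpers work without it

    unmasker = pipeline("fill-mask", model="rasyosef/bert-mini-amharic")
    return unmasker(text, top_k=top_k)


top = best_candidate(predictions)
print(top["token_str"], round(top["score"], 3))
print("probability covered by the top 5:", round(covered_probability(predictions), 3))
```

The `top_k` argument sets how many candidates the pipeline returns; the output shown in the README lists five.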
  
  # Fine-tuning