---
library_name: transformers
license: cc-by-4.0
datasets:
- HuggingFaceFW/fineweb
- castorini/wura
language:
- am
- en
---

# AmhT5 Tokenizer

A T5 Tokenizer trained for the Amharic language.

The tokenizer has a fertility rate of 1.8328, i.e. it produces about 1.83 subword tokens per word on average (lower is better).
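
As a rough illustration (not the author's evaluation script, which is in the notebook below), fertility can be estimated by tokenizing a sample and dividing the token count by the whitespace-word count; the sample texts here are placeholders:

```python
from transformers import MT5TokenizerFast

tokenizer = MT5TokenizerFast.from_pretrained("yonas/AmhT5-tokenizer", legacy=False)

# Tiny placeholder sample; the reported 1.8328 was measured on a
# real evaluation corpus.
sample_texts = [
    "αŠ¨αˆ˜α‹²αŠ“α‹‹ α‰ α‰…αˆ­α‰₯ αˆ­α‰€α‰΅ αˆ‹α‹­ α‰ αˆα‰΅αŒˆαŠ˜α‹ αŠ¨α‰°αˆ›",
    "A Tokenizer trained for Amharic language.",
]

total_tokens = sum(len(tokenizer.tokenize(t)) for t in sample_texts)
total_words = sum(len(t.split()) for t in sample_texts)  # whitespace words
print(f"fertility: {total_tokens / total_words:.4f}")
```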

Notebook used for training: https://colab.research.google.com/drive/1B-pca9jpadTHz9WYTWXzPM-A1cTaltYo#scrollTo=wLslLc0D6TnA



## Model Details

### Model Description


An MT5Tokenizer-based Amharic and English tokenizer trained on the [Fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) and [Wura](https://huggingface.co/datasets/castorini/wura) datasets.
It aims to represent Amharic better than the stock MT5 tokenizer while remaining just as capable on English.
To balance the data, I used only 3 million document samples. The vocabulary size is the same as that of `google/mt5-small`.
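
The full training code is in the notebook linked above. As an illustration only, the sketch below shows one plausible way to train such a tokenizer with Hugging Face's `train_new_from_iterator`; the `amh` Wura config, the `text` column name, and the round-robin mixing are assumptions, not the author's actual setup:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Stream both corpora; the "amh" Wura config and the "text" column
# name are assumptions -- check each dataset card before running.
fineweb = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
wura = load_dataset("castorini/wura", "amh", split="train", streaming=True)

def round_robin(*datasets):
    """Alternate documents from each corpus (a simple balancing choice)."""
    iterators = [iter(d) for d in datasets]
    while iterators:
        for it in list(iterators):
            try:
                yield next(it)
            except StopIteration:
                iterators.remove(it)

def batch_iterator(examples, limit=3_000_000, batch_size=1_000):
    """Yield batches of raw text, stopping after `limit` documents."""
    batch = []
    for i, example in enumerate(examples):
        if i >= limit:
            break
        batch.append(example["text"])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# Keep the mt5-small vocabulary size, as stated above.
base = AutoTokenizer.from_pretrained("google/mt5-small")
tokenizer = base.train_new_from_iterator(
    batch_iterator(round_robin(fineweb, wura)), vocab_size=base.vocab_size
)
tokenizer.save_pretrained("AmhT5-tokenizer")
```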

### MT5 Tokenizer vs. AmhT5 Tokenizer

```python
from transformers import MT5TokenizerFast

# Baseline: the stock MT5 tokenizer
mt5 = "google/mt5-small"
TOKENIZER = MT5TokenizerFast.from_pretrained(mt5, legacy=False)
tokens = TOKENIZER.tokenize("αŠ¨αˆ˜α‹²αŠ“α‹‹ α‰ α‰…αˆ­α‰₯ αˆ­α‰€α‰΅ αˆ‹α‹­ α‰ αˆα‰΅αŒˆαŠ˜α‹ αŠ¨α‰°αˆ›")

print(len(tokens)) # 20
print(tokens)
# ['β–αŠ¨αˆ˜', 'α‹²', 'αŠ“', 'α‹‹', '▁በ', 'α‰…αˆ­', 'α‰₯', '▁', 'ር', 'ቀ', 'ቡ', '▁', 'αˆ‹α‹­', 'β–α‰ αˆ', 'ቡ', 'ገ', 'ኘ', 'ው', 'β–αŠ¨α‰°', 'αˆ›']


tokens = TOKENIZER.tokenize("A Tokenizer trained for Amharic language.")

print(len(tokens)) # 11
print(tokens)
# ['▁A', '▁', 'Token', 'izer', '▁train', 'ed', '▁for', '▁Am', 'haric', '▁language', '.']


# AmhT5: the tokenizer from this repository
amhT5 = "yonas/AmhT5-tokenizer"
TOKENIZER = MT5TokenizerFast.from_pretrained(amhT5, legacy=False)
tokens = TOKENIZER.tokenize("αŠ¨αˆ˜α‹²αŠ“α‹‹ α‰ α‰…αˆ­α‰₯ αˆ­α‰€α‰΅ αˆ‹α‹­ α‰ αˆα‰΅αŒˆαŠ˜α‹ αŠ¨α‰°αˆ›")

print(len(tokens)) # 11
print(tokens)
# ['β–αŠ¨', 'αˆ˜α‹²αŠ“', 'α‹‹', '▁በ', 'α‰…αˆ­α‰₯', '▁', 'αˆ­α‰€α‰΅', 'β–αˆ‹α‹­', 'β–α‰ αˆα‰΅', 'αŒˆαŠ˜α‹', 'β–αŠ¨α‰°αˆ›']


tokens = TOKENIZER.tokenize("A Tokenizer trained for Amharic language.")

print(len(tokens)) # 7
print(tokens)
# ['▁A', '▁Token', 'izer', '▁trained', '▁for', '▁Amharic', '▁language.']
```
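
On these two sample sentences, AmhT5 reduces the token count from 20 to 11 for Amharic and from 11 to 7 for English, which is the improvement the lower fertility rate reflects.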