---
library_name: transformers
license: cc-by-4.0
datasets:
- HuggingFaceFW/fineweb
- castorini/wura
language:
- am
- en
---
# AmhT5 Tokenizer
A T5 Tokenizer trained for the Amharic language.
The tokenizer has a fertility rate of 1.8328, i.e., it produces about 1.83 subword tokens per word on average.
Notebook used for training: https://colab.research.google.com/drive/1B-pca9jpadTHz9WYTWXzPM-A1cTaltYo#scrollTo=wLslLc0D6TnA
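A minimal sketch of how a fertility figure like this can be computed. The evaluation corpus and whitespace word-splitting used for the reported 1.8328 are not specified here, so `samples` is a hypothetical stand-in:

```python
from transformers import MT5TokenizerFast

# Fertility = total subword tokens / total whitespace-delimited words.
# `samples` is a placeholder; the actual held-out corpus is not specified.
tokenizer = MT5TokenizerFast.from_pretrained("yonas/AmhT5-tokenizer", legacy=False)

def fertility(samples):
    n_tokens = sum(len(tokenizer.tokenize(s)) for s in samples)
    n_words = sum(len(s.split()) for s in samples)
    return n_tokens / n_words

samples = ["A Tokenizer trained for Amharic language."]
print(f"fertility: {fertility(samples):.4f}")
```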
## Model Details
### Model Description
An MT5Tokenizer-based Amharic and English tokenizer trained on the [Fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) and [Wura](https://huggingface.co/datasets/castorini/wura) datasets.
The goal is a tokenizer that represents Amharic well while remaining effective for English.
To balance the training data, I used only 3 million document samples. The vocabulary size of this tokenizer is the same as that of `google/mt5-small`.
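The exact training code is in the notebook linked above; the sketch below shows one way such a retraining can be set up with `train_new_from_iterator`, which relearns the vocabulary while keeping the base tokenizer's special tokens. The Wura config name, column names, and the interleaving scheme are assumptions for illustration, not the notebook's code:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Start from google/mt5-small so the new vocabulary has the same size
# and special tokens as the original.
base = AutoTokenizer.from_pretrained("google/mt5-small")

# Stream English (Fineweb) and Amharic (Wura) documents; the "amh" config
# and "text" column are assumptions, not verified against the datasets.
fineweb = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
wura = load_dataset("castorini/wura", "amh", split="train", streaming=True)

def corpus(limit=3_000_000):
    # Alternate the two sources, stopping after `limit` documents in total.
    count = 0
    for en_doc, am_doc in zip(fineweb, wura):
        for doc in (en_doc, am_doc):
            if count >= limit:
                return
            count += 1
            yield doc["text"]

new_tok = base.train_new_from_iterator(corpus(), vocab_size=len(base))
new_tok.save_pretrained("AmhT5-tokenizer")
```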
### MT5 Tokenizer vs. AmhT5 Tokenizer
```python
from transformers import MT5TokenizerFast

# Baseline: the stock google/mt5-small tokenizer
mt5 = "google/mt5-small"
TOKENIZER = MT5TokenizerFast.from_pretrained(mt5, legacy=False)

tokens = TOKENIZER.tokenize("α¨αα²αα α ααα₯ ααα΅ αα α αα΅ααα α¨α°α")
print(len(tokens))  # 20
print(tokens)
# ['▁α¨α', 'α²', 'α', 'α', '▁α ', 'αα', 'α₯', '▁', 'α', 'α', 'α΅', '▁', 'αα', '▁α α', 'α΅', 'α', 'α', 'α', '▁α¨α°', 'α']

tokens = TOKENIZER.tokenize("A Tokenizer trained for Amharic language.")
print(len(tokens))  # 11
print(tokens)
# ['▁A', '▁', 'Token', 'izer', '▁train', 'ed', '▁for', '▁Am', 'haric', '▁language', '.']

# AmhT5: the tokenizer retrained on Amharic and English
amhT5 = "yonas/AmhT5-tokenizer"
TOKENIZER = MT5TokenizerFast.from_pretrained(amhT5, legacy=False)

tokens = TOKENIZER.tokenize("α¨αα²αα α ααα₯ ααα΅ αα α αα΅ααα α¨α°α")
print(len(tokens))  # 11
print(tokens)
# ['▁α¨', 'αα²α', 'α', '▁α ', 'ααα₯', '▁', 'ααα΅', '▁αα', '▁α αα΅', 'ααα', '▁α¨α°α']

tokens = TOKENIZER.tokenize("A Tokenizer trained for Amharic language.")
print(len(tokens))  # 7
print(tokens)
# ['▁A', '▁Token', 'izer', '▁trained', '▁for', '▁Amharic', '▁language.']
```
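On these examples, the retrained tokenizer cuts the Amharic sentence from 20 tokens to 11 and the English sentence from 11 tokens to 7, keeping whole words (e.g. `▁trained`, `▁Amharic`) intact far more often.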