---
language:
- ar
- en
datasets:
- fka/awesome-chatgpt-prompts
- open-r1/codeforces
license: mit
---

## Miscovery Tokenizer

A SentencePiece unigram tokenizer trained on a mix of Arabic and English text, with a vocabulary size of 70,000 tokens.
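
For reference, here is a minimal sketch of how a comparable tokenizer could be trained with the `sentencepiece` Python package. Only the vocabulary size and model type come from this card; the corpus file name and the `character_coverage` setting are assumptions, not the recipe actually used.

```python
import sentencepiece as spm

# Hypothetical corpus file: one sentence per line, Arabic and English mixed.
spm.SentencePieceTrainer.train(
    input="arabic_english_corpus.txt",   # assumption: not the actual training file
    model_prefix="miscovery_tokenizer",  # writes miscovery_tokenizer.model / .vocab
    vocab_size=70000,                    # matches the 70,000-token vocabulary
    model_type="unigram",                # matches the unigram model type
    character_coverage=0.9995,           # assumption: common value for multi-script text
)
```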

## Training Data

This tokenizer was trained on:

- Arabic Quran
- fka/awesome-chatgpt-prompts
- open-r1/codeforces
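
The two Hub datasets listed above can be pulled with the `datasets` library; a quick sketch (the `train` split name is an assumption, check each dataset card for the actual schema):

```python
from datasets import load_dataset

# Fetch the two public corpora listed above from the Hugging Face Hub
prompts = load_dataset("fka/awesome-chatgpt-prompts", split="train")
codeforces = load_dataset("open-r1/codeforces", split="train")

print(prompts)
print(codeforces)
```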

## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("miscovery/arabic-english-tokenizer")

# Example usage
text = "بسم الله الرحمن الرحيم Hello World"
encoded = tokenizer(text)
print(encoded)
```
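
To see the subword pieces rather than raw IDs, the standard `tokenize` and `decode` methods work as usual; this continues the example above:

```python
# Inspect the subword pieces behind the IDs printed above
tokens = tokenizer.tokenize(text)
print(tokens)

# Round-trip: decode the IDs back to text, skipping special tokens
print(tokenizer.decode(encoded["input_ids"], skip_special_tokens=True))
```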

## Features

- Vocabulary size: 70,000
- Model type: Unigram
- Model max length: 512 tokens
- Handles both Arabic and English text
- Supports Arabic normalization
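
A quick sanity-check sketch for the properties above. The diacritics comparison is only an illustrative probe of the Arabic normalization, since the exact normalization rules are not documented on this card:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("miscovery/arabic-english-tokenizer")

# Vocabulary size and maximum sequence length from the Features list
print(len(tokenizer))              # expected: 70000
print(tokenizer.model_max_length)  # expected: 512

# Probe normalization: if diacritics are stripped during normalization,
# both forms below should produce the same token sequence.
print(tokenizer.tokenize("بِسْمِ اللَّهِ"))  # with diacritics
print(tokenizer.tokenize("بسم الله"))       # without diacritics
```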