---
language:
- ar
- en
datasets:
- fka/awesome-chatgpt-prompts
- open-r1/codeforces
license: mit
---

## Miscovery Tokenizer

A SentencePiece unigram tokenizer trained on a mix of Arabic and English text, with a vocabulary size of 70,000 tokens.
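
For reference, here is a minimal sketch of how a comparable tokenizer could be trained with the `sentencepiece` Python package. Only the vocabulary size and model type come from this card; the corpus file name and the `character_coverage` setting are assumptions, not the recipe actually used.

```python
import sentencepiece as spm

# Hypothetical corpus file: one sentence per line, Arabic and English mixed.
spm.SentencePieceTrainer.train(
    input="arabic_english_corpus.txt",   # assumption: not the actual training file
    model_prefix="miscovery_tokenizer",  # writes miscovery_tokenizer.model / .vocab
    vocab_size=70000,                    # matches the 70,000-token vocabulary
    model_type="unigram",                # matches the unigram model type
    character_coverage=0.9995,           # assumption: common value for multi-script text
)
```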

## Training Data

This tokenizer was trained on:

- Arabic Quran
- fka/awesome-chatgpt-prompts
- open-r1/codeforces
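
The two Hub datasets listed above can be pulled with the `datasets` library; a quick sketch (the `train` split name is an assumption, check each dataset card for the actual schema):

```python
from datasets import load_dataset

# Fetch the two public corpora listed above from the Hugging Face Hub
prompts = load_dataset("fka/awesome-chatgpt-prompts", split="train")
codeforces = load_dataset("open-r1/codeforces", split="train")

print(prompts)
print(codeforces)
```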

## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("miscovery/arabic-english-tokenizer")

# Example usage
text = "بسم الله الرحمن الرحيم Hello World"
encoded = tokenizer(text)
print(encoded)
```
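
To see the subword pieces rather than raw IDs, the standard `tokenize` and `decode` methods work as usual; this continues the example above:

```python
# Inspect the subword pieces behind the IDs printed above
tokens = tokenizer.tokenize(text)
print(tokens)

# Round-trip: decode the IDs back to text, skipping special tokens
print(tokenizer.decode(encoded["input_ids"], skip_special_tokens=True))
```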

## Features

- Vocabulary size: 70,000
- Model type: Unigram
- Model max length: 512 tokens
- Handles both Arabic and English text
- Supports Arabic normalization
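
A quick sanity-check sketch for the properties above. The diacritics comparison is only an illustrative probe of the Arabic normalization, since the exact normalization rules are not documented on this card:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("miscovery/arabic-english-tokenizer")

# Vocabulary size and maximum sequence length from the Features list
print(len(tokenizer))              # expected: 70000
print(tokenizer.model_max_length)  # expected: 512

# Probe normalization: if diacritics are stripped during normalization,
# both forms below should produce the same token sequence.
print(tokenizer.tokenize("بِسْمِ اللَّهِ"))  # with diacritics
print(tokenizer.tokenize("بسم الله"))       # without diacritics
```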