---
language:
  - ar
  - en
datasets:
  - fka/awesome-chatgpt-prompts
  - open-r1/codeforces
license: mit
---

# Miscovery Tokenizer

A SentencePiece unigram tokenizer trained on a mix of Arabic and English text, with a vocabulary size of 100,000 tokens.

## Training Data

This tokenizer was trained on:

- The Arabic Quran
- fka/awesome-chatgpt-prompts
- open-r1/codeforces

## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("miscovery/tokenizer")

# Example usage
text = "بسم الله الرحمن الرحيم"
encoded = tokenizer(text)
print(encoded)
```
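
You can also inspect the subword pieces and decode the IDs back to text. A minimal round-trip sketch, continuing the example above (the exact token strings depend on the trained vocabulary):

```python
# Subword pieces produced by the unigram model
tokens = tokenizer.tokenize(text)
print(tokens)

# Decode the token IDs back to the original string
decoded = tokenizer.decode(encoded["input_ids"], skip_special_tokens=True)
print(decoded)
```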

## Features

- Vocabulary size: 100,000
- Model type: Unigram (SentencePiece)
- Model max length: 512 tokens
- Handles both Arabic and English text (see the sketch below)
- Supports Arabic normalization
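
Because the model max length is 512, longer inputs should be truncated at tokenization time. Below is a minimal sketch of batching mixed Arabic and English text; the example sentences are placeholders, and if the tokenizer defines a padding token you can additionally pass `padding="max_length"`:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("miscovery/tokenizer")

# Mixed Arabic/English batch, truncated to the 512-token model max length
batch = tokenizer(
    ["بسم الله الرحمن الرحيم", "Solve this Codeforces problem in Python."],
    truncation=True,
    max_length=512,
)
print([len(ids) for ids in batch["input_ids"]])
```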