# Mana Tokenizer
The Mana Tokenizer is a custom-trained SentencePiece tokenizer for Persian text, built from a combination of the Persian Wikipedia and Ganjoor datasets. It uses the Unigram model type, which is well suited to the characteristics of Persian text.
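For reference, the sketch below shows how a tokenizer with these properties could be trained with the `sentencepiece` library. It is a minimal sketch, not the actual training setup: the input path and model prefix are placeholders, while the vocabulary size and character coverage come from the Statistics section below.

```python
import sentencepiece as spm

# Hypothetical training call: file name and model prefix are placeholders.
# Vocabulary size and character coverage match the Statistics section.
spm.SentencePieceTrainer.train(
    input="persian_corpus.txt",   # one text sample per line (Wikipedia + Ganjoor)
    model_prefix="mana",          # writes mana.model and mana.vocab
    model_type="unigram",
    vocab_size=199_997,
    character_coverage=0.999,
    unk_piece="<unk>",
    bos_piece="<s>",
    eos_piece="</s>",
    pad_piece="<pad>",
    pad_id=3,                     # pad is disabled by default; ID 3 is an assumption
)
```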
## Special Tokens
- UNK token: `<unk>`
- BOS token: `<s>`
- EOS token: `</s>`
- PAD token: `<pad>`
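Once the tokenizer is loaded (see the Usage section below), you can check how these special tokens map to vocabulary IDs. This is a small illustrative check; the IDs themselves are not documented on this card, so inspect the printed output rather than assuming values.

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("tspersian/mana_tokenizer")

# Print each special token alongside its vocabulary ID
for token in ["<unk>", "<s>", "</s>", "<pad>"]:
    print(token, tokenizer.convert_tokens_to_ids(token))
```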
## Usage

You can load this tokenizer with the `transformers` library as follows:
```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("tspersian/mana_tokenizer")

text = "این یک تست است."

# Encode the text into input IDs (and an attention mask)
encoded = tokenizer(text)
print(f"Encoded: {encoded}")

# Decode the input IDs back into a string
decoded = tokenizer.decode(encoded["input_ids"])
print(f"Decoded: {decoded}")
```
## Statistics

- Vocabulary size: 199,997
- Character coverage: 99.9%
- Total text samples: 1,022,675
## License
This tokenizer is licensed under the MIT License.