# Mana Tokenizer
The Mana Tokenizer is a custom-trained SentencePiece tokenizer for Persian text, built from a combination of the Persian Wikipedia and Ganjoor datasets. It uses the Unigram model type, which is well suited to the characteristics of Persian text.
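For reference, the sketch below shows how a tokenizer with these properties could be trained with the `sentencepiece` library. It is a minimal sketch, not the actual training setup: the input path and model prefix are placeholders, while the vocabulary size and character coverage come from the Statistics section below.

```python
import sentencepiece as spm

# Hypothetical training call: file name and model prefix are placeholders.
# Vocabulary size and character coverage match the Statistics section.
spm.SentencePieceTrainer.train(
    input="persian_corpus.txt",   # one text sample per line (Wikipedia + Ganjoor)
    model_prefix="mana",          # writes mana.model and mana.vocab
    model_type="unigram",
    vocab_size=199_997,
    character_coverage=0.999,
    unk_piece="<unk>",
    bos_piece="<s>",
    eos_piece="</s>",
    pad_piece="<pad>",
    pad_id=3,                     # pad is disabled by default; ID 3 is an assumption
)
```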
## Special Tokens
- UNK token: `<unk>`
- BOS token: `<s>`
- EOS token: `</s>`
- PAD token: `<pad>`
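Once the tokenizer is loaded (see the Usage section below), you can check how these special tokens map to vocabulary IDs. This is a small illustrative check; the IDs themselves are not documented on this card, so inspect the printed output rather than assuming values.

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("tspersian/mana_tokenizer")

# Print each special token alongside its vocabulary ID
for token in ["<unk>", "<s>", "</s>", "<pad>"]:
    print(token, tokenizer.convert_tokens_to_ids(token))
```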
## Usage

You can load this tokenizer with the `transformers` library as follows:
```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("tspersian/mana_tokenizer")

text = "این یک تست است."

# Encode the text into input IDs (and an attention mask)
encoded = tokenizer(text)
print(f"Encoded: {encoded}")

# Decode the input IDs back into a string
decoded = tokenizer.decode(encoded["input_ids"])
print(f"Decoded: {decoded}")
```
## Statistics

- Vocabulary size: 199,997
- Character coverage: 99.9%
- Total text samples: 1,022,675
## License
This tokenizer is licensed under the MIT License.