PersianBPETokenizer Model Card
Model Details
Model Description
The PersianBPETokenizer is a custom tokenizer specifically designed for the Persian (Farsi) language. It leverages the Byte-Pair Encoding (BPE) algorithm to create a robust vocabulary that can effectively handle the unique characteristics of Persian text. This tokenizer is optimized for use with advanced language models like BERT and RoBERTa, making it a valuable tool for various Persian NLP tasks.
Model Type
- Tokenization Algorithm: Byte-Pair Encoding (BPE)
- Normalization: NFD, StripAccents, Lowercase, Strip, Replace (ZWNJ)
- Pre-tokenization: Whitespace
- Post-processing: TemplateProcessing for special tokens
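For readers unfamiliar with BPE, the merge loop at its core can be sketched in a few lines of plain Python. This is a toy illustration only, not the tokenizers library's implementation: the three-word corpus, its frequencies, and the helper names most_frequent_pair and merge_pair are all hypothetical.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    # On ties, max() returns the first pair that was counted.
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Hypothetical toy corpus: word -> frequency. Each word starts as single characters.
corpus = {"کتاب": 5, "کتابخانه": 3, "کتابی": 2}
words = {tuple(word): freq for word, freq in corpus.items()}

# Three merge steps; a real trainer repeats this until the target vocabulary size.
for _ in range(3):
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print("merged:", pair)
```

After three merges the shared stem of all three words has become a single symbol, which is how BPE ends up representing frequent Persian stems as one token and rarer suffixes as separate pieces.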
Model Version
- Version: 1.0
- Date: September 6, 2024
License
- License: MIT
Developers
- Developed by: Mohammad Shojaei
- Contact: [email protected]
Citation
If you use this tokenizer in your research, please cite it as:
Mohammad Shojaei. (2024). PersianBPETokenizer [Software]. Available at https://huggingface.co/mshojaei77/PersianBPETokenizer.
Model Use
Intended Use
- Primary Use: Tokenization of Persian text for NLP tasks such as text classification, named entity recognition, machine translation, and more.
- Secondary Use: Integration with pre-trained language models like BERT and RoBERTa for fine-tuning on Persian datasets.
Out-of-Scope Use
- Non-Persian Text: This tokenizer is not designed for languages other than Persian.
- Non-NLP Tasks: It is not intended for use in non-NLP tasks such as image processing or audio analysis.
Data
Training Data
- Dataset: mshojaei77/PersianTelegramChannels
- Description: A rich collection of Persian text extracted from various Telegram channels. This dataset provides a diverse range of language patterns and vocabulary, making it suitable for training a general-purpose Persian tokenizer.
- Size: 60,730 samples
Data Preprocessing
- Normalization: Applied NFD Unicode normalization, removed accents, converted text to lowercase, stripped leading and trailing whitespace, and removed ZWNJ characters.
- Pre-tokenization: Used whitespace pre-tokenization.
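The effect of these normalization steps on Persian text can be approximated with the standard library alone. The sketch below is an emulation for illustration, not the tokenizer's actual normalizer: the function name normalize_persian is hypothetical, and dropping every character in a Unicode Mark category is an assumption about how accent stripping behaves.

```python
import unicodedata

ZWNJ = "\u200c"  # zero-width non-joiner, common inside Persian compound words


def normalize_persian(text: str) -> str:
    # NFD: decompose characters so combining marks become separate code points.
    text = unicodedata.normalize("NFD", text)
    # Strip accents: drop combining marks (assumed here: any Mark category).
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("M"))
    # Lowercase: no effect on Persian letters, but folds embedded Latin text.
    text = text.lower()
    # Strip: remove leading and trailing whitespace.
    text = text.strip()
    # Remove ZWNJ, as described in the preprocessing list above.
    return text.replace(ZWNJ, "")


print(normalize_persian("  می\u200cروم  "))  # ZWNJ removed, halves joined
print(normalize_persian("ک\u0650تاب"))       # kasra diacritic removed
print(normalize_persian("\u0622ب"))          # alef with madda loses its madda
```

One consequence worth knowing: under NFD, alef with madda (U+0622) decomposes into a plain alef plus a combining madda, so stripping marks turns "آب" into "اب". Text decoded from token IDs will therefore not always match the raw input character for character.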
Performance
Evaluation Metrics
- Tokenization Accuracy: The tokenizer has been tested on various Persian sentences and tokenized and encoded them as expected. No quantitative benchmark is reported in this card.
- Compatibility: Fully compatible with Hugging Face Transformers, ensuring seamless integration with advanced language models.
Known Limitations
- Vocabulary Size: The current vocabulary size is based on the training data. For very specialized domains, additional fine-tuning or training on domain-specific data may be required.
- Out-of-Vocabulary Words: Rare or domain-specific words may be tokenized as unknown tokens ([UNK]).
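One way to judge whether these limitations matter for a given domain is to measure how often [UNK] appears on a sample of that domain's text. The helper below is a minimal sketch; the name unk_rate and the sample token lists are hypothetical, and in practice the lists would come from calling the tokenizer's tokenize method on your own sentences.

```python
def unk_rate(token_lists, unk_token="[UNK]"):
    """Fraction of all tokens that are the unknown token."""
    total = sum(len(tokens) for tokens in token_lists)
    unknown = sum(tokens.count(unk_token) for tokens in token_lists)
    return unknown / total if total else 0.0


# Hypothetical tokenizer output for two sentences.
sample = [
    ["سلام", "[UNK]"],
    ["روز", "خوب"],
]
print(f"UNK rate: {unk_rate(sample):.2%}")
```

A rate that climbs noticeably on domain text suggests retraining or extending the vocabulary on that domain's data, as noted above.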
Training Procedure
Training Steps
- Environment Setup: Installed necessary libraries (datasets, tokenizers, transformers).
- Data Preparation: Loaded the mshojaei77/PersianTelegramChannels dataset and created a batch iterator for efficient training.
- Tokenizer Model: Initialized the tokenizer with a BPE model and applied normalization and pre-tokenization steps.
- Training: Trained the tokenizer on the Persian text corpus using the BPE algorithm.
- Post-processing: Set up post-processing to handle special tokens.
- Saving: Saved the tokenizer to disk for future use.
- Compatibility: Converted the tokenizer to a PreTrainedTokenizerFast object for compatibility with Hugging Face Transformers.
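The steps above can be sketched with the tokenizers library as follows. This is a reconstruction under stated assumptions, not the author's training script: the three-sentence corpus stands in for the real dataset, vocab_size=200 is a placeholder because the card does not state the vocabulary size, the BERT-style template is assumed from the special tokens listed in this card, and the output file name is hypothetical.

```python
import os
import tempfile

from tokenizers import Tokenizer, normalizers, pre_tokenizers
from tokenizers.models import BPE
from tokenizers.normalizers import NFD, Lowercase, Replace, Strip, StripAccents
from tokenizers.processors import TemplateProcessing
from tokenizers.trainers import BpeTrainer

special_tokens = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]

# BPE model plus the normalization and pre-tokenization steps named in this card.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence(
    [NFD(), StripAccents(), Lowercase(), Strip(), Replace("\u200c", "")]
)
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Hypothetical in-memory corpus standing in for the real training data.
corpus = [
    "سلام، چطور هستید؟",
    "امیدوارم روز خوبی داشته باشید",
    "این یک جمله آزمایشی است",
]

# vocab_size is an assumption; the card does not report the value used.
trainer = BpeTrainer(vocab_size=200, special_tokens=special_tokens, show_progress=False)
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Assumed BERT-style template: [CLS] sentence [SEP], and [CLS] A [SEP] B [SEP] for pairs.
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

# Save to disk (hypothetical file name in the system temp directory).
output_path = os.path.join(tempfile.gettempdir(), "persian_bpe_sketch.json")
tokenizer.save(output_path)

print(tokenizer.encode("سلام دنیا").tokens)
```

The saved JSON file can then be loaded into Transformers, for example by passing it as tokenizer_file to PreTrainedTokenizerFast together with the special-token names, which corresponds to the final compatibility step.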
Hyperparameters
- Special Tokens: [UNK], [CLS], [SEP], [PAD], [MASK]
- Batch Size: 1000 samples per batch
- Normalization Steps: NFD, StripAccents, Lowercase, Strip, Replace (ZWNJ)
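The batch size refers to how many samples are handed to the trainer at a time. A minimal sketch of such a batch iterator is shown below; the function name batch_iterator and the generated sample texts are hypothetical, and with a real dataset the list would be the dataset's text column.

```python
def batch_iterator(texts, batch_size=1000):
    """Yield consecutive slices of `texts`, each holding up to `batch_size` samples."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


# Hypothetical stand-in for the training texts.
texts = [f"نمونه {i}" for i in range(2500)]

batches = list(batch_iterator(texts))
print([len(batch) for batch in batches])
```

The training method train_from_iterator in the tokenizers library accepts an iterator that yields lists of strings, so a generator like this can be passed to it directly.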
How to Use
Installation
To use the PersianBPETokenizer, first install the required libraries:
pip install -q --upgrade datasets tokenizers transformers
Loading the Tokenizer
You can load the tokenizer using the Hugging Face Transformers library:
from transformers import AutoTokenizer
persian_tokenizer = AutoTokenizer.from_pretrained("mshojaei77/PersianBPETokenizer")
Tokenization Example
test_sentence = "سلام، چطور هستید؟ امیدوارم روز خوبی داشته باشید"
tokens = persian_tokenizer.tokenize(test_sentence)
print("Tokens:", tokens)
encoded = persian_tokenizer(test_sentence)
print("Input IDs:", encoded["input_ids"])
print("Decoded:", persian_tokenizer.decode(encoded["input_ids"]))
Acknowledgments
- Dataset: mshojaei77/PersianTelegramChannels
- Libraries: Hugging Face datasets, tokenizers, and transformers