Swahili BERT WordPiece Tokenizer
A BERT WordPiece tokenizer trained specifically for Swahili. It tokenizes Swahili text for BERT-based models and other transformer architectures.
Model Details
- Model type: BERT WordPiece Tokenizer
- Language: Swahili
- Vocabulary size: 25,000 tokens
- Training data: publicly available online data and proprietary data from the 3D & Robotics Lab
Features
- Specifically optimized for Swahili language patterns
- Handles common Swahili morphological structures
- Includes standard BERT special tokens ([CLS], [SEP], [MASK], [PAD], [UNK])
- Compatible with the Hugging Face Transformers library (loads through AutoTokenizer)
Usage
from transformers import AutoTokenizer
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("Benjamin-png/bert-tokenizer-swahili_25000_minfreq_1")
# Example usage
text = "Habari za asubuhi"
encoded = tokenizer(text)
# tokens() is a method on the returned encoding; the output includes [CLS] and [SEP]
print(encoded.tokens())
Training Details
The tokenizer was trained with the following specifications:
- Vocabulary size: 25,000 tokens
- Minimum frequency: 1
- Special tokens: [PAD], [UNK], [CLS], [SEP], [MASK]
- Clean text: True
- Handle Chinese characters: False
- Strip accents: True
- Lowercase: True
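The specifications above map closely onto the parameters of `BertWordPieceTokenizer` in the Hugging Face `tokenizers` library. The following is a minimal sketch of how a tokenizer with these settings could be trained, not a record of the actual training run. The three-sentence `corpus` is a hypothetical stand-in, since the real training data is not distributed with this card.

```python
from tokenizers import BertWordPieceTokenizer

# Hypothetical stand-in corpus; the actual training data is not public.
corpus = [
    "Habari za asubuhi",
    "Ninafurahi kukutana nawe",
    "Karibu Tanzania",
]

# Normalization options taken from the list above.
tokenizer = BertWordPieceTokenizer(
    clean_text=True,
    handle_chinese_chars=False,
    strip_accents=True,
    lowercase=True,
)

# Trainer options taken from the list above.
tokenizer.train_from_iterator(
    corpus,
    vocab_size=25000,
    min_frequency=1,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
    show_progress=False,
)

# On a tiny corpus the learned vocabulary stays far below the 25,000 cap.
print(tokenizer.get_vocab_size())
```

With a minimum frequency of 1, every word piece seen even once is eligible for the vocabulary, so the vocabulary size cap is the only effective limit on what is kept.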
Example Outputs
Input: "Habari za asubuhi"
Tokens: ['[CLS]', 'habari', 'za', 'asubuhi', '[SEP]']
Input: "Ninafurahi kukutana nawe"
Tokens: ['[CLS]', 'ninafurahi', 'kukutana', 'nawe', '[SEP]']
Input: "Karibu Tanzania"
Tokens: ['[CLS]', 'karibu', 'tanzania', '[SEP]']
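The lowercased tokens in these examples follow from the normalization settings (clean text, strip accents, lowercase). As a rough illustration, the sketch below approximates that normalization with the Python standard library only. The helper names `normalize` and `preview` are hypothetical, and the sketch deliberately ignores punctuation handling and subword splitting, so it reproduces the outputs above only for inputs whose words are all in the vocabulary.

```python
import unicodedata


def normalize(text: str) -> str:
    """Approximate BERT-style normalization: clean, lowercase, strip accents."""
    cleaned = []
    for ch in text:
        if ch in ("\x00", "\ufffd"):
            continue  # drop NUL and the Unicode replacement character
        if ch.isspace():
            cleaned.append(" ")  # map every whitespace character to a space
        elif unicodedata.category(ch).startswith("C"):
            continue  # drop remaining control characters
        else:
            cleaned.append(ch)
    lowered = "".join(cleaned).lower()
    # Strip accents: decompose, then drop combining marks (category Mn).
    decomposed = unicodedata.normalize("NFD", lowered)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def preview(text: str) -> list[str]:
    """Whole-word preview with the [CLS] and [SEP] markers added."""
    return ["[CLS]", *normalize(text).split(), "[SEP]"]


print(preview("Karibu Tanzania"))
```

Because accents are stripped and case is folded before the vocabulary lookup, inputs that differ only in case or diacritics map to the same tokens.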
Limitations
- The vocabulary reflects only the training data described under Model Details
- Performance may vary for specialized domains or dialects not well-represented in the training data
- Rare or complex Swahili words might be split into subwords
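To make the subword splitting concrete, here is a small sketch of the greedy longest-match-first lookup that WordPiece tokenizers apply to each word. The toy vocabulary is hypothetical and chosen only for illustration; the splits this tokenizer actually produces depend on its learned 25,000-token vocabulary.

```python
def wordpiece(word: str, vocab: set[str], unk: str = "[UNK]", prefix: str = "##") -> list[str]:
    """Split one normalized word into word pieces by greedy longest match."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation pieces carry "##"
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not the real one.
toy_vocab = {"habari", "nina", "##soma", "##furahi"}

print(wordpiece("habari", toy_vocab))
print(wordpiece("ninasoma", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```

In this toy setup a word in the vocabulary stays whole, a word outside it is covered by a prefix piece plus a `##` continuation, and a word with no matching pieces falls back to `[UNK]`.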
Intended Use
This tokenizer is designed for:
- Pre-processing Swahili text for BERT-based models
- Natural Language Processing tasks in Swahili
- Text analysis and processing applications
Citation
If you use this tokenizer in your research, please cite:
@misc{swahili-bert-tokenizer,
author = {Benjamin-png},
title = {BERT WordPiece Tokenizer for Swahili},
year = {2025},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/Benjamin-png/bert-tokenizer-swahili}}
}
Contact
For questions and feedback, please open an issue in the GitHub repository or contact through Hugging Face.
License
MIT License