
Swahili BERT WordPiece Tokenizer

A BERT WordPiece tokenizer specifically trained for the Swahili language. This tokenizer is designed to provide effective tokenization for Swahili text, supporting BERT-based models and other transformer architectures.

Model Details

  • Model type: BERT WordPiece Tokenizer
  • Language: Swahili
  • Vocabulary size: 25,000 tokens
  • Training datasets: publicly available online data plus proprietary data from the 3D & Robotics Lab

Features

  • Specifically optimized for Swahili language patterns
  • Handles common Swahili morphological structures
  • Includes standard BERT special tokens ([CLS], [SEP], [MASK], [PAD], [UNK])
  • Full compatibility with HuggingFace Transformers library

Usage

from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("Benjamin-png/bert-tokenizer-swahili_25000_minfreq_1")

# Example usage
text = "Habari za asubuhi"
encoded = tokenizer(text)
print(encoded.tokens())  # tokens() is a method on the fast tokenizer's BatchEncoding

Training Details

The tokenizer was trained with the following specifications:

  • Vocabulary size: 25,000 tokens
  • Minimum frequency: 1
  • Special tokens: [PAD], [UNK], [CLS], [SEP], [MASK]
  • Clean text: True
  • Handle Chinese characters: False
  • Strip accents: True
  • Lowercase: True

Example Outputs

Input: "Habari za asubuhi"
Tokens: ['[CLS]', 'habari', 'za', 'asubuhi', '[SEP]']

Input: "Ninafurahi kukutana nawe"
Tokens: ['[CLS]', 'ninafurahi', 'kukutana', 'nawe', '[SEP]']

Input: "Karibu Tanzania"
Tokens: ['[CLS]', 'karibu', 'tanzania', '[SEP]']
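The lowercasing visible in the outputs above ("Habari" → "habari", "Tanzania" → "tanzania") comes from the normalizer settings (lowercase=True, strip_accents=True). A minimal sketch of equivalent normalization using only the standard library:

```python
import unicodedata

def normalize(text):
    # Lowercase, then decompose and drop combining accent marks,
    # mirroring the lowercase=True / strip_accents=True settings.
    text = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

print(normalize("Habari za Asubuhi"))  # → habari za asubuhi
```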

Limitations

  • The vocabulary reflects only what was observed in the training datasets listed above
  • Performance may vary for specialized domains or dialects not well-represented in the training data
  • Rare or complex Swahili words might be split into subwords
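The subword splitting noted above follows from WordPiece's greedy longest-match-first lookup: each word is matched against the longest vocabulary entry from the left, and remaining pieces carry the "##" continuation prefix. A minimal pure-Python sketch with a hypothetical toy vocabulary (the real vocabulary has 25,000 entries):

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece segmentation of one word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary for illustration only.
vocab = {"habari", "ninafurahi", "kuku", "##tana", "nina"}

print(wordpiece_split("habari", vocab))    # → ['habari']
print(wordpiece_split("kukutana", vocab))  # → ['kuku', '##tana']
```

A common word like "habari" stays whole, while a word absent from the vocabulary is recovered from shorter pieces, which is exactly the behaviour described in the limitation above.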

Intended Use

This tokenizer is designed for:

  • Pre-processing Swahili text for BERT-based models
  • Natural Language Processing tasks in Swahili
  • Text analysis and processing applications

Citation

If you use this tokenizer in your research, please cite:

@misc{swahili-bert-tokenizer,
  author = {Benjamin-png},
  title = {BERT WordPiece Tokenizer for Swahili},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Benjamin-png/bert-tokenizer-swahili}}
}

Contact

For questions and feedback, please open an issue in the GitHub repository or contact through Hugging Face.

License

MIT License