
Swahili BERT WordPiece Tokenizer

A BERT WordPiece tokenizer specifically trained for the Swahili language. This tokenizer is designed to provide effective tokenization for Swahili text, supporting BERT-based models and other transformer architectures.

Model Details

  • Model type: BERT WordPiece Tokenizer
  • Language: Swahili
  • Vocabulary size: 25,000 tokens
  • Training datasets: publicly available online data plus proprietary data from the 3D & Robotics Lab

Features

  • Specifically optimized for Swahili language patterns
  • Handles common Swahili morphological structures
  • Includes standard BERT special tokens ([CLS], [SEP], [MASK], [PAD], [UNK])
  • Full compatibility with HuggingFace Transformers library

Usage

from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("Benjamin-png/bert-tokenizer-swahili_25000_minfreq_1")

# Example usage
text = "Habari za asubuhi"
encoded = tokenizer(text)
print(encoded.tokens())  # tokens() is a method on the fast tokenizer's BatchEncoding

Training Details

The tokenizer was trained with the following specifications:

  • Vocabulary size: 25,000 tokens
  • Minimum frequency: 1
  • Special tokens: [PAD], [UNK], [CLS], [SEP], [MASK]
  • Clean text: True
  • Handle Chinese characters: False
  • Strip accents: True
  • Lowercase: True

Example Outputs

Input: "Habari za asubuhi"
Tokens: ['[CLS]', 'habari', 'za', 'asubuhi', '[SEP]']

Input: "Ninafurahi kukutana nawe"
Tokens: ['[CLS]', 'ninafurahi', 'kukutana', 'nawe', '[SEP]']

Input: "Karibu Tanzania"
Tokens: ['[CLS]', 'karibu', 'tanzania', '[SEP]']
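The lowercasing visible in the outputs above ("Habari" → "habari", "Tanzania" → "tanzania") comes from the normalizer settings (lowercase=True, strip_accents=True). A minimal sketch of equivalent normalization using only the standard library:

```python
import unicodedata

def normalize(text):
    # Lowercase, then decompose and drop combining accent marks,
    # mirroring the lowercase=True / strip_accents=True settings.
    text = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

print(normalize("Habari za Asubuhi"))  # → habari za asubuhi
```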

Limitations

  • The vocabulary reflects only what was observed in the training datasets listed above
  • Performance may vary for specialized domains or dialects not well-represented in the training data
  • Rare or complex Swahili words might be split into subwords
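The subword splitting noted above follows from WordPiece's greedy longest-match-first lookup: each word is matched against the longest vocabulary entry from the left, and remaining pieces carry the "##" continuation prefix. A minimal pure-Python sketch with a hypothetical toy vocabulary (the real vocabulary has 25,000 entries):

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece segmentation of one word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary for illustration only.
vocab = {"habari", "ninafurahi", "kuku", "##tana", "nina"}

print(wordpiece_split("habari", vocab))    # → ['habari']
print(wordpiece_split("kukutana", vocab))  # → ['kuku', '##tana']
```

A common word like "habari" stays whole, while a word absent from the vocabulary is recovered from shorter pieces, which is exactly the behaviour described in the limitation above.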

Intended Use

This tokenizer is designed for:

  • Pre-processing Swahili text for BERT-based models
  • Natural Language Processing tasks in Swahili
  • Text analysis and processing applications

Citation

If you use this tokenizer in your research, please cite:

@misc{swahili-bert-tokenizer,
  author = {Benjamin-png},
  title = {BERT WordPiece Tokenizer for Swahili},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Benjamin-png/bert-tokenizer-swahili}}
}

Contact

For questions and feedback, please open an issue in the GitHub repository or contact through Hugging Face.

License

MIT License