---
pipeline_tag: feature-extraction
---
# Kurmanji Tokenizer
This repository contains the Kurmanji Tokenizer, trained on a text corpus of 50 million tokens. It was developed specifically for the Kurmanji dialect of Kurdish and is intended for natural language processing tasks in that language.
## Model Details
- **Model Name**: Kurmanji Tokenizer
- **Language**: Kurmanji Kurdish (kmr)
- **Corpus Size**: 50 million tokens
- **Vocabulary Size**: 52,000 tokens
- **Tokenizer Type**: Byte-Pair Encoding (BPE)
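
As a rough illustration of what Byte-Pair Encoding does, the sketch below learns merges on a tiny made-up corpus by repeatedly fusing the most frequent adjacent symbol pair. It is a toy, not the training code used for this tokenizer: the helper names (`most_frequent_pair`, `merge_pair`, `learn_bpe`) and the three-word corpus are hypothetical, and a production BPE trainer also handles pre-tokenization, special tokens, and a much larger merge budget.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, weighted by word count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-separated corpus."""
    # Start from single characters; each word is a tuple of symbols.
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


# Hypothetical toy corpus, chosen only so that "nav" is a shared prefix.
toy_corpus = "nav navê navên min min"
merges, segmented = learn_bpe(toy_corpus, num_merges=2)
print(merges)
print(segmented)
```

After two merges the shared prefix `nav` has become a single symbol, while `min` is still split into characters. The real tokenizer applies the same idea until its vocabulary reaches 52,000 entries.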
## Training Data
The tokenizer was trained on a corpus of 50 million tokens of Kurmanji Kurdish text. The corpus spans a range of text types, which is intended to help the tokenizer cope with varied linguistic contexts.
### Sources of the Corpus
- Text crawled from Kurmanji Kurdish websites
## Usage
You can easily use this tokenizer with the Hugging Face `transformers` library:
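```python
from transformers import PreTrainedTokenizerFast

# Load the tokenizer from the Hugging Face Hub (requires network access on first use)
tokenizer = PreTrainedTokenizerFast.from_pretrained("asosoft/KurmanjiTokenizer-Whisper")

# Example usage: encode a sentence ("My name is Ali.") into token IDs
text = "Navê min Ali ye."
token_ids = tokenizer.encode(text)
print(token_ids)

# Map the IDs back to their subword strings to inspect the segmentation
print(tokenizer.convert_ids_to_tokens(token_ids))
```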