---
pipeline_tag: feature-extraction
---
# Kurmanji Tokenizer

This repository contains a Byte-Pair Encoding (BPE) tokenizer for Kurmanji Kurdish, trained on a text corpus of 50 million tokens. It was developed specifically for the Kurmanji dialect of Kurdish and is intended for natural language processing tasks in that language.

## Model Details

- **Model Name**: Kurmanji Tokenizer
- **Language**: Kurmanji Kurdish (kmr)
- **Corpus Size**: 50 million tokens
- **Vocabulary Size**: 52,000 tokens
- **Tokenizer Type**: Byte-Pair Encoding (BPE)
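
The list above names BPE as the tokenizer type. As a rough illustration of how BPE builds a vocabulary, the following is a minimal pure-Python sketch that learns merge rules from a toy word list and applies them to new words. It is not the code or procedure used to train this tokenizer: the function names (`learn_bpe_merges`, `merge_pair`, `apply_merges`), the toy corpus, and the tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Start from single characters; each distinct word is weighted by its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair. Ties go to the lexicographically
        # largest pair (an arbitrary choice that keeps the result deterministic).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair merged.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_merges(word, merges):
    """Segment `word` by applying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    # Hypothetical toy corpus; the real tokenizer was trained on 50 million tokens.
    toy_words = ["nav"] * 3 + ["navê"] * 2 + ["min"]
    toy_merges = learn_bpe_merges(toy_words, num_merges=2)
    print(toy_merges)
    print(apply_merges("navê", toy_merges))
```

A production tokenizer repeats the same merge loop until the target vocabulary size is reached (52,000 here), usually with a dedicated library rather than hand-written code.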

## Training Data

The tokenizer was trained on a corpus of 50 million tokens of Kurmanji Kurdish text. The corpus covers a range of text types so that the tokenizer can handle diverse linguistic contexts.

### Sources of the Corpus
- Text crawled from Kurmanji Kurdish websites

## Usage

Load the tokenizer with the Hugging Face `transformers` library:

```python
from transformers import PreTrainedTokenizerFast

# Load the tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("asosoft/KurmanjiTokenizer-Whisper")

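# Encode a sample sentence ("My name is Ali.") into token IDs
text = "Navê min Ali ye."
token_ids = tokenizer.encode(text)
print(token_ids)

# Inspect the subword pieces and round-trip the IDs back to text
print(tokenizer.convert_ids_to_tokens(token_ids))
print(tokenizer.decode(token_ids))
```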