metadata

pipeline_tag: feature-extraction

Kurmanji Tokenizer

This repository contains the Kurmanji Tokenizer, a Byte-Pair Encoding (BPE) tokenizer trained on a 50-million-token text corpus. It was developed specifically for the Kurmanji dialect of Kurdish and is intended for natural language processing tasks in that language.

Model Details

Model Name: Kurmanji Tokenizer
Language: Kurmanji Kurdish (kmr)
Corpus Size: 50 million tokens
Vocabulary Size: 52,000 tokens
Tokenizer Type: Byte-Pair Encoding (BPE)
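
For readers unfamiliar with BPE, the following is a minimal sketch of how merge rules are learned. It is a toy character-level version and not the procedure used to train this tokenizer: the corpus string, the function names (`count_pairs`, `merge_pair`, `learn_bpe`) and the lack of byte-level handling or end-of-word markers are all assumptions made for illustration.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    # Each word starts as a tuple of single characters.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        # Most frequent pair wins; ties go to the pair seen first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


# Hypothetical toy corpus, not taken from the training data.
merges, segmented = learn_bpe("nav nav nav navê navê min", num_merges=3)
print(merges)
print(segmented)
```

Each learned merge adds one symbol to the vocabulary, so in this scheme the vocabulary size is roughly the number of base symbols plus the number of merges. A 52,000-token vocabulary corresponds to running many more merge steps over a much larger corpus.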

Training Data

The tokenizer was trained on a 50-million-token corpus collected from various sources in Kurmanji Kurdish. The data covers a range of text types, so the tokenizer is exposed to diverse linguistic contexts.

Sources of the Corpus

Text crawled from Kurmanji Kurdish websites

Usage

You can easily use this tokenizer with the Hugging Face transformers library:

from transformers import PreTrainedTokenizerFast

# Load the tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("asosoft/KurmanjiTokenizer-Whisper")

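# Example usage: encode a Kurmanji sentence ("My name is Ali.")
text = "Navê min Ali ye."
# encode() returns a list of integer token IDs, not token strings
token_ids = tokenizer.encode(text)
print(token_ids)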
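
To see how a sentence is split, it can help to look at the subword pieces and the decoded text alongside the IDs. The helper below is a small sketch, not part of this repository: `inspect_tokenization` is a hypothetical name, and it assumes only that the object passed in provides `encode`, `convert_ids_to_tokens` and `decode`, as `PreTrainedTokenizerFast` does.

```python
def inspect_tokenization(tokenizer, text):
    """Return IDs, subword pieces, decoded text and tokens per word for `text`."""
    ids = tokenizer.encode(text)
    # Map each ID back to its vocabulary entry to see the actual split.
    tokens = tokenizer.convert_ids_to_tokens(ids)
    # Decode the IDs to check how closely the text survives a round trip.
    round_trip = tokenizer.decode(ids, skip_special_tokens=True)
    words = text.split()
    # Average tokens per whitespace-separated word; this count includes
    # any special tokens that encode() adds by default.
    tokens_per_word = len(ids) / len(words) if words else 0.0
    return {
        "ids": ids,
        "tokens": tokens,
        "round_trip": round_trip,
        "tokens_per_word": tokens_per_word,
    }
```

Calling `inspect_tokenization(tokenizer, "Navê min Ali ye.")` with the tokenizer loaded above returns a dictionary that can be printed or compared across sentences. A lower tokens-per-word value generally means the vocabulary covers the text with fewer splits.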