--- pipeline_tag: feature-extraction --- # Kurmanji Tokenizer This repository contains the Kurmanji Tokenizer trained on a 50 million token text corpus. The tokenizer was specifically developed to support the Kurmanji dialect of Kurdish, ensuring accurate and efficient tokenization for natural language processing tasks in this language. ## Model Details - **Model Name**: Kurmanji Tokenizer - **Language**: Kurmanji Kurdish (kmr) - **Corpus Size**: 50 million tokens - **Vocabulary Size**: 52,000 tokens - **Tokenizer Type**: Byte-Pair Encoding (BPE) ## Training Data The tokenizer was trained on a corpus of 50 million tokens collected from various sources in Kurmanji Kurdish. The data includes a wide range of text types, ensuring the tokenizer can handle diverse linguistic contexts. ### Sources of the Corpus - Kurdish Kurmanji website crawling ## Usage You can easily use this tokenizer with the Hugging Face `transformers` library: ```python from transformers import PreTrainedTokenizerFast # Load the tokenizer tokenizer = PreTrainedTokenizerFast.from_pretrained("asosoft/KurmanjiTokenizer-Whisper") # Example usage text = "NavĂȘ min Ali ye." tokens = tokenizer.encode(text) print(tokens)