metadata
pipeline_tag: feature-extraction

Kurmanji Tokenizer

This repository contains a tokenizer for Kurmanji Kurdish, trained on a text corpus of 50 million tokens. It was developed specifically for the Kurmanji dialect of Kurdish and is intended for natural language processing tasks in that language.

Model Details

  • Model Name: Kurmanji Tokenizer
  • Language: Kurmanji Kurdish (kmr)
  • Corpus Size: 50 million tokens
  • Vocabulary Size: 52,000 tokens
  • Tokenizer Type: Byte-Pair Encoding (BPE)
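
For readers unfamiliar with Byte-Pair Encoding, the following is a minimal toy sketch of the merge-learning idea: start from characters and repeatedly merge the most frequent adjacent pair. It is an illustration only, not the procedure used to train this tokenizer; the function names (`most_frequent_pair`, `merge_pair`, `learn_bpe`) and the sample string are hypothetical.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, weighted by word count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-split toy corpus."""
    # Each word starts as a tuple of single characters.
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:  # nothing left to merge
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


if __name__ == "__main__":
    # Hypothetical sample text, far too small to be representative.
    merges, words = learn_bpe("navê min navê min ali ye", num_merges=5)
    print(merges)
    print(words)
```

A real BPE tokenizer repeats this until the target vocabulary size is reached (52,000 here) and typically operates on bytes or normalized text rather than raw characters.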

Training Data

The tokenizer was trained on a corpus of 50 million tokens collected from Kurmanji Kurdish sources. The data covers a range of text types, so the tokenizer is exposed to varied linguistic contexts.

Sources of the Corpus

  • Text crawled from Kurmanji Kurdish websites

Usage

You can load this tokenizer with the Hugging Face transformers library:

from transformers import PreTrainedTokenizerFast

# Load the tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("asosoft/KurmanjiTokenizer-Whisper")

# Example usage: encode() returns a list of integer token IDs
text = "Navê min Ali ye."
token_ids = tokenizer.encode(text)
print(token_ids)
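
One rough way to check how compactly a tokenizer segments text is to count tokens per whitespace-separated word. The helper below is a sketch under that assumption; `tokens_per_word` is a hypothetical name, and it accepts any callable that maps a string to a list of tokens, so it does not depend on a specific library.

```python
def tokens_per_word(encode, texts):
    """Average number of tokens produced per whitespace-separated word.

    `encode` is any callable mapping a string to a sequence of tokens or IDs.
    """
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(encode(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("texts contain no words")
    return total_tokens / total_words
```

With the tokenizer loaded above, this could be called as `tokens_per_word(lambda t: tokenizer.encode(t, add_special_tokens=False), ["Navê min Ali ye."])`. Lower values mean fewer tokens per word; the measure says nothing about downstream model quality.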