---
pipeline_tag: feature-extraction
---
# Kurmanji Tokenizer
This repository contains the Kurmanji Tokenizer, a Byte-Pair Encoding (BPE) tokenizer trained on a 50-million-token text corpus. It was developed specifically for the Kurmanji dialect of Kurdish and is intended for natural language processing tasks in that language.
## Model Details
- **Model Name**: Kurmanji Tokenizer
- **Language**: Kurmanji Kurdish (kmr)
- **Corpus Size**: 50 million tokens
- **Vocabulary Size**: 52,000 tokens
- **Tokenizer Type**: Byte-Pair Encoding (BPE)
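To illustrate what Byte-Pair Encoding does, here is a minimal sketch of BPE merge learning in plain Python. It is not the code used to train this tokenizer: the function name `learn_bpe_merges` and the toy word list are hypothetical and chosen only for illustration.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words (toy sketch)."""
    # Represent each distinct word as a tuple of symbols, with its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties go to the pair counted first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the chosen pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


# Toy input (hypothetical), not taken from the real training corpus.
toy_words = ["nav", "navê", "nave", "min"]
print(learn_bpe_merges(toy_words, 2))
```

In a real BPE setup, the number of learned merges, together with the base alphabet and any special tokens, determines the final vocabulary size, which for this tokenizer is 52,000 tokens.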
## Training Data
The tokenizer was trained on a corpus of 50 million tokens of Kurmanji Kurdish text. The corpus spans a range of text types, so the learned vocabulary is meant to cover diverse linguistic contexts.
### Sources of the Corpus
- Text crawled from Kurmanji Kurdish websites
## Usage
You can load this tokenizer with the Hugging Face `transformers` library:
```python
from transformers import PreTrainedTokenizerFast
# Load the tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("asosoft/KurmanjiTokenizer-Whisper")
# Example usage
text = "Navê min Ali ye."
tokens = tokenizer.encode(text)
print(tokens)
```