---
pipeline_tag: feature-extraction
---

# Kurmanji Tokenizer

This repository contains the Kurmanji Tokenizer, trained on a text corpus of 50 million tokens. It was developed specifically for the Kurmanji dialect of Kurdish and is intended for tokenizing Kurmanji text in natural language processing tasks.

## Model Details

- **Model Name**: Kurmanji Tokenizer
- **Language**: Kurmanji Kurdish (kmr)
- **Corpus Size**: 50 million tokens
- **Vocabulary Size**: 52,000 tokens
- **Tokenizer Type**: Byte-Pair Encoding (BPE)

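
For readers unfamiliar with Byte-Pair Encoding, the following is a minimal, character-level sketch of how BPE learns merge rules and then applies them to segment a word. It is not the code used to train this tokenizer, which may well operate on bytes rather than characters; the function names (`learn_bpe_merges`, `merge_pair`, `apply_bpe`) and the toy word list are hypothetical and chosen only for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each non-overlapping occurrence of `pair` with the fused symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Start with every word split into single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties go to the smallest pair for determinism.
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def apply_bpe(word, merges):
    """Segment `word` by applying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy word list (illustrative only, not the actual training corpus).
toy_words = ["nav", "navê", "navên", "min", "mina"]
merges = learn_bpe_merges(toy_words, num_merges=3)
print(merges)
print(apply_bpe("navê", merges))
```

On this toy list the first learned merge fuses `n` and `a`, because that pair is the most frequent. A real tokenizer repeats the same loop until the vocabulary reaches its target size, which for this tokenizer is 52,000 entries.
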
## Training Data

The tokenizer was trained on a corpus of 50 million tokens of Kurmanji Kurdish text collected from various sources. The data covers a range of text types, which is intended to help the tokenizer handle diverse linguistic contexts.

### Sources of the Corpus

- Text crawled from Kurmanji Kurdish websites

## Usage

You can load and use this tokenizer with the Hugging Face `transformers` library:

```python
from transformers import PreTrainedTokenizerFast

# Load the tokenizer from the Hugging Face Hub
tokenizer = PreTrainedTokenizerFast.from_pretrained("asosoft/KurmanjiTokenizer-Whisper")

# Example usage: encode() returns a list of integer token IDs
text = "Navê min Ali ye."
token_ids = tokenizer.encode(text)
print(token_ids)