---
pipeline_tag: feature-extraction
---
# Kurmanji Tokenizer

This repository contains the Kurmanji Tokenizer, trained on a text corpus of 50 million tokens. It was developed specifically for the Kurmanji dialect of Kurdish, so that text in this language is tokenized accurately and efficiently for natural language processing tasks.

## Model Details

- **Model Name**: Kurmanji Tokenizer
- **Language**: Kurmanji Kurdish (kmr)
- **Corpus Size**: 50 million tokens
- **Vocabulary Size**: 52,000 tokens
- **Tokenizer Type**: Byte-Pair Encoding (BPE)
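
For readers unfamiliar with Byte-Pair Encoding, the following is a minimal character-level sketch of how BPE learns merges. It is not the training code used for this tokenizer: the function names (`most_frequent_pair`, `merge_pair`, `learn_bpe`) and the toy word counts are made up for illustration, and a production tokenizer would typically use a dedicated library and may operate on bytes rather than characters.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, weighted by word count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merges from a {word: count} mapping."""
    words = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


# Hypothetical toy counts, invented for this example only.
toy_counts = {"mal": 5, "malan": 3, "min": 4}
merges, segmented = learn_bpe(toy_counts, num_merges=2)
print(merges)
print(segmented)
```

Each merge adds one entry to the vocabulary, so a vocabulary of 52,000 tokens corresponds to repeating this loop until the base symbols plus learned merges reach that size.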

## Training Data

The tokenizer was trained on a corpus of 50 million tokens of Kurmanji Kurdish text. The data covers a wide range of text types so that the tokenizer can handle diverse linguistic contexts.

### Sources of the Corpus
- Text crawled from Kurmanji Kurdish websites

## Usage

You can load this tokenizer with the Hugging Face `transformers` library:

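```python
from transformers import PreTrainedTokenizerFast

# Load the tokenizer from the Hugging Face Hub
tokenizer = PreTrainedTokenizerFast.from_pretrained("asosoft/KurmanjiTokenizer-Whisper")

# Example usage: encode() returns a list of integer token IDs
text = "Navê min Ali ye."
token_ids = tokenizer.encode(text)
print(token_ids)

# Map the IDs back to their token strings for inspection
print(tokenizer.convert_ids_to_tokens(token_ids))
```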