Create README.md
README.md
ADDED
@@ -0,0 +1,37 @@
# Kurmanji Tokenizer

This repository contains the Kurmanji Tokenizer, trained on a text corpus of 50 million tokens. It was developed specifically for the Kurmanji dialect of Kurdish, with the goal of accurate and efficient tokenization for natural language processing tasks in this language.

## Model Details

- **Model Name**: Kurmanji Tokenizer
- **Language**: Kurmanji Kurdish (kmr)
- **Corpus Size**: 50 million tokens
- **Vocabulary Size**: 52,000 tokens
- **Tokenizer Type**: Byte-Pair Encoding (BPE)
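
For readers unfamiliar with Byte-Pair Encoding, the following is a minimal pure-Python sketch of the general technique: it repeatedly fuses the most frequent adjacent symbol pair, then applies the learned merges to segment a word. It is not the training script used for this tokenizer. The toy corpus, the function names (`learn_bpe`, `merge_pair`, `segment`), the tie-breaking rule, and the character-level starting alphabet are all assumptions made for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Fuse each left-to-right occurrence of `pair` in `symbols` into one symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each distinct word becomes a tuple of characters, weighted by its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole (weighted) vocabulary.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties go to the lexicographically
        # greatest pair so the result is deterministic (an assumption).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split a word into subwords by applying the merges in the order learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus for illustration only, not the actual training data.
corpus = ["nav", "navê", "navê", "min", "min", "min"]
merges = learn_bpe(corpus, num_merges=4)
print(merges)
print(segment("navê", merges))
```

On this toy corpus, four merges are enough for `navê` to be segmented into `nav` and `ê`. A production tokenizer runs the same loop until the target vocabulary size (52,000 here) is reached, typically with pre-tokenization and special tokens that this sketch leaves out.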

## Training Data

The tokenizer was trained on a corpus of 50 million tokens collected from a variety of Kurmanji Kurdish sources. The data covers a wide range of text types, so that the tokenizer is exposed to diverse linguistic contexts.

### Sources of the Corpus
- Kurdish literature
- News articles
- Social media text
- Religious texts (e.g., the Quran)
- Conversational transcripts

## Usage

You can load this tokenizer with the Hugging Face `transformers` library:

```python
from transformers import PreTrainedTokenizerFast

# Load the tokenizer (replace the placeholder with the actual Hub repository ID)
tokenizer = PreTrainedTokenizerFast.from_pretrained("your-username/kurmanji_tokenizer")

# Example usage: encode a Kurmanji sentence ("My name is Ali.") into token IDs
text = "Navê min Ali ye."
token_ids = tokenizer.encode(text)
print(token_ids)

# Map the IDs back to subword strings to inspect the segmentation
print(tokenizer.convert_ids_to_tokens(token_ids))
```