abdulhade committed on
Commit c2864b8 · verified · 1 Parent(s): 9c63c69

Create README.md

  1. README.md +37 -0
README.md ADDED
@@ -0,0 +1,37 @@
+ # Kurmanji Tokenizer
+
+ This repository contains the Kurmanji Tokenizer, a Byte-Pair Encoding (BPE) tokenizer trained on a 50-million-token text corpus. It was developed specifically for the Kurmanji dialect of Kurdish and is intended for natural language processing tasks in that language.
+
+ ## Model Details
+
+ - **Model Name**: Kurmanji Tokenizer
+ - **Language**: Kurmanji Kurdish (kmr)
+ - **Corpus Size**: 50 million tokens
+ - **Vocabulary Size**: 52,000 tokens
+ - **Tokenizer Type**: Byte-Pair Encoding (BPE)
+
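As a rough illustration of what a BPE tokenizer learns during training, the sketch below repeatedly merges the most frequent adjacent symbol pair in a toy word-frequency table. This is not the code used to train this tokenizer, which the README does not describe. The words, counts, and helper names (`most_frequent_pair`, `merge_pair`, `learn_bpe`) are invented for the example.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, or None if there is none."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merges, starting from single characters."""
    words = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


# Toy counts, invented for illustration (not taken from the real corpus).
toy_counts = {"nav": 5, "navê": 3, "min": 2, "na": 4}
merges, segmented = learn_bpe(toy_counts, num_merges=3)
print(merges)  # [('n', 'a'), ('na', 'v'), ('nav', 'ê')]
```

A production tokenizer applies the same idea at much larger scale, continuing to merge until the target vocabulary size (52,000 here) is reached.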
+ ## Training Data
+
+ The tokenizer was trained on a corpus of 50 million tokens collected from Kurmanji Kurdish sources. The corpus covers a range of text types, from literary and news prose to social media and conversational text, so that the tokenizer is exposed to both formal and informal usage.
+
+ ### Sources of the Corpus
+ - Kurdish literature
+ - News articles
+ - Social media text
+ - Religious texts (e.g., the Quran)
+ - Conversational transcripts
+
+ ## Usage
+
+ Load the tokenizer with the Hugging Face `transformers` library:
+
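+ ```python
+ from transformers import PreTrainedTokenizerFast
+
+ # Load the tokenizer ("your-username/kurmanji_tokenizer" is a placeholder repository ID)
+ tokenizer = PreTrainedTokenizerFast.from_pretrained("your-username/kurmanji_tokenizer")
+
+ # Example usage: encode a sentence ("My name is Ali.") into token IDs
+ text = "Navê min Ali ye."
+ token_ids = tokenizer.encode(text)
+ print(token_ids)
+
+ # Map the IDs back to subword strings to inspect the segmentation
+ print(tokenizer.convert_ids_to_tokens(token_ids))
+ ```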