asosoft/KurmanjiTokenizer-Whisper

Feature Extraction

abdulhade committed on Sep 2, 2024

Commit c2864b8 · verified · 1 Parent(s): 9c63c69

Create README.md

Files changed (1)

README.md +37 -0

	@@ -0,0 +1,37 @@

+# Kurmanji Tokenizer
+This repository contains the Kurmanji Tokenizer, trained on a text corpus of 50 million tokens. It was developed specifically for the Kurmanji dialect of Kurdish and is intended for natural language processing tasks in that language.
+## Model Details
+- **Model Name**: Kurmanji Tokenizer
+- **Language**: Kurmanji Kurdish (kmr)
+- **Corpus Size**: 50 million tokens
+- **Vocabulary Size**: 52,000 tokens
+- **Tokenizer Type**: Byte-Pair Encoding (BPE)
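To make the Byte-Pair Encoding entry above concrete, here is a minimal, character-level sketch of how BPE learns merge rules. It is an illustration only, not the training procedure used for this tokenizer: the function names `merge_pair`, `learn_bpe_merges`, and `apply_merges`, the four-word toy corpus, and the tie-breaking rule are all assumptions made for the example. The real tokenizer was trained on 50 million tokens with a 52,000-entry vocabulary.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy, character-level)."""
    # Each distinct word starts as a tuple of single characters, weighted by frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties go to the lexicographically larger pair
        # (an arbitrary choice made here only to keep the result deterministic).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(merge_pair(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def apply_merges(word, merges):
    """Tokenize `word` by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Hypothetical toy corpus, chosen only for illustration.
toy_words = ["navê", "navê", "nav", "min"]
merges = learn_bpe_merges(toy_words, num_merges=3)
print(merges)                        # [('n', 'a'), ('na', 'v'), ('nav', 'ê')]
print(apply_merges("navê", merges))  # ['navê']
print(apply_merges("min", merges))   # ['m', 'i', 'n']
```

Frequent words such as "navê" collapse into a single token after a few merges, while rarer words stay split into smaller pieces. A vocabulary size like the 52,000 listed above bounds how many distinct tokens, and so how many merges, the trained tokenizer keeps.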
+## Training Data
+The 50-million-token training corpus was collected from a variety of Kurmanji Kurdish sources. It spans several text types so that the tokenizer is exposed to formal, informal, and spoken-style language.
+### Sources of the Corpus
+- Kurdish literature
+- News articles
+- Social media text
+- Religious texts (e.g., the Quran)
+- Conversational transcripts
+## Usage
+Load the tokenizer with the Hugging Face `transformers` library:
+```python
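+from transformers import PreTrainedTokenizerFast
+# Load the tokenizer ("your-username/kurmanji_tokenizer" is a placeholder repository ID)
+tokenizer = PreTrainedTokenizerFast.from_pretrained("your-username/kurmanji_tokenizer")
+# Example usage: encode() returns a list of integer token IDs
+text = "Navê min Ali ye."  # "My name is Ali."
+token_ids = tokenizer.encode(text)
+print(token_ids)
+# Map the IDs back to their subword strings
+print(tokenizer.convert_ids_to_tokens(token_ids))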