	---
	pipeline_tag: feature-extraction
	---
	# Kurmanji Tokenizer

	This repository contains a tokenizer for Kurmanji Kurdish, trained on a text corpus of 50 million tokens. It was built specifically for the Kurmanji dialect of Kurdish to give accurate and efficient tokenization in natural language processing tasks for this language.

	## Model Details

	- Model Name: Kurmanji Tokenizer
	- Language: Kurmanji Kurdish (kmr)
	- Corpus Size: 50 million tokens
	- Vocabulary Size: 52,000 tokens
	- Tokenizer Type: Byte-Pair Encoding (BPE)
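
To illustrate what the BPE tokenizer type means in practice, here is a minimal, self-contained sketch of how BPE merge rules can be learned and applied. It is not the code used to train this tokenizer: the function names (`merge_pair`, `learn_bpe_merges`, `apply_merges`) and the toy word list are hypothetical, and the sketch works on characters with only a handful of merges, whereas the real vocabulary has 52,000 entries.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy, character-level)."""
    # Each distinct word starts as a tuple of single characters, with its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties go to the pair seen first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the vocabulary with the chosen pair merged.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(merge_pair(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def apply_merges(word, merges):
    """Segment a word by replaying the learned merges in the order they were learned."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Toy corpus for illustration only (not the training data of this tokenizer).
toy_words = ["navê", "navê", "min", "ye"]
toy_merges = learn_bpe_merges(toy_words, num_merges=3)
print(toy_merges)
print(apply_merges("navê", toy_merges))
print(apply_merges("min", toy_merges))
```

Frequent character sequences end up as single tokens, while rarer words stay split into smaller pieces. A larger vocabulary simply corresponds to more learned merges.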

	## Training Data

	The tokenizer was trained on a corpus of 50 million tokens of Kurmanji Kurdish text collected from a variety of sources. The corpus covers a wide range of text types so that the tokenizer can handle diverse linguistic contexts.

	### Sources of the Corpus
	- Text crawled from Kurmanji Kurdish websites

	## Usage

	Load the tokenizer with the Hugging Face `transformers` library:

	```python
	from transformers import PreTrainedTokenizerFast

	# Load the tokenizer
	tokenizer = PreTrainedTokenizerFast.from_pretrained("asosoft/KurmanjiTokenizer-Whisper")

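	# Example usage: encode() returns a list of integer token IDs
	text = "Navê min Ali ye."
	token_ids = tokenizer.encode(text)
	print(token_ids)

	# Inspect the subword pieces behind those IDs
	print(tokenizer.convert_ids_to_tokens(token_ids))
	```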