---
language: kn
tags:
- kannada
- tokenizer
- bpe
- nlp
- huggingface
license: mit
datasets:
- Cognitive-Lab/Kannada-Instruct-dataset
pipeline_tag: text-generation
---
# Kannada Tokenizer
[![Hugging Face](https://img.shields.io/badge/HuggingFace-Model%20Card-orange)](https://huggingface.co/charanhu/kannada-tokenizer)
This is a Byte-Pair Encoding (BPE) tokenizer trained specifically for the Kannada language using the `translated_output` column from the [Cognitive-Lab/Kannada-Instruct-dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset). It is suitable for various Natural Language Processing (NLP) tasks involving Kannada text.
## Model Description
- **Model Type:** Byte-Pair Encoding (BPE) Tokenizer
- **Language:** Kannada (`kn`)
- **Vocabulary Size:** 32,000
- **Special Tokens:**
  - `[UNK]` (Unknown token)
  - `[PAD]` (Padding token)
  - `[CLS]` (Classifier token)
  - `[SEP]` (Separator token)
  - `[MASK]` (Masking token)
- **License:** MIT License
- **Dataset Used:** [Cognitive-Lab/Kannada-Instruct-dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset)
- **Algorithm:** Byte-Pair Encoding (BPE)
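The configuration above can be verified directly after loading the tokenizer from the Hub; a quick sanity check (the ordering of the special-token list may differ from what is shown in the comment):
```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("charanhu/kannada-tokenizer")

# Should match the 32,000 vocabulary size listed above.
print(tokenizer.vocab_size)

# Special tokens registered on the tokenizer.
print(tokenizer.all_special_tokens)  # e.g. ['[UNK]', '[PAD]', '[CLS]', '[SEP]', '[MASK]']
```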
## Intended Use
This tokenizer is intended for NLP applications involving the Kannada language, such as:
- **Language Modeling**
- **Text Generation**
- **Text Classification**
- **Machine Translation**
- **Named Entity Recognition**
- **Question Answering**
- **Summarization**
## How to Use
You can load the tokenizer directly from the Hugging Face Hub:
```python
from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast.from_pretrained("charanhu/kannada-tokenizer")
# Example usage
text = "ನೀವು ಹೇಗಿದ್ದೀರಿ?"
encoding = tokenizer.encode(text)                   # text -> list of token IDs
tokens = tokenizer.convert_ids_to_tokens(encoding)  # IDs -> token strings
decoded_text = tokenizer.decode(encoding)           # IDs -> text
print("Original Text:", text)
print("Tokens:", tokens)
print("Decoded Text:", decoded_text)
```
**Output:**
```
Original Text: ನೀವು ಹೇಗಿದ್ದೀರಿ?
Tokens: ['ನೀವು', 'ಹೇಗಿದ್ದೀರಿ', '?']
Decoded Text: ನೀವು ಹೇಗಿದ್ದೀರಿ?
```
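Beyond single sentences, calling the tokenizer directly handles batches, padding, and attention masks. A minimal sketch (assumes PyTorch is installed; registering `[PAD]` as the pad token is only needed if the saved tokenizer config does not already set one):
```python
# Batch encoding with padding (continues from the example above).
batch = ["ನೀವು ಹೇಗಿದ್ದೀರಿ?", "ಧನ್ಯವಾದಗಳು"]

# Assumption: register '[PAD]' as the pad token if it is not set.
if tokenizer.pad_token is None:
    tokenizer.pad_token = "[PAD]"

encoded = tokenizer(batch, padding=True, return_tensors="pt")
print(encoded["input_ids"].shape)  # (batch_size, longest_sequence)
print(encoded["attention_mask"])   # 1 = real token, 0 = padding
```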
## Training Data
The tokenizer was trained on the `translated_output` column from the [Cognitive-Lab/Kannada-Instruct-dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset). This dataset contains translated instructions and responses in Kannada, providing a rich corpus for effective tokenization.
- **Dataset Size:** The dataset includes a significant number of entries covering a wide range of topics and linguistic structures in Kannada.
- **Data Preprocessing:** Text normalization was applied using NFKC normalization to standardize characters.
## Training Procedure
- **Normalization:** NFKC normalization was used to handle canonical decomposition and compatibility decomposition, ensuring that characters are represented consistently.
- **Pre-tokenization:** The text was pre-tokenized using whitespace splitting.
- **Tokenizer Algorithm:** Byte-Pair Encoding (BPE) was chosen for its effectiveness in handling subword units, which is beneficial for languages with rich morphology like Kannada.
- **Vocabulary Size:** Set to 32,000 to balance between coverage and efficiency.
- **Special Tokens:** Included `[UNK]`, `[PAD]`, `[CLS]`, `[SEP]`, `[MASK]` to support various downstream tasks.
- **Training Library:** The tokenizer was built using the [Hugging Face Tokenizers](https://github.com/huggingface/tokenizers) library.
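Putting the steps above together, a minimal sketch of how a tokenizer with this configuration could be reproduced with the `tokenizers` library. This is a reconstruction, not the original training script; the `train` split name, the batching scheme, and the exact whitespace pre-tokenizer variant are assumptions:
```python
from datasets import load_dataset
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# Load the training corpus (assumes a "train" split with the
# `translated_output` column described above).
dataset = load_dataset("Cognitive-Lab/Kannada-Instruct-dataset", split="train")

def text_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["translated_output"]

# BPE model with NFKC normalization and whitespace pre-tokenization.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.NFKC()
tokenizer.pre_tokenizer = pre_tokenizers.WhitespaceSplit()

trainer = trainers.BpeTrainer(
    vocab_size=32000,
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train_from_iterator(text_iterator(), trainer=trainer)
tokenizer.save("kannada-tokenizer.json")
```
The trained tokenizer can then be wrapped with `PreTrainedTokenizerFast(tokenizer_object=tokenizer)` so it loads through `transformers` as shown in the usage section.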
## Evaluation
The tokenizer was qualitatively evaluated on a set of Kannada sentences to ensure reasonable tokenization. However, quantitative metrics such as subword fertility, compression ratio, or downstream task performance were not computed.
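Users who need numbers can compute simple proxies on their own corpora. For example, a hedged sketch of subword fertility (average tokens per whitespace-separated word), reusing the `tokenizer` loaded in the usage section:
```python
def fertility(tokenizer, sentences):
    """Average number of tokens produced per whitespace-separated word."""
    n_tokens = sum(len(tokenizer.tokenize(s)) for s in sentences)
    n_words = sum(len(s.split()) for s in sentences)
    return n_tokens / max(n_words, 1)

sample = ["ನೀವು ಹೇಗಿದ್ದೀರಿ?", "ಧನ್ಯವಾದಗಳು"]
print(fertility(tokenizer, sample))  # lower generally means more efficient tokenization
```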
## Limitations
- **Vocabulary Coverage:** While the tokenizer is trained on a diverse dataset, it may not include all possible words or phrases in Kannada, especially rare or domain-specific terms.
- **Biases:** The tokenizer inherits any biases present in the training data. Users should be cautious when applying it to sensitive or critical applications.
- **Out-of-Vocabulary Words:** Out-of-vocabulary words may be broken into subword tokens or mapped to the `[UNK]` token, which can hurt performance in downstream tasks; the snippet below shows a quick way to check how a given word is handled.
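A minimal sketch of such a check (the example word is an arbitrary technical loanword; reuses the `tokenizer` loaded in the usage section):
```python
word = "ಕ್ವಾಂಟಮ್"  # arbitrary example: a technical loanword ("quantum")
pieces = tokenizer.tokenize(word)
print(pieces)
# Many small pieces, or '[UNK]', suggest weak coverage for this term.
```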
## Ethical Considerations
- **Data Privacy:** The dataset used is publicly available, and care was taken to ensure that no personal or sensitive information is included.
- **Bias Mitigation:** No specific bias mitigation techniques were applied. Users should be aware of potential biases in the tokenizer due to the training data.
## Recommendations
- **Fine-tuning:** For best results in specific applications, consider fine-tuning language models with this tokenizer on domain-specific data.
- **Evaluation:** Users should evaluate the tokenizer in their specific context to ensure it meets their requirements.
## Acknowledgments
- **Dataset:** Thanks to [Cognitive-Lab](https://huggingface.co/Cognitive-Lab) for providing the [Kannada-Instruct-dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset).
- **Libraries:**
- [Hugging Face Tokenizers](https://github.com/huggingface/tokenizers)
- [Hugging Face Transformers](https://github.com/huggingface/transformers)
## License
This tokenizer is released under the [MIT License](LICENSE).
## Citation
If you use this tokenizer in your research or applications, please consider citing it:
```bibtex
@misc{kannada_tokenizer_2023,
  title        = {Kannada Tokenizer},
  author       = {charanhu},
  year         = {2023},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/charanhu/kannada-tokenizer}},
}
```