---
language: kn
tags:
- kannada
- tokenizer
- bpe
- nlp
- huggingface
license: mit
datasets:
- Cognitive-Lab/Kannada-Instruct-dataset
pipeline_tag: text-generation
---
# Kannada Tokenizer
[![Hugging Face](https://img.shields.io/badge/HuggingFace-Model%20Card-orange)](https://huggingface.co/charanhu/kannada-tokenizer)
This is a Byte-Pair Encoding (BPE) tokenizer trained specifically for the Kannada language using the `translated_output` column from the [Cognitive-Lab/Kannada-Instruct-dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset). It is suitable for various Natural Language Processing (NLP) tasks involving Kannada text.
## Model Description
- **Model Type:** Byte-Pair Encoding (BPE) Tokenizer
- **Language:** Kannada (`kn`)
- **Vocabulary Size:** 32,000
- **Special Tokens:**
  - `[UNK]` (Unknown token)
  - `[PAD]` (Padding token)
  - `[CLS]` (Classifier token)
  - `[SEP]` (Separator token)
  - `[MASK]` (Masking token)
- **License:** MIT License
- **Dataset Used:** [Cognitive-Lab/Kannada-Instruct-dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset)
- **Algorithm:** Byte-Pair Encoding (BPE)
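The configuration above can be verified directly after loading the tokenizer from the Hub; a quick sanity check (the ordering of the special-token list may differ from what is shown in the comment):
```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("charanhu/kannada-tokenizer")

# Should match the 32,000 vocabulary size listed above.
print(tokenizer.vocab_size)

# Special tokens registered on the tokenizer.
print(tokenizer.all_special_tokens)  # e.g. ['[UNK]', '[PAD]', '[CLS]', '[SEP]', '[MASK]']
```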
## Intended Use
This tokenizer is intended for NLP applications involving the Kannada language, such as:
- **Language Modeling**
- **Text Generation**
- **Text Classification**
- **Machine Translation**
- **Named Entity Recognition**
- **Question Answering**
- **Summarization**
## How to Use
You can load the tokenizer directly from the Hugging Face Hub:
```python
from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast.from_pretrained("charanhu/kannada-tokenizer")
# Example usage
text = "ನೀವು ಹೇಗಿದ್ದೀರಿ?"
encoding = tokenizer.encode(text)                   # text -> list of token IDs
tokens = tokenizer.convert_ids_to_tokens(encoding)  # IDs -> token strings
decoded_text = tokenizer.decode(encoding)           # IDs -> text
print("Original Text:", text)
print("Tokens:", tokens)
print("Decoded Text:", decoded_text)
```
**Output:**
```
Original Text: ನೀವು ಹೇಗಿದ್ದೀರಿ?
Tokens: ['ನೀವು', 'ಹೇಗಿದ್ದೀರಿ', '?']
Decoded Text: ನೀವು ಹೇಗಿದ್ದೀರಿ?
```
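Beyond single sentences, calling the tokenizer directly handles batches, padding, and attention masks. A minimal sketch (assumes PyTorch is installed; registering `[PAD]` as the pad token is only needed if the saved tokenizer config does not already set one):
```python
# Batch encoding with padding (continues from the example above).
batch = ["ನೀವು ಹೇಗಿದ್ದೀರಿ?", "ಧನ್ಯವಾದಗಳು"]

# Assumption: register '[PAD]' as the pad token if it is not set.
if tokenizer.pad_token is None:
    tokenizer.pad_token = "[PAD]"

encoded = tokenizer(batch, padding=True, return_tensors="pt")
print(encoded["input_ids"].shape)  # (batch_size, longest_sequence)
print(encoded["attention_mask"])   # 1 = real token, 0 = padding
```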
## Training Data
The tokenizer was trained on the `translated_output` column from the [Cognitive-Lab/Kannada-Instruct-dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset). This dataset contains translated instructions and responses in Kannada, providing a rich corpus for effective tokenization.
- **Dataset Size:** The dataset includes a significant number of entries covering a wide range of topics and linguistic structures in Kannada.
- **Data Preprocessing:** Text normalization was applied using NFKC normalization to standardize characters.
## Training Procedure
- **Normalization:** NFKC normalization was used to handle canonical decomposition and compatibility decomposition, ensuring that characters are represented consistently.
- **Pre-tokenization:** The text was pre-tokenized using whitespace splitting.
- **Tokenizer Algorithm:** Byte-Pair Encoding (BPE) was chosen for its effectiveness in handling subword units, which is beneficial for languages with rich morphology like Kannada.
- **Vocabulary Size:** Set to 32,000 to balance between coverage and efficiency.
- **Special Tokens:** Included `[UNK]`, `[PAD]`, `[CLS]`, `[SEP]`, `[MASK]` to support various downstream tasks.
- **Training Library:** The tokenizer was built using the [Hugging Face Tokenizers](https://github.com/huggingface/tokenizers) library.
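Putting the steps above together, a minimal sketch of how a tokenizer with this configuration could be reproduced with the `tokenizers` library. This is a reconstruction, not the original training script; the `train` split name, the batching scheme, and the exact whitespace pre-tokenizer variant are assumptions:
```python
from datasets import load_dataset
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# Load the training corpus (assumes a "train" split with the
# `translated_output` column described above).
dataset = load_dataset("Cognitive-Lab/Kannada-Instruct-dataset", split="train")

def text_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["translated_output"]

# BPE model with NFKC normalization and whitespace pre-tokenization.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.NFKC()
tokenizer.pre_tokenizer = pre_tokenizers.WhitespaceSplit()

trainer = trainers.BpeTrainer(
    vocab_size=32000,
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train_from_iterator(text_iterator(), trainer=trainer)
tokenizer.save("kannada-tokenizer.json")
```
The trained tokenizer can then be wrapped with `PreTrainedTokenizerFast(tokenizer_object=tokenizer)` so it loads through `transformers` as shown in the usage section.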
## Evaluation
The tokenizer was qualitatively evaluated on a set of Kannada sentences to ensure reasonable tokenization. However, quantitative metrics such as subword fertility, compression ratio, or downstream task performance were not computed.
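Users who need numbers can compute simple proxies on their own corpora. For example, a hedged sketch of subword fertility (average tokens per whitespace-separated word), reusing the `tokenizer` loaded in the usage section:
```python
def fertility(tokenizer, sentences):
    """Average number of tokens produced per whitespace-separated word."""
    n_tokens = sum(len(tokenizer.tokenize(s)) for s in sentences)
    n_words = sum(len(s.split()) for s in sentences)
    return n_tokens / max(n_words, 1)

sample = ["ನೀವು ಹೇಗಿದ್ದೀರಿ?", "ಧನ್ಯವಾದಗಳು"]
print(fertility(tokenizer, sample))  # lower generally means more efficient tokenization
```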
## Limitations
- **Vocabulary Coverage:** While the tokenizer is trained on a diverse dataset, it may not include all possible words or phrases in Kannada, especially rare or domain-specific terms.
- **Biases:** The tokenizer inherits any biases present in the training data. Users should be cautious when applying it to sensitive or critical applications.
- **Out-of-Vocabulary Words:** Out-of-vocabulary words may be broken into subword tokens or mapped to the `[UNK]` token, which can hurt performance in downstream tasks; the snippet below shows a quick way to check how a given word is handled.
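A minimal sketch of such a check (the example word is an arbitrary technical loanword; reuses the `tokenizer` loaded in the usage section):
```python
word = "ಕ್ವಾಂಟಮ್"  # arbitrary example: a technical loanword ("quantum")
pieces = tokenizer.tokenize(word)
print(pieces)
# Many small pieces, or '[UNK]', suggest weak coverage for this term.
```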
## Ethical Considerations
- **Data Privacy:** The dataset used is publicly available, and care was taken to ensure that no personal or sensitive information is included.
- **Bias Mitigation:** No specific bias mitigation techniques were applied. Users should be aware of potential biases in the tokenizer due to the training data.
## Recommendations
- **Fine-tuning:** For best results in specific applications, consider fine-tuning language models with this tokenizer on domain-specific data.
- **Evaluation:** Users should evaluate the tokenizer in their specific context to ensure it meets their requirements.
## Acknowledgments
- **Dataset:** Thanks to [Cognitive-Lab](https://huggingface.co/Cognitive-Lab) for providing the [Kannada-Instruct-dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset).
- **Libraries:**
- [Hugging Face Tokenizers](https://github.com/huggingface/tokenizers)
- [Hugging Face Transformers](https://github.com/huggingface/transformers)
## License
This tokenizer is released under the [MIT License](LICENSE).
## Citation
If you use this tokenizer in your research or applications, please consider citing it:
```bibtex
@misc{kannada_tokenizer_2023,
  title        = {Kannada Tokenizer},
  author       = {charanhu},
  year         = {2023},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/charanhu/kannada-tokenizer}},
}
```