---
language: kn
tags:
- kannada
- tokenizer
- bpe
- nlp
- huggingface
license: mit
datasets:
- Cognitive-Lab/Kannada-Instruct-dataset
pipeline_tag: text-generation
---

# Kannada Tokenizer

[![Hugging Face](https://img.shields.io/badge/HuggingFace-Model%20Card-orange)](https://huggingface.co/charanhu/kannada-tokenizer)

This is a Byte-Pair Encoding (BPE) tokenizer trained specifically for the Kannada language using the `translated_output` column from the [Cognitive-Lab/Kannada-Instruct-dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset). It is suitable for various Natural Language Processing (NLP) tasks involving Kannada text.

## Model Description

- **Model Type:** Byte-Pair Encoding (BPE) Tokenizer
- **Language:** Kannada (`kn`)
- **Vocabulary Size:** 32,000
- **Special Tokens:**
  - `[UNK]` (Unknown token)
  - `[PAD]` (Padding token)
  - `[CLS]` (Classifier token)
  - `[SEP]` (Separator token)
  - `[MASK]` (Masking token)
- **License:** MIT License
- **Dataset Used:** [Cognitive-Lab/Kannada-Instruct-dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset)
- **Algorithm:** Byte-Pair Encoding (BPE); see the snippet after this list to verify these properties programmatically.

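A minimal check once the tokenizer is loaded (the printed values should match the list above):

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("charanhu/kannada-tokenizer")

# Base vocabulary size and registered special tokens, as listed above.
print(tokenizer.vocab_size)          # expected: 32000
print(tokenizer.special_tokens_map)  # expected: {'unk_token': '[UNK]', 'pad_token': '[PAD]', ...}
```
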
## Intended Use

This tokenizer is intended for NLP applications involving the Kannada language, such as:

- **Language Modeling**
- **Text Generation**
- **Text Classification**
- **Machine Translation**
- **Named Entity Recognition**
- **Question Answering**
- **Summarization**

## How to Use

You can load the tokenizer directly from the Hugging Face Hub:

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("charanhu/kannada-tokenizer")

# Example usage
text = "ನೀವು ಹೇಗಿದ್ದೀರಿ?"
encoding = tokenizer.encode(text)
tokens = tokenizer.convert_ids_to_tokens(encoding)
decoded_text = tokenizer.decode(encoding)

print("Original Text:", text)
print("Tokens:", tokens)
print("Decoded Text:", decoded_text)
```

**Output:**

```
Original Text: ನೀವು ಹೇಗಿದ್ದೀರಿ?
Tokens: ['ನೀವು', 'ಹೇಗಿದ್ದೀರಿ', '?']
Decoded Text: ನೀವು ಹೇಗಿದ್ದೀರಿ?
```

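The tokenizer can also be called directly on a batch; padding works because the `[PAD]` token is registered. A short sketch (the sentences are illustrative):

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("charanhu/kannada-tokenizer")

# Batch-encode sentences of different lengths; shorter sequences are
# padded with [PAD], and the attention mask marks the real tokens.
batch = tokenizer(
    ["ನೀವು ಹೇಗಿದ್ದೀರಿ?", "ಧನ್ಯವಾದಗಳು"],
    padding=True,
)
print(batch["input_ids"])
print(batch["attention_mask"])  # 0s correspond to [PAD] positions
```
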
## Training Data

The tokenizer was trained on the `translated_output` column from the [Cognitive-Lab/Kannada-Instruct-dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset). This dataset contains translated instructions and responses in Kannada, providing a rich corpus for effective tokenization.

- **Dataset Coverage:** The dataset spans a wide range of topics and linguistic structures in Kannada; see the dataset card for exact entry counts.
- **Data Preprocessing:** NFKC normalization was applied to standardize character representations (illustrated below).

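As a concrete illustration of what NFKC normalization does, using Python's standard `unicodedata` module (this snippet is illustrative and not part of the tokenizer's API):

```python
import unicodedata

# NFKC folds compatibility variants into canonical characters: here a
# fullwidth question mark (U+FF1F) becomes the ASCII '?' (U+003F), so
# both spellings tokenize identically after normalization.
raw = "ನೀವು ಹೇಗಿದ್ದೀರಿ？"  # ends with a fullwidth '？'
normalized = unicodedata.normalize("NFKC", raw)
print(normalized)                # ನೀವು ಹೇಗಿದ್ದೀರಿ?
print(normalized.endswith("?"))  # True
```
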
## Training Procedure

- **Normalization:** NFKC normalization was used to handle canonical and compatibility decomposition, ensuring that characters are represented consistently.
- **Pre-tokenization:** The text was pre-tokenized by splitting on whitespace.
- **Tokenizer Algorithm:** Byte-Pair Encoding (BPE) was chosen for its effectiveness with subword units, which benefits morphologically rich languages like Kannada.
- **Vocabulary Size:** Set to 32,000 to balance coverage against efficiency.
- **Special Tokens:** `[UNK]`, `[PAD]`, `[CLS]`, `[SEP]`, `[MASK]` were included to support various downstream tasks.
- **Training Library:** The tokenizer was built with the [Hugging Face Tokenizers](https://github.com/huggingface/tokenizers) library; a training sketch follows below.

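A minimal reconstruction of this setup with the Tokenizers library is sketched below. It is not the actual training script: the corpus file name is a placeholder, and `pre_tokenizers.Whitespace()` (which also splits off punctuation, consistent with the example output above) is assumed for the pre-tokenization step.

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

# BPE model with NFKC normalization and whitespace/punctuation splitting.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.NFKC()
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Train a 32,000-token vocabulary with the special tokens listed above.
# "kannada_corpus.txt" is a placeholder for text extracted from the
# dataset's translated_output column.
trainer = trainers.BpeTrainer(
    vocab_size=32000,
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(files=["kannada_corpus.txt"], trainer=trainer)

# Wrap for use with transformers, registering the special tokens.
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]", pad_token="[PAD]",
    cls_token="[CLS]", sep_token="[SEP]", mask_token="[MASK]",
)
fast_tokenizer.save_pretrained("kannada-tokenizer")
```
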
## Evaluation

The tokenizer was qualitatively evaluated on a set of Kannada sentences to confirm that it produces reasonable segmentations. Quantitative metrics, such as fertility (average subword tokens per word) or downstream task performance, were not computed; a sketch of a simple fertility check is shown below.

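If you need a rough quantitative signal on your own data, fertility is easy to measure; the sentences below are placeholders for your corpus:

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("charanhu/kannada-tokenizer")

# Fertility: subword tokens per whitespace-separated word. Lower values
# mean shorter sequences downstream; replace the samples with real data.
sentences = ["ನೀವು ಹೇಗಿದ್ದೀರಿ?", "ಧನ್ಯವಾದಗಳು"]
n_tokens = sum(len(tokenizer.tokenize(s)) for s in sentences)
n_words = sum(len(s.split()) for s in sentences)
print(f"Fertility: {n_tokens / n_words:.2f} tokens per word")
```
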
## Limitations

- **Vocabulary Coverage:** Although the tokenizer was trained on a diverse dataset, it may not cover all Kannada words or phrases, especially rare or domain-specific terms.
- **Biases:** The tokenizer inherits any biases present in the training data. Users should be cautious when applying it to sensitive or critical applications.
- **Out-of-Vocabulary Words:** Unseen words are broken into subword tokens, and characters absent from the vocabulary map to `[UNK]`, either of which can affect performance in downstream tasks. A quick diagnostic is shown below.

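To check whether a given input falls back to `[UNK]` before relying on the tokenizer for domain-specific text, a small diagnostic (the example string is arbitrary):

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("charanhu/kannada-tokenizer")

# Flag any [UNK] fallback in the encoded ids.
text = "ಕೃತಕ ಬುದ್ಧಿಮತ್ತೆ"  # an arbitrary domain term ("artificial intelligence")
ids = tokenizer.encode(text)
print(tokenizer.convert_ids_to_tokens(ids))
print("Contains [UNK]:", tokenizer.unk_token_id in ids)
```
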
## Ethical Considerations

- **Data Privacy:** The dataset used is publicly available, and care was taken to ensure that no personal or sensitive information is included.
- **Bias Mitigation:** No specific bias mitigation techniques were applied. Users should be aware of potential biases in the tokenizer due to the training data.

## Recommendations

- **Fine-tuning:** For best results in specific applications, consider training or fine-tuning language models that use this tokenizer on domain-specific data; a wiring sketch follows this list.
- **Evaluation:** Users should evaluate the tokenizer in their specific context to ensure it meets their requirements.

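For instance, pairing the tokenizer with a freshly initialized causal language model might look like the sketch below; the GPT-2 architecture and all layer sizes are illustrative assumptions, not something this repository ships:

```python
from transformers import GPT2Config, GPT2LMHeadModel, PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("charanhu/kannada-tokenizer")

# A small causal LM whose embedding table matches the tokenizer's
# 32,000-token vocabulary; architecture and sizes are illustrative.
config = GPT2Config(
    vocab_size=tokenizer.vocab_size,
    n_positions=512,
    n_embd=256,
    n_layer=4,
    n_head=4,
    pad_token_id=tokenizer.pad_token_id,
)
model = GPT2LMHeadModel(config)
print(f"Parameters: {model.num_parameters():,}")
```
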
## Acknowledgments

- **Dataset:** Thanks to [Cognitive-Lab](https://huggingface.co/Cognitive-Lab) for providing the [Kannada-Instruct-dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset).
- **Libraries:**
  - [Hugging Face Tokenizers](https://github.com/huggingface/tokenizers)
  - [Hugging Face Transformers](https://github.com/huggingface/transformers)

## License

This tokenizer is released under the [MIT License](LICENSE).

## Citation

If you use this tokenizer in your research or applications, please consider citing it:

```bibtex
@misc{kannada_tokenizer_2023,
  title={Kannada Tokenizer},
  author={charanhu},
  year={2023},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/charanhu/kannada-tokenizer}},
}
```