---
language: kn
tags:
- kannada
- tokenizer
- bpe
- nlp
- huggingface
license: mit
datasets:
- Cognitive-Lab/Kannada-Instruct-dataset
pipeline_tag: text-generation
---
# Kannada Tokenizer
[![Hugging Face](https://img.shields.io/badge/HuggingFace-Model%20Card-orange)](https://huggingface.co/charanhu/kannada-tokenizer)
This is a Byte-Pair Encoding (BPE) tokenizer trained specifically for the Kannada language using the `translated_output` column from the [Cognitive-Lab/Kannada-Instruct-dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset). It is suitable for various Natural Language Processing (NLP) tasks involving Kannada text.
## Model Description
- **Model Type:** Byte-Pair Encoding (BPE) Tokenizer
- **Language:** Kannada (`kn`)
- **Vocabulary Size:** 32,000
- **Special Tokens:**
  - `[UNK]` (Unknown token)
  - `[PAD]` (Padding token)
  - `[CLS]` (Classifier token)
  - `[SEP]` (Separator token)
  - `[MASK]` (Masking token)
- **License:** MIT License
- **Dataset Used:** [Cognitive-Lab/Kannada-Instruct-dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset)
- **Algorithm:** Byte-Pair Encoding (BPE)
## Intended Use
This tokenizer is intended for NLP applications involving the Kannada language, such as:
- **Language Modeling**
- **Text Generation**
- **Text Classification**
- **Machine Translation**
- **Named Entity Recognition**
- **Question Answering**
- **Summarization**
## How to Use
You can load the tokenizer directly from the Hugging Face Hub:
```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("charanhu/kannada-tokenizer")

# Encode a sentence to token IDs, inspect the tokens, then decode back
text = "ನೀವು ಹೇಗಿದ್ದೀರಿ?"
encoding = tokenizer.encode(text)
tokens = tokenizer.convert_ids_to_tokens(encoding)
decoded_text = tokenizer.decode(encoding)

print("Original Text:", text)
print("Tokens:", tokens)
print("Decoded Text:", decoded_text)
```
**Output:**
```
Original Text: ನೀವು ಹೇಗಿದ್ದೀರಿ?
Tokens: ['ನೀವು', 'ಹೇಗಿದ್ದೀರಿ', '?']
Decoded Text: ನೀವು ಹೇಗಿದ್ದೀರಿ?
```
## Training Data
The tokenizer was trained on the `translated_output` column from the [Cognitive-Lab/Kannada-Instruct-dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset). This dataset contains translated instructions and responses in Kannada, providing a rich corpus for effective tokenization.
- **Dataset Size:** The dataset contains a large collection of instruction–response pairs covering a wide range of topics and linguistic structures in Kannada.
- **Data Preprocessing:** Text normalization was applied using NFKC normalization to standardize characters.
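The NFKC step can be illustrated with Python's standard-library `unicodedata` module (a sketch of the preprocessing, not the exact script used for this release):

```python
import unicodedata

def normalize_text(text: str) -> str:
    """Apply NFKC normalization so visually identical characters share one encoding."""
    return unicodedata.normalize("NFKC", text)

# Compatibility characters are folded to their canonical equivalents,
# e.g. the "fi" ligature becomes the two letters "f" + "i".
# Kannada text passes through unchanged unless it contains non-canonical forms.
normalized = normalize_text("ನೀವು ಹೇಗಿದ್ದೀರಿ?")
```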
## Training Procedure
- **Normalization:** NFKC normalization was used to handle canonical decomposition and compatibility decomposition, ensuring that characters are represented consistently.
- **Pre-tokenization:** The text was pre-tokenized using whitespace splitting.
- **Tokenizer Algorithm:** Byte-Pair Encoding (BPE) was chosen for its effectiveness in handling subword units, which is beneficial for languages with rich morphology like Kannada.
- **Vocabulary Size:** Set to 32,000 to balance between coverage and efficiency.
- **Special Tokens:** Included `[UNK]`, `[PAD]`, `[CLS]`, `[SEP]`, `[MASK]` to support various downstream tasks.
- **Training Library:** The tokenizer was built using the [Hugging Face Tokenizers](https://github.com/huggingface/tokenizers) library.
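The steps above can be sketched with the Tokenizers library as follows. This is a minimal reconstruction under the settings stated in this card; the actual training script and corpus iteration are assumptions (the `corpus` below is a placeholder for the `translated_output` column):

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# BPE model with an explicit unknown token
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))

# NFKC normalization, then whitespace-based pre-tokenization
tokenizer.normalizer = normalizers.NFKC()
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32000,
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"],
)

# In practice, stream the dataset's `translated_output` column here
corpus = ["ನೀವು ಹೇಗಿದ್ದೀರಿ?"]  # placeholder
tokenizer.train_from_iterator(corpus, trainer)
```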
## Evaluation
The tokenizer was qualitatively evaluated on a set of Kannada sentences to ensure reasonable tokenization. However, quantitative evaluation metrics such as tokenization efficiency or perplexity were not computed.
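One simple quantitative check users may wish to run themselves is fertility: the average number of subword tokens produced per whitespace-separated word, where lower values generally indicate more efficient tokenization. A generic helper (hypothetical, not part of this repository):

```python
def fertility(tokenize, sentences):
    """Average number of tokens produced per whitespace-separated word."""
    total_words = sum(len(s.split()) for s in sentences)
    total_tokens = sum(len(tokenize(s)) for s in sentences)
    return total_tokens / total_words

# With this tokenizer loaded, usage would look like:
#   fertility(tokenizer.tokenize, kannada_sentences)
```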
## Limitations
- **Vocabulary Coverage:** While the tokenizer is trained on a diverse dataset, it may not include all possible words or phrases in Kannada, especially rare or domain-specific terms.
- **Biases:** The tokenizer inherits any biases present in the training data. Users should be cautious when applying it to sensitive or critical applications.
- **Out-of-Vocabulary Words:** Out-of-vocabulary words may be broken into subword tokens or mapped to the `[UNK]` token, which could affect performance in downstream tasks.
## Ethical Considerations
- **Data Privacy:** The dataset used is publicly available, and care was taken to ensure that no personal or sensitive information is included.
- **Bias Mitigation:** No specific bias mitigation techniques were applied. Users should be aware of potential biases in the tokenizer due to the training data.
## Recommendations
- **Fine-tuning:** For best results in specific applications, consider fine-tuning language models with this tokenizer on domain-specific data.
- **Evaluation:** Users should evaluate the tokenizer in their specific context to ensure it meets their requirements.
## Acknowledgments
- **Dataset:** Thanks to [Cognitive-Lab](https://huggingface.co/Cognitive-Lab) for providing the [Kannada-Instruct-dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset).
- **Libraries:**
  - [Hugging Face Tokenizers](https://github.com/huggingface/tokenizers)
  - [Hugging Face Transformers](https://github.com/huggingface/transformers)
## License
This tokenizer is released under the [MIT License](LICENSE).
## Citation
If you use this tokenizer in your research or applications, please consider citing it:
```bibtex
@misc{kannada_tokenizer_2023,
  title        = {Kannada Tokenizer},
  author       = {charanhu},
  year         = {2023},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/charanhu/kannada-tokenizer}},
}
```