---
language: kn
tags:
- kannada
- tokenizer
- bpe
- nlp
- huggingface
license: mit
datasets:
- Cognitive-Lab/Kannada-Instruct-dataset
pipeline_tag: text-generation
---

# Kannada Tokenizer

[![Hugging Face](https://img.shields.io/badge/HuggingFace-Model%20Card-orange)](https://huggingface.co/charanhu/kannada-tokenizer)

This is a Byte-Pair Encoding (BPE) tokenizer trained specifically for Kannada on the `translated_output` column of the [Cognitive-Lab/Kannada-Instruct-dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset). It is suitable for a variety of Natural Language Processing (NLP) tasks involving Kannada text.

## Model Description

- **Model Type:** Byte-Pair Encoding (BPE) Tokenizer
- **Language:** Kannada (`kn`)
- **Vocabulary Size:** 32,000
- **Special Tokens:**
  - `[UNK]` (unknown token)
  - `[PAD]` (padding token)
  - `[CLS]` (classifier token)
  - `[SEP]` (separator token)
  - `[MASK]` (masking token)
- **License:** MIT License
- **Dataset Used:** [Cognitive-Lab/Kannada-Instruct-dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset)
- **Algorithm:** Byte-Pair Encoding (BPE)

## Intended Use

This tokenizer is intended for NLP applications involving the Kannada language, such as:

- **Language Modeling**
- **Text Generation**
- **Text Classification**
- **Machine Translation**
- **Named Entity Recognition**
- **Question Answering**
- **Summarization**

## How to Use

You can load the tokenizer directly from the Hugging Face Hub:

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("charanhu/kannada-tokenizer")

# Example usage
text = "ನೀವು ಹೇಗಿದ್ದೀರಿ?"
encoding = tokenizer.encode(text)
tokens = tokenizer.convert_ids_to_tokens(encoding)
decoded_text = tokenizer.decode(encoding)

print("Original Text:", text)
print("Tokens:", tokens)
print("Decoded Text:", decoded_text)
```

**Output:**

```
Original Text: ನೀವು ಹೇಗಿದ್ದೀರಿ?
Tokens: ['ನೀವು', 'ಹೇಗಿದ್ದೀರಿ', '?']
Decoded Text: ನೀವು ಹೇಗಿದ್ದೀರಿ?
```

## Training Data

The tokenizer was trained on the `translated_output` column of the [Cognitive-Lab/Kannada-Instruct-dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset). This dataset contains instructions and responses translated into Kannada, providing a rich corpus for tokenizer training.

- **Dataset Size:** The dataset contains a large number of instruction-response pairs covering a wide range of topics and linguistic structures in Kannada.
- **Data Preprocessing:** Text was normalized with Unicode NFKC to standardize character representations.

## Training Procedure

- **Normalization:** NFKC normalization handles canonical and compatibility decomposition, ensuring that characters are represented consistently.
- **Pre-tokenization:** The text was pre-tokenized on whitespace.
- **Tokenizer Algorithm:** BPE was chosen for its effectiveness with subword units, which benefits morphologically rich languages like Kannada.
- **Vocabulary Size:** Set to 32,000 to balance coverage and efficiency.
- **Special Tokens:** `[UNK]`, `[PAD]`, `[CLS]`, `[SEP]`, and `[MASK]` were included to support various downstream tasks.
- **Training Library:** The tokenizer was built using the [Hugging Face Tokenizers](https://github.com/huggingface/tokenizers) library; a minimal reconstruction of the recipe is sketched below, after the Evaluation section.

## Evaluation

The tokenizer was qualitatively evaluated on a set of Kannada sentences to ensure reasonable tokenization. However, quantitative evaluation metrics such as tokenization efficiency or perplexity were not computed.
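As a lightweight starting point, the sketch below shows one way to measure tokenization efficiency on your own data: it computes fertility (average subword tokens per whitespace-delimited word) and average characters per token. The sample sentences are placeholders; substitute a corpus representative of your application.

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("charanhu/kannada-tokenizer")

# Placeholder evaluation data; substitute a representative Kannada corpus.
sample_sentences = [
    "ನೀವು ಹೇಗಿದ್ದೀರಿ?",
    "ಕನ್ನಡ ಒಂದು ದ್ರಾವಿಡ ಭಾಷೆ.",
]

total_tokens = total_words = total_chars = 0
for sentence in sample_sentences:
    ids = tokenizer.encode(sentence, add_special_tokens=False)
    total_tokens += len(ids)
    total_words += len(sentence.split())
    total_chars += len(sentence)

# Fertility: average number of subword tokens per whitespace-delimited word.
print(f"Fertility (tokens/word): {total_tokens / total_words:.2f}")
# Compression: average number of characters covered by each token.
print(f"Characters per token: {total_chars / total_tokens:.2f}")
```

Comparing these numbers against a general-purpose multilingual tokenizer on the same corpus gives a quick sense of how much a Kannada-specific vocabulary reduces token counts.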
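For users who want to reproduce or adapt the tokenizer, the sketch below reconstructs the recipe from the Training Procedure section using the Hugging Face `tokenizers`, `datasets`, and `transformers` libraries. The original training script is not published with this card, so the batch size, the output path, and the exact pre-tokenizer variant (`Whitespace()`, which also isolates punctuation, versus pure whitespace splitting) are assumptions.

```python
from datasets import load_dataset
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

# Stream the training text (the `translated_output` column) in batches.
dataset = load_dataset("Cognitive-Lab/Kannada-Instruct-dataset", split="train")

def corpus_iterator(batch_size=1_000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["translated_output"]

# BPE model with NFKC normalization, as described above. Whitespace() also
# splits off punctuation, which matches the example output where "?" is its
# own token; the exact pre-tokenizer used is an assumption.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.NFKC()
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32_000,
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train_from_iterator(corpus_iterator(), trainer=trainer)

# Wrap for transformers compatibility and save (output path is illustrative).
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)
fast_tokenizer.save_pretrained("kannada-tokenizer")
```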
## Limitations

- **Vocabulary Coverage:** While the tokenizer is trained on a diverse dataset, it may not include all possible words or phrases in Kannada, especially rare or domain-specific terms.
- **Biases:** The tokenizer inherits any biases present in the training data. Users should be cautious when applying it to sensitive or critical applications.
- **Out-of-Vocabulary Words:** Out-of-vocabulary words may be broken into subword tokens or mapped to the `[UNK]` token, which could affect performance in downstream tasks.

## Ethical Considerations

- **Data Privacy:** The dataset used is publicly available, and care was taken to ensure that no personal or sensitive information is included.
- **Bias Mitigation:** No specific bias mitigation techniques were applied. Users should be aware of potential biases in the tokenizer due to the training data.

## Recommendations

- **Fine-tuning:** For best results in specific applications, consider fine-tuning language models with this tokenizer on domain-specific data.
- **Evaluation:** Users should evaluate the tokenizer in their specific context to ensure it meets their requirements.

## Acknowledgments

- **Dataset:** Thanks to [Cognitive-Lab](https://huggingface.co/Cognitive-Lab) for providing the [Kannada-Instruct-dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset).
- **Libraries:**
  - [Hugging Face Tokenizers](https://github.com/huggingface/tokenizers)
  - [Hugging Face Transformers](https://github.com/huggingface/transformers)

## License

This tokenizer is released under the [MIT License](LICENSE).

## Citation

If you use this tokenizer in your research or applications, please consider citing it:

```bibtex
@misc{kannada_tokenizer_2023,
  title={Kannada Tokenizer},
  author={charanhu},
  year={2023},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/charanhu/kannada-tokenizer}},
}
```