
Kannada Tokenizer

This is a Byte-Pair Encoding (BPE) tokenizer trained specifically for the Kannada language using the translated_output column from the Cognitive-Lab/Kannada-Instruct-dataset. It is suitable for various Natural Language Processing (NLP) tasks involving Kannada text.

Model Description

  • Model Type: Byte-Pair Encoding (BPE) Tokenizer
  • Language: Kannada (kn)
  • Vocabulary Size: 32,000
  • Special Tokens:
    • [UNK] (Unknown token)
    • [PAD] (Padding token)
    • [CLS] (Classifier token)
    • [SEP] (Separator token)
    • [MASK] (Masking token)
  • License: MIT License
  • Dataset Used: Cognitive-Lab/Kannada-Instruct-dataset
  • Algorithm: Byte-Pair Encoding (BPE)

Intended Use

This tokenizer is intended for NLP applications involving the Kannada language, such as:

  • Language Modeling
  • Text Generation
  • Text Classification
  • Machine Translation
  • Named Entity Recognition
  • Question Answering
  • Summarization

How to Use

You can load the tokenizer directly from the Hugging Face Hub:

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("charanhu/kannada-tokenizer")

# Example usage: encode text to token ids, inspect the corresponding
# token strings, then decode the ids back to text
text = "ನೀವು ಹೇಗಿದ್ದೀರಿ?"
ids = tokenizer.encode(text)
tokens = tokenizer.convert_ids_to_tokens(ids)
decoded_text = tokenizer.decode(ids)

print("Original Text:", text)
print("Tokens:", tokens)
print("Decoded Text:", decoded_text)

Output:

Original Text: ನೀವು ಹೇಗಿದ್ದೀರಿ?
Tokens: ['ನೀವು', 'ಹೇಗಿದ್ದೀರಿ', '?']
Decoded Text: ನೀವು ಹೇಗಿದ್ದೀರಿ?
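
For feeding batches into a model, the tokenizer can also pad sequences to a common length. A minimal sketch, assuming the [PAD] token is registered as the padding token (the second sentence is an arbitrary example):

# Batch encoding with padding; set pad_token explicitly in case it is
# not already configured on the loaded tokenizer
tokenizer.pad_token = "[PAD]"
batch = tokenizer(["ನೀವು ಹೇಗಿದ್ದೀರಿ?", "ಧನ್ಯವಾದಗಳು"], padding=True)
print(batch["input_ids"])
print(batch["attention_mask"])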

Training Data

The tokenizer was trained on the translated_output column from the Cognitive-Lab/Kannada-Instruct-dataset. This dataset contains translated instructions and responses in Kannada, providing a rich corpus for effective tokenization.

  • Dataset Size: The dataset contains a large collection of Kannada instruction–response pairs covering a wide range of topics and linguistic structures.
  • Data Preprocessing: Text normalization was applied using NFKC normalization to standardize characters.
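
As a quick illustration of what NFKC normalization does (the Latin ligature below is purely illustrative, not drawn from the dataset):

import unicodedata

# NFKC folds compatibility characters into canonical equivalents, so
# visually identical strings map to identical code-point sequences
print(unicodedata.normalize("NFKC", "ﬁle"))  # ligature "ﬁ" -> "file"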

Training Procedure

  • Normalization: NFKC normalization was used to handle canonical decomposition and compatibility decomposition, ensuring that characters are represented consistently.
  • Pre-tokenization: The text was pre-tokenized using whitespace splitting.
  • Tokenizer Algorithm: Byte-Pair Encoding (BPE) was chosen for its effectiveness in handling subword units, which is beneficial for languages with rich morphology like Kannada.
  • Vocabulary Size: Set to 32,000 to balance between coverage and efficiency.
  • Special Tokens: Included [UNK], [PAD], [CLS], [SEP], [MASK] to support various downstream tasks.
  • Training Library: The tokenizer was built using the Hugging Face Tokenizers library.
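
The original training script is not included in this card, but a minimal sketch of the setup described above, using the Hugging Face Tokenizers library, would look roughly like this (the split name and text iterator are assumptions):

from datasets import load_dataset
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# Stream the Kannada text column (split name assumed to be "train")
ds = load_dataset("Cognitive-Lab/Kannada-Instruct-dataset", split="train")
texts = (row["translated_output"] for row in ds)

# BPE model with NFKC normalization and whitespace pre-tokenization;
# note that Whitespace() also separates punctuation, which is consistent
# with the standalone '?' token in the example output above
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.NFKC()
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32000,
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train_from_iterator(texts, trainer=trainer)
tokenizer.save("tokenizer.json")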

Evaluation

The tokenizer was qualitatively evaluated on a set of Kannada sentences to ensure reasonable tokenization. However, quantitative metrics such as tokenization fertility (average tokens per word) or downstream model perplexity were not computed.

Limitations

  • Vocabulary Coverage: While the tokenizer is trained on a diverse dataset, it may not include all possible words or phrases in Kannada, especially rare or domain-specific terms.
  • Biases: The tokenizer inherits any biases present in the training data. Users should be cautious when applying it to sensitive or critical applications.
  • Out-of-Vocabulary Words: Out-of-vocabulary words may be broken into subword tokens or mapped to the [UNK] token, which could affect performance in downstream tasks.
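
A quick way to inspect how a given string is handled (the input below is an arbitrary example, not a known failure case):

# Inspect how an arbitrary input is split; characters the vocabulary
# cannot represent fall back to the [UNK] token
ids = tokenizer.encode("ಕನ್ನಡ NLP 2023")
print(tokenizer.convert_ids_to_tokens(ids))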

Ethical Considerations

  • Data Privacy: The dataset used is publicly available, and care was taken to ensure that no personal or sensitive information is included.
  • Bias Mitigation: No specific bias mitigation techniques were applied. Users should be aware of potential biases in the tokenizer due to the training data.

Recommendations

  • Fine-tuning: For best results in specific applications, consider fine-tuning language models with this tokenizer on domain-specific data.
  • Evaluation: Users should evaluate the tokenizer in their specific context to ensure it meets their requirements.
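
One simple quantitative check is fertility, the average number of tokens produced per whitespace-separated word; a lower value generally means the vocabulary compresses the corpus better. A minimal sketch (the sample sentences are placeholders for your own corpus):

def fertility(tokenizer, sentences):
    # Average tokens produced per whitespace-separated word
    n_words = sum(len(s.split()) for s in sentences)
    n_tokens = sum(len(tokenizer.encode(s)) for s in sentences)
    return n_tokens / max(n_words, 1)

sample = ["ನೀವು ಹೇಗಿದ್ದೀರಿ?", "ಇದು ಒಂದು ಉದಾಹರಣೆ ವಾಕ್ಯ."]
print("Tokens per word:", fertility(tokenizer, sample))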

License

This tokenizer is released under the MIT License.

Citation

If you use this tokenizer in your research or applications, please consider citing it:

@misc{kannada_tokenizer_2023,
  title={Kannada Tokenizer},
  author={charanhu},
  year={2023},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/charanhu/kannada-tokenizer}},
}