Commit 10d210c (1 parent: bbe0b31), committed by charanhu

Update README.md

Files changed (1): README.md (+58 -32)
@@ -1,10 +1,24 @@
1
  # Kannada Tokenizer
2
 
3
  [![Hugging Face](https://img.shields.io/badge/HuggingFace-Model%20Card-orange)](https://huggingface.co/charanhu/kannada-tokenizer)
4
 
5
  This is a Byte-Pair Encoding (BPE) tokenizer trained specifically for the Kannada language using the `translated_output` column from the [Cognitive-Lab/Kannada-Instruct-dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset). It is suitable for various Natural Language Processing (NLP) tasks involving Kannada text.
6
 
7
- ## Model Details
8
 
9
  - **Model Type:** Byte-Pair Encoding (BPE) Tokenizer
10
  - **Language:** Kannada (`kn`)
@@ -15,33 +29,23 @@ This is a Byte-Pair Encoding (BPE) tokenizer trained specifically for the Kannad
15
  - `[CLS]` (Classifier token)
16
  - `[SEP]` (Separator token)
17
  - `[MASK]` (Masking token)
18
-
19
- ## Training Data
20
-
21
- The tokenizer was trained on the `translated_output` column from the [Cognitive-Lab/Kannada-Instruct-dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset). This dataset contains translated instructions and responses in Kannada, providing a rich corpus for effective tokenization.
22
-
23
- - **Dataset Size:** The dataset includes a significant number of entries covering a wide range of topics and linguistic structures in Kannada.
24
- - **Data Preprocessing:** Text normalization was applied using NFKC normalization to standardize characters.
25
-
26
- ## Training Procedure
27
-
28
- - **Normalization:** NFKC normalization (compatibility decomposition followed by canonical composition) was applied so that equivalent character sequences are represented consistently.
29
- - **Pre-tokenization:** The text was pre-tokenized using whitespace splitting.
30
- - **Tokenizer Algorithm:** Byte-Pair Encoding (BPE) was chosen for its effectiveness in handling subword units, which is beneficial for languages with rich morphology like Kannada.
31
- - **Training Library:** The tokenizer was built using the [Hugging Face Tokenizers](https://github.com/huggingface/tokenizers) library.
32
 
33
  ## Intended Use
34
 
35
  This tokenizer is intended for NLP applications involving the Kannada language, such as:
36
 
37
- - Language Modeling
38
- - Text Classification
39
- - Machine Translation
40
- - Named Entity Recognition
41
- - Question Answering
42
- - Summarization
 
43
 
44
- ## Usage
45
 
46
  You can load the tokenizer directly from the Hugging Face Hub:
47
 
@@ -69,21 +73,42 @@ Tokens: ['ನೀವು', 'ಹೇಗಿದ್ದೀರಿ', '?']
69
  Decoded Text: ನೀವು ಹೇಗಿದ್ದೀರಿ?
70
  ```
71
 
72
  ## Limitations
73
 
74
- - **Vocabulary Coverage:** While the tokenizer is trained on a diverse dataset, it may not include all possible words or phrases in Kannada.
75
  - **Biases:** The tokenizer inherits any biases present in the training data. Users should be cautious when applying it to sensitive or critical applications.
76
- - **OOV Words:** Out-of-vocabulary words may be broken into subword tokens or mapped to the `[UNK]` token.
77
 
78
  ## Recommendations
79
 
80
  - **Fine-tuning:** For best results in specific applications, consider fine-tuning language models with this tokenizer on domain-specific data.
81
  - **Evaluation:** Users should evaluate the tokenizer in their specific context to ensure it meets their requirements.
82
 
83
- ## License
84
-
85
- [MIT License](LICENSE)
86
-
87
  ## Acknowledgments
88
 
89
  - **Dataset:** Thanks to [Cognitive-Lab](https://huggingface.co/Cognitive-Lab) for providing the [Kannada-Instruct-dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset).
@@ -91,6 +116,10 @@ Decoded Text: ನೀವು ಹೇಗಿದ್ದೀರಿ?
91
  - [Hugging Face Tokenizers](https://github.com/huggingface/tokenizers)
92
  - [Hugging Face Transformers](https://github.com/huggingface/transformers)
93
 
94
  ## Citation
95
 
96
  If you use this tokenizer in your research or applications, please consider citing it:
@@ -100,10 +129,7 @@ If you use this tokenizer in your research or applications, please consider citi
100
  title={Kannada Tokenizer},
101
  author={charanhu},
102
  year={2023},
 
103
  howpublished={\url{https://huggingface.co/charanhu/kannada-tokenizer}},
104
  }
105
  ```
106
-
107
- ## Contact Information
108
-
109
- For questions or comments about the tokenizer, please contact [charanhu](https://huggingface.co/charanhu).
 
1
+ ---
2
+ language: kn
3
+ tags:
4
+ - kannada
5
+ - tokenizer
6
+ - bpe
7
+ - nlp
8
+ - huggingface
9
+ license: mit
10
+ datasets:
11
+ - Cognitive-Lab/Kannada-Instruct-dataset
12
+ pipeline_tag: text-generation
13
+ ---
14
+
15
  # Kannada Tokenizer
16
 
17
  [![Hugging Face](https://img.shields.io/badge/HuggingFace-Model%20Card-orange)](https://huggingface.co/charanhu/kannada-tokenizer)
18
 
19
  This is a Byte-Pair Encoding (BPE) tokenizer trained specifically for the Kannada language using the `translated_output` column from the [Cognitive-Lab/Kannada-Instruct-dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset). It is suitable for various Natural Language Processing (NLP) tasks involving Kannada text.
20
 
21
+ ## Model Description
22
 
23
  - **Model Type:** Byte-Pair Encoding (BPE) Tokenizer
24
  - **Language:** Kannada (`kn`)
 
29
  - `[CLS]` (Classifier token)
30
  - `[SEP]` (Separator token)
31
  - `[MASK]` (Masking token)
32
+ - **License:** MIT License
33
+ - **Dataset Used:** [Cognitive-Lab/Kannada-Instruct-dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset)
34
+ - **Algorithm:** Byte-Pair Encoding (BPE)
35
 
36
  ## Intended Use
37
 
38
  This tokenizer is intended for NLP applications involving the Kannada language, such as:
39
 
40
+ - **Language Modeling**
41
+ - **Text Generation**
42
+ - **Text Classification**
43
+ - **Machine Translation**
44
+ - **Named Entity Recognition**
45
+ - **Question Answering**
46
+ - **Summarization**
47
 
48
+ ## How to Use
49
 
50
  You can load the tokenizer directly from the Hugging Face Hub:
51
 
 
73
  Decoded Text: ನೀವು ಹೇಗಿದ್ದೀರಿ?
74
  ```
75
 
76
+ ## Training Data
77
+
78
+ The tokenizer was trained on the `translated_output` column from the [Cognitive-Lab/Kannada-Instruct-dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset). This dataset contains translated instructions and responses in Kannada, providing a rich corpus for effective tokenization.
79
+
80
+ - **Dataset Size:** The dataset includes a significant number of entries covering a wide range of topics and linguistic structures in Kannada.
81
+ - **Data Preprocessing:** Text normalization was applied using NFKC normalization to standardize characters.
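
The NFKC preprocessing step mentioned above can be illustrated with Python's standard `unicodedata` module. This is a minimal sketch, not the author's preprocessing code: the helper name `normalize_text` and the sample strings are assumptions chosen for illustration.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Apply NFKC normalization (hypothetical helper, not from the original README)."""
    return unicodedata.normalize("NFKC", text)


# Compatibility characters such as full-width digits fold to their plain forms.
print(normalize_text("\uff11\uff12\uff13"))  # full-width "123" becomes ASCII "123"

# The same Kannada syllable can be encoded with a two-part or a single vowel sign.
# After normalization both spellings compare equal, so the tokenizer sees one form.
two_part = "\u0c95\u0cc6\u0cd5"   # KA + VOWEL SIGN E + LENGTH MARK
single = "\u0c95\u0cc7"           # KA + VOWEL SIGN EE
print(normalize_text(two_part) == normalize_text(single))
```

Normalizing before training and again before encoding keeps visually identical inputs from producing different token sequences.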
82
+
83
+ ## Training Procedure
84
+
85
+ - **Normalization:** NFKC normalization (compatibility decomposition followed by canonical composition) was applied so that equivalent character sequences are represented consistently.
86
+ - **Pre-tokenization:** The text was pre-tokenized using whitespace splitting.
87
+ - **Tokenizer Algorithm:** Byte-Pair Encoding (BPE) was chosen for its effectiveness in handling subword units, which is beneficial for languages with rich morphology like Kannada.
88
+ - **Vocabulary Size:** Set to 32,000 to balance coverage and efficiency.
89
+ - **Special Tokens:** Included `[UNK]`, `[PAD]`, `[CLS]`, `[SEP]`, `[MASK]` to support various downstream tasks.
90
+ - **Training Library:** The tokenizer was built using the [Hugging Face Tokenizers](https://github.com/huggingface/tokenizers) library.
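
To make the BPE step in the procedure above concrete, here is a toy merge-learning loop in pure Python. It is a sketch of the general technique only, not the Hugging Face Tokenizers implementation and not the script used to train this tokenizer. The function names, the tie-breaking rule, and the sample words are all assumptions, and the loop works on Unicode code points after whitespace splitting.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merges from a list of whitespace-split words."""
    # Each distinct word starts as a tuple of single code points, weighted by frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken lexicographically (an assumption).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(merge_pair(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def apply_merges(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Tiny illustrative corpus; a real run would use the full training text.
sample_words = "ನೀವು ಹೇಗಿದ್ದೀರಿ ನೀವು ಹೇಗಿದ್ದೀರಿ ನೀವು".split()
learned = learn_bpe_merges(sample_words, num_merges=5)
print(learned)
print(apply_merges("ನೀವು", learned))
```

In a production tokenizer the number of merges is driven by the target vocabulary size (32,000 here) rather than a fixed small count, and characters never seen in training fall back to `[UNK]`.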
91
+
92
+ ## Evaluation
93
+
94
+ The tokenizer was qualitatively evaluated on a set of Kannada sentences to ensure reasonable tokenization. However, quantitative evaluation metrics such as tokenization efficiency or perplexity were not computed.
95
+
96
  ## Limitations
97
 
98
+ - **Vocabulary Coverage:** While the tokenizer is trained on a diverse dataset, it may not include all possible words or phrases in Kannada, especially rare or domain-specific terms.
99
  - **Biases:** The tokenizer inherits any biases present in the training data. Users should be cautious when applying it to sensitive or critical applications.
100
+ - **Out-of-Vocabulary Words:** Out-of-vocabulary words may be broken into subword tokens or mapped to the `[UNK]` token, which could affect performance in downstream tasks.
101
+
102
+ ## Ethical Considerations
103
+
104
+ - **Data Privacy:** The dataset used is publicly available, and care was taken to ensure that no personal or sensitive information is included.
105
+ - **Bias Mitigation:** No specific bias mitigation techniques were applied. Users should be aware of potential biases in the tokenizer due to the training data.
106
 
107
  ## Recommendations
108
 
109
  - **Fine-tuning:** For best results in specific applications, consider fine-tuning language models with this tokenizer on domain-specific data.
110
  - **Evaluation:** Users should evaluate the tokenizer in their specific context to ensure it meets their requirements.
111
 
112
  ## Acknowledgments
113
 
114
  - **Dataset:** Thanks to [Cognitive-Lab](https://huggingface.co/Cognitive-Lab) for providing the [Kannada-Instruct-dataset](https://huggingface.co/datasets/Cognitive-Lab/Kannada-Instruct-dataset).
 
116
  - [Hugging Face Tokenizers](https://github.com/huggingface/tokenizers)
117
  - [Hugging Face Transformers](https://github.com/huggingface/transformers)
118
 
119
+ ## License
120
+
121
+ This tokenizer is released under the [MIT License](LICENSE).
122
+
123
  ## Citation
124
 
125
  If you use this tokenizer in your research or applications, please consider citing it:
 
129
  title={Kannada Tokenizer},
130
  author={charanhu},
131
  year={2023},
132
+ publisher={Hugging Face},
133
  howpublished={\url{https://huggingface.co/charanhu/kannada-tokenizer}},
134
  }
135
  ```