Commit 25c9b14 (verified) by metinovadilet · 1 Parent(s): c99db07

Update README.md

  1. README.md +35 -3
README.md CHANGED
@@ -1,3 +1,35 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ language:
+ - ky
+ tags:
+ - tokenization
+ - BPE
+ - kyrgyz
+ - tokenizer
+ ---
+ A tokenizer for the Kyrgyz language, built with SentencePiece using Byte Pair Encoding (BPE). It has a 200,000-subword vocabulary intended to cover Kyrgyz text across a range of NLP tasks. It was developed in collaboration with UlutSoft LLC to reflect authentic Kyrgyz language usage.
+ Features:
+
+ Language: Kyrgyz
+ Vocabulary Size: 200,000 subwords
+ Method: SentencePiece (BPE)
+
+ Applications: Data preparation for language models, machine translation, sentiment analysis, chatbots.
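To make the "BPE" method named above more concrete, here is a minimal sketch of how Byte Pair Encoding learns merges from word counts. It is a toy illustration only: it is not SentencePiece's implementation (which, among other things, works on raw sentences with a whitespace marker), and the function names `merge_pair`, `learn_bpe`, and `segment`, the tiny corpus, and the tie-breaking rule are all assumptions made for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges from a list of words (toy version)."""
    # Start from single characters; count how often each word occurs.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by the pair itself so the
        # result is deterministic (an arbitrary choice for this sketch).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split a word into subwords by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical mini-corpus of Kyrgyz word forms sharing the stem "тил".
corpus = ["тил", "тил", "тили", "тилде"]
merges = learn_bpe(corpus, num_merges=2)
print(merges)
print(segment("тилдер", merges))
```

With a real vocabulary of 200,000 entries the same idea is applied for many more merges, so frequent stems and suffixes end up as single subwords while rare words fall back to smaller pieces.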
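+ Usage Example (Python with transformers):
+
+ ```python
+ from transformers import AutoTokenizer
+
+ # Replace the placeholder with the tokenizer's Hub repo ID or a local directory.
+ tokenizer = AutoTokenizer.from_pretrained("Your/Tokenizer/Path")
+ # Sample sentence: "The Kyrgyz language is a rich and beautiful language."
+ text = "Кыргыз тили – бай жана кооз тил."
+ # Calling the tokenizer returns an encoding that includes `input_ids`.
+ tokens = tokenizer(text)
+ print(tokens)
+ ```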
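The printed encoding above contains integer IDs, which are hard to read on their own. The sketch below shows one way to look at the subword pieces and check that decoding recovers the text. The helper name `inspect_encoding` is hypothetical; it assumes the loaded object behaves like a standard `transformers` tokenizer, i.e. that it is callable and provides `convert_ids_to_tokens` and `decode`.

```python
def inspect_encoding(tokenizer, text):
    """Return (ids, pieces, round_trip) for a single string.

    `tokenizer` is assumed to follow the usual transformers tokenizer
    interface; nothing here is specific to this repository.
    """
    encoding = tokenizer(text)
    ids = encoding["input_ids"]
    # Map each ID back to its subword string to see where the text was split.
    pieces = tokenizer.convert_ids_to_tokens(ids)
    # Decode the IDs to check how closely the original text is recovered.
    round_trip = tokenizer.decode(ids, skip_special_tokens=True)
    return ids, pieces, round_trip


# Usage, assuming `tokenizer` was loaded as in the README example:
#   ids, pieces, round_trip = inspect_encoding(tokenizer, "Кыргыз тили")
#   print(len(ids), pieces, round_trip)
```

Comparing `len(ids)` across a few sample sentences is a quick way to judge how compactly a tokenizer encodes Kyrgyz text.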
+ Tip: Applying text normalization during preprocessing, and lemmatization where the downstream task calls for it, may further improve results.
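As one possible reading of the normalization tip, the sketch below applies Unicode NFC normalization and collapses whitespace before tokenizing. The README does not say which normalization is intended, so the choice of NFC and the helper name `normalize_text` are assumptions. SentencePiece models can also carry their own normalization rules, in which case a step like this may be partly redundant.

```python
import unicodedata


def normalize_text(text):
    """Compose combining sequences (NFC) and collapse runs of whitespace."""
    # NFC turns e.g. "и" + combining breve (U+0306) into the single letter "й",
    # so visually identical strings map to the same code points.
    text = unicodedata.normalize("NFC", text)
    # Split on any whitespace and rejoin with single spaces.
    return " ".join(text.split())


decomposed = "бай" .replace("й", "и\u0306")  # same word, decomposed "й"
print(len(decomposed), len(normalize_text(decomposed)))
```

Normalizing before tokenization keeps the same word from being split differently depending on how its characters happen to be encoded in the source text.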
+
+ License and Attribution
+ This tokenizer is licensed under the MIT License and was developed in collaboration with UlutSoft LLC. Proper attribution is required when using this tokenizer or derived resources.
+
+ Feedback and Contributions
+ We welcome feedback, suggestions, and contributions! Please open an issue or a pull request in the repository to help us refine and enhance this resource.