metinovadilet/KyrgyzTokenizer-BPE-200k

metinovadilet committed about 1 month ago

Commit 25c9b14 · verified · 1 Parent(s): c99db07

Update README.md

Files changed (1)

README.md +35 -3

@@ -1,3 +1,35 @@
----
-license: mit
----

+---
+license: mit
+language:
+- ky
+tags:
+- tokenization
+- BPE
+- kyrgyz
+- tokenizer
+---
+A tokenizer for the Kyrgyz language, built with SentencePiece using Byte Pair Encoding (BPE). It has a 200,000-subword vocabulary intended to support a range of Kyrgyz NLP tasks. It was developed in collaboration with UlutSoft LLC to reflect authentic Kyrgyz language usage.
+Features:
+- Language: Kyrgyz
+- Vocabulary Size: 200,000 subwords
+- Method: SentencePiece (BPE)
+- Applications: Data preparation for language models, machine translation, sentiment analysis, chatbots.
+Usage Example (Python with transformers):
+```python
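+from transformers import AutoTokenizer
+
+# Replace the placeholder with the tokenizer's Hub repo ID or a local directory.
+tokenizer = AutoTokenizer.from_pretrained("Your/Tokenizer/Path")
+text = "Кыргыз тили – бай жана кооз тил."
+# Calling the tokenizer returns an encoding (including input_ids), not a list of strings.
+encoding = tokenizer(text)
+print(encoding)
+# Map the IDs back to subword pieces to inspect the segmentation.
+print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))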
+```
+Tip: Depending on the task, applying text normalization or lemmatization during preprocessing may further improve results.
+License and Attribution
+This tokenizer is licensed under the MIT License and was developed in collaboration with UlutSoft LLC. Proper attribution is required when using this tokenizer or derived resources.
+Feedback and Contributions
+We welcome feedback, suggestions, and contributions! Please open an issue or a pull request in the repository to help us refine and enhance this resource.
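
The README's preprocessing tip does not say which normalization to apply. As a minimal sketch, assuming Unicode NFC normalization plus whitespace collapsing is an acceptable choice, the step could look like the following. The helper name `normalize_text` is hypothetical and is not part of this repository; whether this matches the normalization used when the tokenizer was trained is unknown.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Hypothetical helper: NFC-normalize text and collapse whitespace."""
    # NFC composes a base letter and its combining mark into one code point
    # where a precomposed form exists (for example "и" + U+0306 becomes "й").
    text = unicodedata.normalize("NFC", text)
    # Collapse any run of whitespace into a single space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


# Sample input with doubled spaces, a tab, and trailing whitespace.
raw = "Кыргыз  тили –\tбай жана кооз тил. "
print(normalize_text(raw))
```

The normalized string can then be passed to the tokenizer in place of the raw text. NFC is chosen here only because it leaves the visible text unchanged; a different form (such as NFKC) or additional steps like lowercasing may suit a given task better.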