# BPE Tokenizer for Nepali LLM
This repository contains a Byte Pair Encoding (BPE) tokenizer trained with the Hugging Face `transformers` tooling on the Nepali LLM dataset. The tokenizer is optimized for Nepali text and is intended for language modeling and other natural language processing tasks.
## Overview
- Tokenizer Type: Byte Pair Encoding (BPE)
- Vocabulary Size: 50,000
- Dataset Used: Nepali LLM Datasets
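
The model card does not include the training script, but a BPE tokenizer with these properties can be trained with the Hugging Face `tokenizers` library. The sketch below is a minimal illustration, not the actual training code: the corpus, pre-tokenizer, and special tokens are assumptions, and a real run would stream the full Nepali LLM dataset instead of a two-sentence list.

```python
# Hedged sketch of BPE training with the `tokenizers` library.
# The corpus, pre-tokenizer choice, and special tokens are assumptions;
# only the vocabulary size (50,000) comes from the card above.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=50000,                    # matches the stated vocabulary size
    special_tokens=["[UNK]", "[PAD]"],   # assumed special tokens
)

# Tiny stand-in corpus; the real tokenizer was trained on the Nepali LLM dataset.
corpus = ["तपाईंलाई कस्तो छ?", "म सञ्चै छु।"]
tokenizer.train_from_iterator(corpus, trainer)

print(tokenizer.encode("तपाईंलाई कस्तो छ?").tokens)
```

With so little data the trained vocabulary stays far below 50,000 entries; the `vocab_size` argument is an upper bound that only a large corpus can reach.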
## Installation

To use the tokenizer, install the `transformers` library via pip:

```shell
pip install transformers
```
## Usage

You can load the tokenizer with the following code:

```python
from transformers import PreTrainedTokenizerFast

# Load the tokenizer from the Hugging Face Hub
tokenizer = PreTrainedTokenizerFast.from_pretrained("Aananda-giri/NepaliBPE")

# Example usage: encode Nepali text to token IDs and decode back
text = "तपाईंलाई कस्तो छ?"
tokens = tokenizer.encode(text)
print("Tokens:", tokens)
print("Decoded:", tokenizer.decode(tokens))
```
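
For model training you usually need batched, padded inputs rather than single-sentence `encode` calls. The sketch below shows the batch-encoding API of `PreTrainedTokenizerFast`; to stay self-contained (no Hub download) it wraps a toy in-memory BPE tokenizer, and the `[PAD]` token is an assumption. In practice you would call `from_pretrained("Aananda-giri/NepaliBPE")` as above and use the same `tokenizer(...)` call.

```python
# Hedged, self-contained sketch of batch encoding with padding.
# The toy backend tokenizer and the [PAD] token are stand-ins; with the
# real repo you would load the tokenizer via from_pretrained instead.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

backend = Tokenizer(models.BPE(unk_token="[UNK]"))
backend.pre_tokenizer = pre_tokenizers.Whitespace()
backend.train_from_iterator(
    ["तपाईंलाई कस्तो छ?", "म सञ्चै छु।"],
    trainers.BpeTrainer(vocab_size=200, special_tokens=["[UNK]", "[PAD]"]),
)

tokenizer = PreTrainedTokenizerFast(tokenizer_object=backend, pad_token="[PAD]")

# Calling the tokenizer on a list returns input_ids and an attention_mask;
# padding=True pads every sequence to the longest one in the batch.
batch = tokenizer(["तपाईंलाई कस्तो छ?", "म सञ्चै छु।"], padding=True)
print(batch["input_ids"])
print(batch["attention_mask"])
```

The `attention_mask` marks real tokens with 1 and padding with 0, which downstream models use to ignore the padded positions.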