# BPE Tokenizer for Nepali LLM
This repository contains a Byte Pair Encoding (BPE) tokenizer trained with the Hugging Face `transformers` tooling on the Nepali LLM dataset. The tokenizer is optimized for Nepali text and is intended for language modeling and other natural language processing tasks.
## Overview
- Tokenizer Type: Byte Pair Encoding (BPE)
- Vocabulary Size: 50,000
- Dataset Used: Nepali LLM Datasets
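
The model card does not include the training script, but a BPE tokenizer with these properties can be trained with the Hugging Face `tokenizers` library. The sketch below is a minimal illustration, not the actual training code: the corpus, pre-tokenizer, and special tokens are assumptions, and a real run would stream the full Nepali LLM dataset instead of a two-sentence list.

```python
# Hedged sketch of BPE training with the `tokenizers` library.
# The corpus, pre-tokenizer choice, and special tokens are assumptions;
# only the vocabulary size (50,000) comes from the card above.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=50000,                    # matches the stated vocabulary size
    special_tokens=["[UNK]", "[PAD]"],   # assumed special tokens
)

# Tiny stand-in corpus; the real tokenizer was trained on the Nepali LLM dataset.
corpus = ["तपाईंलाई कस्तो छ?", "म सञ्चै छु।"]
tokenizer.train_from_iterator(corpus, trainer)

print(tokenizer.encode("तपाईंलाई कस्तो छ?").tokens)
```

With so little data the trained vocabulary stays far below 50,000 entries; the `vocab_size` argument is an upper bound that only a large corpus can reach.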
## Installation

To use the tokenizer, install the `transformers` library via pip:

```shell
pip install transformers
```
## Usage

You can load the tokenizer with the following code:

```python
from transformers import PreTrainedTokenizerFast

# Load the tokenizer from the Hugging Face Hub
tokenizer = PreTrainedTokenizerFast.from_pretrained("Aananda-giri/NepaliBPE")

# Example usage: encode Nepali text to token IDs and decode back
text = "तपाईंलाई कस्तो छ?"
tokens = tokenizer.encode(text)
print("Tokens:", tokens)
print("Decoded:", tokenizer.decode(tokens))
```
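
For model training you usually need batched, padded inputs rather than single-sentence `encode` calls. The sketch below shows the batch-encoding API of `PreTrainedTokenizerFast`; to stay self-contained (no Hub download) it wraps a toy in-memory BPE tokenizer, and the `[PAD]` token is an assumption. In practice you would call `from_pretrained("Aananda-giri/NepaliBPE")` as above and use the same `tokenizer(...)` call.

```python
# Hedged, self-contained sketch of batch encoding with padding.
# The toy backend tokenizer and the [PAD] token are stand-ins; with the
# real repo you would load the tokenizer via from_pretrained instead.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

backend = Tokenizer(models.BPE(unk_token="[UNK]"))
backend.pre_tokenizer = pre_tokenizers.Whitespace()
backend.train_from_iterator(
    ["तपाईंलाई कस्तो छ?", "म सञ्चै छु।"],
    trainers.BpeTrainer(vocab_size=200, special_tokens=["[UNK]", "[PAD]"]),
)

tokenizer = PreTrainedTokenizerFast(tokenizer_object=backend, pad_token="[PAD]")

# Calling the tokenizer on a list returns input_ids and an attention_mask;
# padding=True pads every sequence to the longest one in the batch.
batch = tokenizer(["तपाईंलाई कस्तो छ?", "म सञ्चै छु।"], padding=True)
print(batch["input_ids"])
print(batch["attention_mask"])
```

The `attention_mask` marks real tokens with 1 and padding with 0, which downstream models use to ignore the padded positions.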