# NepaliBERT (Phase 1)
|
NepaliBERT is a state-of-the-art language model for Nepali based on the BERT architecture. The model was trained with a masked language modeling (MLM) objective.
|
|
|
# Loading the model and tokenizer |
|
1. Clone the model repo (or load directly from the Hub, as sketched after these steps):
|
```
git lfs install
git clone https://huggingface.co/Rajan/NepaliBERT
```
|
2. Load the tokenizer:
|
```
from transformers import BertTokenizer

vocab_file_dir = './NepaliBERT/'
tokenizer = BertTokenizer.from_pretrained(vocab_file_dir,
                                          strip_accents=False,
                                          clean_text=False)
```
|
3. Load the model:
|
```
from transformers import BertForMaskedLM

model = BertForMaskedLM.from_pretrained('./NepaliBERT')
```
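
If you prefer not to clone the repo locally, the same tokenizer and model should also load straight from the Hugging Face Hub. This is a minimal sketch, assuming the `Rajan/NepaliBERT` repo id taken from the clone URL above:

```
from transformers import BertTokenizer, BertForMaskedLM

# Download from the Hub instead of reading a local clone; the repo id
# is assumed from the clone URL above.
tokenizer = BertTokenizer.from_pretrained('Rajan/NepaliBERT',
                                          strip_accents=False,
                                          clean_text=False)
model = BertForMaskedLM.from_pretrained('Rajan/NepaliBERT')
```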
|
|
|
The easiest way to check whether our language model is learning anything interesting is via the `FillMaskPipeline`.
|
|
|
Pipelines are simple wrappers around tokenizers and models. The `fill-mask` pipeline lets you input a sequence containing a masked token (here, `[MASK]`) and returns a list of the most probable filled sequences, with their probabilities.
|
|
|
```
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model=model,
    tokenizer=tokenizer
)
```
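
A quick sanity check might look like the following; the Nepali sentence and its completions are purely illustrative, not outputs reported by the authors:

```
# Mask one word in a Nepali sentence (illustrative example, roughly
# "Kathmandu is Nepal's [MASK].") and inspect the top predictions.
results = fill_mask("काठमाडौं नेपालको [MASK] हो।")

# Each entry carries the filled-in sequence, the predicted token,
# and the model's probability for that token.
for r in results:
    print(f"{r['sequence']}  (score: {r['score']:.4f})")
```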
|
For more info, visit the [GitHub repo 🤗](https://github.com/R4j4n/NepaliBERT).