---
license: unknown
language:
- si
metrics:
- perplexity
library_name: transformers
tags:
- AshenBerto
- Sinhala
- Roberta
---
|
|
|
|
|
|
|
### 🌟 Overview
|
|
|
This is a slightly smaller model trained on half of the [FastText](https://fasttext.cc/docs/en/crawl-vectors.html) dataset. Since Sinhala is a low-resource language, there’s a noticeable lack of pre-trained models available for it. 😕 This gap makes it harder to represent the language properly in the world of NLP.
|
|
|
But hey, that’s where this model comes in! 🚀 It opens up exciting opportunities to improve tasks like sentiment analysis, machine translation, named entity recognition, and even question answering, tailored just for Sinhala. 🇱🇰✨
|
|
|
---
|
|
|
### 🛠 Model Specs
|
|
|
Here’s what powers this model (we went with [RoBERTa](https://arxiv.org/abs/1907.11692)):
|
|
|
1️⃣ **vocab_size** = 25,000

2️⃣ **max_position_embeddings** = 514

3️⃣ **num_attention_heads** = 12

4️⃣ **num_hidden_layers** = 6

5️⃣ **type_vocab_size** = 1
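For reference, the settings above can be expressed as a `RobertaConfig`. This is just a sketch: every field not listed is assumed to keep its standard RoBERTa default.

```python
from transformers import RobertaConfig

# Rebuild the configuration listed above; unspecified fields
# fall back to the standard RoBERTa defaults.
config = RobertaConfig(
    vocab_size=25_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)
```

With 6 hidden layers instead of RoBERTa-base’s 12, this gives a lighter model that is cheaper to train and run.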
|
🎯 **Perplexity Value**: 3.5
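As a quick refresher, perplexity is the exponential of the average per-token negative log-likelihood, so lower is better. A toy illustration with made-up loss values (not actual evaluation numbers for this model):

```python
import math

# Hypothetical per-token negative log-likelihoods from an evaluation run
nll = [1.2, 0.9, 1.5, 1.4]

# Perplexity = exp(mean NLL); lower means the model is less "surprised"
perplexity = math.exp(sum(nll) / len(nll))
print(round(perplexity, 2))  # 3.49
```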
|
|
|
---
|
|
|
### 🚀 How to Use
|
|
|
You can jump right in and use this model for masked language modeling! 🧩
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Load the model and tokenizer
model = AutoModelForMaskedLM.from_pretrained("ashenR/AshenBERTo")
tokenizer = AutoTokenizer.from_pretrained("ashenR/AshenBERTo")

# Create a fill-mask pipeline
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Try it out with a Sinhala sentence! 🇱🇰
fill_mask("මම ගෙදර <mask>.")
```