---
language:
- ms
- id
tags:
- roberta
- fine-tuned
- transformers
- bert
- masked-language-model
license: apache-2.0
model_type: roberta
metrics:
- accuracy
base_model:
- mesolitica/roberta-base-bahasa-cased
pipeline_tag: fill-mask
---
|
|
|
# Fine-tuned RoBERTa for the Malay Language
|
|
|
This model is a fine-tuned version of `mesolitica/roberta-base-bahasa-cased`, trained on a custom dataset of normalized Malay sentences with a **Masked Language Modeling (MLM)** objective.
|
|
|
## Model Description
|
|
|
This model is based on the **RoBERTa** architecture, a robustly optimized variant of BERT. The base checkpoint was pre-trained on a large corpus of Malay text and then fine-tuned on a specialized dataset of normalized Malay sentences, using the standard masked language modeling objective of predicting randomly masked tokens.
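To make the objective concrete, the following minimal sketch shows how sentences are masked for MLM using the base model's tokenizer and the standard `DataCollatorForLanguageModeling` from `transformers`. This is only an illustration of the masking step, not the actual training script; the example sentence is invented.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Tokenizer of the base checkpoint this model was fine-tuned from.
tokenizer = AutoTokenizer.from_pretrained("mesolitica/roberta-base-bahasa-cased")

# Standard MLM collator: randomly replaces ~15% of the tokens with <mask>
# and keeps labels only for the masked positions (-100 elsewhere).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

# Invented example sentence ("I am going to school today").
encoding = tokenizer("Saya pergi ke sekolah hari ini")
batch = collator([encoding])

print(tokenizer.decode(batch["input_ids"][0]))  # some tokens replaced by <mask>
print(batch["labels"][0])                       # original ids at masked positions, -100 elsewhere
```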
|
|
|
### Training Details
|
|
|
- **Pre-trained Model**: `mesolitica/roberta-base-bahasa-cased`
- **Task**: Masked Language Modeling (MLM)
- **Training Dataset**: Custom dataset of normalized Malay sentences
- **Training Duration**: 3 epochs
- **Batch Size**: 16 per device
- **Learning Rate**: 1e-6
- **Optimizer**: AdamW
- **Evaluation**: Every 200 steps
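As a rough sketch of how these settings map onto the `transformers` `Trainer` API (the dataset below is a tiny invented placeholder; only the hyperparameters are taken from the list above):

```python
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "mesolitica/roberta-base-bahasa-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Placeholder corpus; the real training data is a custom set of normalized Malay sentences.
sentences = ["Saya pergi ke sekolah hari ini.", "Dia suka makan nasi lemak."]
dataset = Dataset.from_dict({"text": sentences}).map(
    lambda x: tokenizer(x["text"], truncation=True, max_length=128),
    remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="roberta-malay-mlm",
    num_train_epochs=3,               # 3 epochs
    per_device_train_batch_size=16,   # batch size 16 per device
    learning_rate=1e-6,               # learning rate 1e-6
    optim="adamw_torch",              # AdamW optimizer
    eval_strategy="steps",            # "evaluation_strategy" in older transformers releases
    eval_steps=200,                   # evaluate every 200 steps
    logging_steps=200,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,            # placeholder: use the real training split here
    eval_dataset=dataset,             # placeholder: use the real validation split here
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True),
)
trainer.train()
```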
|
|
|
## Training and Validation Loss
|
|
|
The following table shows the training and validation loss at selected evaluation steps during fine-tuning:
|
|
|
| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 200  | 0.069000      | 0.069317        |
| 800  | 0.070100      | 0.067430        |
| 1400 | 0.069000      | 0.066185        |
| 2000 | 0.037900      | 0.066657        |
| 2600 | 0.040200      | 0.066858        |
| 3200 | 0.041800      | 0.066634        |
| 3800 | 0.023700      | 0.067717        |
| 4400 | 0.024500      | 0.068275        |
| 5000 | 0.024500      | 0.068108        |
|
|
|
|
|
### Observations

- Training loss decreased substantially over the course of fine-tuning, from about 0.069 at step 200 to about 0.024 by step 5000, with the largest drops between steps 1400 and 2000 and between steps 3200 and 3800.
- Validation loss fluctuated only slightly, reaching its lowest value (0.066185) at step 1400 and remaining relatively stable, between roughly 0.066 and 0.068, thereafter.
- Taken together, the curves indicate that the model converged early and that further training mainly reduced training loss without much change in validation loss.
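For context, an MLM cross-entropy loss can be read as a perplexity over the masked tokens via exp(loss). Assuming the reported validation loss is the mean cross-entropy over masked positions (the `Trainer` default), the best checkpoint corresponds to:

```python
import math

best_val_loss = 0.066185           # validation loss at step 1400, from the table above
print(math.exp(best_val_loss))     # ~1.068 perplexity over the masked tokens
```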
|
|
|
|
|
## Intended Use

This model is intended for tasks such as:

- **Masked Language Modeling (MLM)**: Fill in masked tokens in a Malay sentence (see the example below).
- **Text Generation**: Generate plausible text given a context.
- **Text Understanding**: Extract contextual meaning from Malay sentences.
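A minimal fill-mask inference sketch, assuming the fine-tuned checkpoint is loaded from its published repository id or a local path (the identifier below is a placeholder) and that the tokenizer uses RoBERTa's `<mask>` token:

```python
from transformers import pipeline

# Placeholder identifier for this fine-tuned checkpoint.
fill_mask = pipeline("fill-mask", model="path/to/finetuned-roberta-malay")

# Invented example: "Saya suka makan <mask> goreng." ("I like to eat fried <mask>.")
for prediction in fill_mask("Saya suka makan <mask> goreng."):
    print(prediction["token_str"], round(prediction["score"], 3))
```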
|
|
|
## News

- This model was used in the research paper **"Mitigating Linguistic Bias between Malay and Indonesian Languages using Masked Language Models"**, which has been accepted as a short paper (poster presentation) in the **Research Track** at **DASFAA 2025**.
- **Authors**: Ferdinand Lenchau Bit, Iman Khaleda binti Zamri, Amzine Toushik Wasi, Taki Hasan Rafi, and Dong-Kyu Chae (Department of Computer Science, Hanyang University, Seoul, South Korea).