---
language:
- ms
- id
tags:
- roberta
- fine-tuned
- transformers
- bert
- masked-language-model
license: apache-2.0
model_type: roberta
metrics:
- accuracy
base_model:
- mesolitica/roberta-base-bahasa-cased
pipeline_tag: fill-mask
---

# Fine-tuned RoBERTa on Malay Language

This model is a fine-tuned version of `mesolitica/roberta-base-bahasa-cased`, trained with a **Masked Language Modeling (MLM)** objective on a custom dataset of normalized Malay sentences.

## Model Description

This model is based on the **RoBERTa** architecture, a robustly optimized version of BERT. It was pre-trained on a large corpus of Malay text and then fine-tuned on a specialized dataset of normalized Malay sentences. The fine-tuning task is masked language modeling: predicting randomly masked tokens within each sentence.

### Training Details

- **Pre-trained Model**: `mesolitica/roberta-base-bahasa-cased`
- **Task**: Masked Language Modeling (MLM)
- **Training Dataset**: Custom dataset of Malay sentences
- **Training Duration**: 3 epochs
- **Batch Size**: 16 per device
- **Learning Rate**: 1e-6
- **Optimizer**: AdamW
- **Evaluation**: Every 200 steps

A fine-tuning sketch that reproduces these settings is included at the end of this card.

## Training and Validation Loss

The following table shows the training and validation loss at each recorded evaluation step during fine-tuning:

| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 200  | 0.069000      | 0.069317        |
| 800  | 0.070100      | 0.067430        |
| 1400 | 0.069000      | 0.066185        |
| 2000 | 0.037900      | 0.066657        |
| 2600 | 0.040200      | 0.066858        |
| 3200 | 0.041800      | 0.066634        |
| 3800 | 0.023700      | 0.067717        |
| 4400 | 0.024500      | 0.068275        |
| 5000 | 0.024500      | 0.068108        |

### Observations

- The training loss decreased markedly over the run, from about 0.069 at step 200 to about 0.024 by step 5000, with the largest drops between steps 1400-2000 and 3200-3800.
- The validation loss reached its minimum (about 0.0662) around step 1400 and then fluctuated slightly, staying in a narrow 0.066-0.068 band for the rest of training.
- The flat validation curve alongside the falling training loss indicates that the model converged, with little evidence of severe overfitting by step 5000.

## Intended Use

This model is intended for tasks such as:

- **Masked Language Modeling (MLM)**: filling in masked tokens in a Malay sentence (see the usage example at the end of this card).
- **Text Generation**: generating plausible text for a given context.
- **Text Understanding**: extracting contextual meaning from Malay sentences.

## News

- This model was used in the research paper **"Mitigating Linguistic Bias between Malay and Indonesian Languages using Masked Language Models"**, which has been accepted as a short paper (poster presentation) in the **Research Track** at **DASFAA 2025**.
- **Authors**: Ferdinand Lenchau Bit, Iman Khaleda binti Zamri, Amzine Toushik Wasi, Taki Hasan Rafi, and Dong-Kyu Chae (Department of Computer Science, Hanyang University, Seoul, South Korea)
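
## Usage Example

The following is a minimal fill-mask sketch using the 🤗 Transformers `pipeline` API, illustrating the MLM intended use described above. The model identifier below is a placeholder, not the actual repository name of this checkpoint; replace it with the published repo id or a local path. Using `<mask>` as the mask token assumes the standard RoBERTa tokenizer inherited from the base model.

```python
from transformers import pipeline

# Placeholder identifier; replace with this model's actual Hub repo id or a local path.
fill_mask = pipeline("fill-mask", model="path/to/this-finetuned-roberta")

# RoBERTa tokenizers use `<mask>` as the mask token.
for prediction in fill_mask("Kuala Lumpur ialah ibu <mask> Malaysia."):
    print(prediction["token_str"], round(prediction["score"], 4))
```

The pipeline returns the top candidate tokens for the masked position together with their scores.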
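
## Fine-tuning Sketch

As a rough illustration of the Training Details above, the sketch below wires the listed hyperparameters into the 🤗 Trainer API. This is not the authors' original training script: the toy sentences, the 15% masking probability, the maximum sequence length, and the output directory are assumptions made for illustration only.

```python
from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "mesolitica/roberta-base-bahasa-cased"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# Toy stand-in for the custom dataset of normalized Malay sentences (not released here).
texts = [
    "Saya suka makan nasi lemak pada waktu pagi.",
    "Cuaca di Kuala Lumpur sangat panas hari ini.",
]
raw = Dataset.from_dict({"text": texts})

def tokenize(batch):
    # max_length=128 is an assumption; the card does not state the sequence length used.
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking for MLM; the 15% masking probability is an assumption, not stated in the card.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

# Hyperparameters taken from the Training Details section; AdamW is the Trainer default optimizer.
args = TrainingArguments(
    output_dir="roberta-malay-mlm",   # placeholder output directory
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=1e-6,
    evaluation_strategy="steps",      # renamed to `eval_strategy` in newer transformers releases
    eval_steps=200,
    logging_steps=200,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    eval_dataset=tokenized,           # a proper held-out split should be used in practice
    data_collator=collator,
)
trainer.train()
```

With a real corpus, the toy `texts` list would be replaced by the full set of normalized Malay sentences and a separate validation split.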