---
language:
- ms
- id
tags:
- roberta
- fine-tuned
- transformers
- bert
- masked-language-model
license: apache-2.0
model_type: roberta
metrics:
- accuracy
base_model:
- mesolitica/roberta-base-bahasa-cased
pipeline_tag: fill-mask
---

# Fine-tuned RoBERTa on Malay Language

This model is a fine-tuned version of `mesolitica/roberta-base-bahasa-cased`, trained with a **Masked Language Modeling (MLM)** objective on a custom dataset of normalized Malay sentences.

## Model Description

This model is based on the **RoBERTa** architecture, a robustly optimized variant of BERT. The base checkpoint was pre-trained on a large corpus of Malay text and then fine-tuned on a specialized dataset of normalized Malay sentences. Fine-tuning used the standard masked language modeling objective: randomly selected tokens are masked and the model learns to predict them from their surrounding context.
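
For reference, the base checkpoint named above can be loaded and inspected with the standard `transformers` auto classes. This is a minimal sketch, assuming the checkpoint is available from the Hugging Face Hub; the fine-tuned weights share the same architecture.

```python
from transformers import AutoConfig, AutoModelForMaskedLM, AutoTokenizer

# Base checkpoint named in this card; the fine-tuned model keeps the same architecture.
base_id = "mesolitica/roberta-base-bahasa-cased"

config = AutoConfig.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForMaskedLM.from_pretrained(base_id)

print(config.model_type)                              # "roberta"
print(config.num_hidden_layers, config.hidden_size)   # depth and hidden size
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```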

### Training Details

- **Pre-trained Model**: `mesolitica/roberta-base-bahasa-cased`
- **Task**: Masked Language Modeling (MLM)
- **Training Dataset**: Custom dataset of Malay sentences
- **Training Duration**: 3 epochs
- **Batch Size**: 16 per device
- **Learning Rate**: 1e-6
- **Optimizer**: AdamW
- **Evaluation**: Evaluated on the validation set every 200 steps (see the configuration sketch below)
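
The hyperparameters above map naturally onto the `transformers` `Trainer` API. The snippet below is a minimal sketch of such a setup, not the authors' actual training script: the example sentences, output directory, sequence length, and masking ratio are illustrative assumptions, and the real run used the custom Malay corpus described above.

```python
from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_id = "mesolitica/roberta-base-bahasa-cased"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForMaskedLM.from_pretrained(base_id)

# Tiny illustrative dataset; the actual run used the custom normalized Malay corpus.
sentences = ["Saya suka makan nasi lemak.", "Cuaca hari ini sangat panas."]
raw = Dataset.from_dict({"text": sentences})
tokenized = raw.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
    remove_columns=["text"],
)

# Standard MLM collator: randomly masks tokens in each batch
# (0.15 is the collator's default; the card does not state the masking ratio).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="roberta-malay-mlm",      # illustrative output path
    num_train_epochs=3,                  # 3 epochs
    per_device_train_batch_size=16,      # batch size 16 per device
    learning_rate=1e-6,                  # AdamW is the Trainer's default optimizer
    eval_strategy="steps",               # `evaluation_strategy` in older transformers releases
    eval_steps=200,                      # evaluate every 200 steps
    logging_steps=200,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    eval_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```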

## Training and Validation Loss

The following table shows the training and validation loss at selected evaluation steps during fine-tuning:

| Step  | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 200   | 0.069000      | 0.069317        |
| 800   | 0.070100      | 0.067430        |
| 1400  | 0.069000      | 0.066185        |
| 2000  | 0.037900      | 0.066657        |
| 2600  | 0.040200      | 0.066858        |
| 3200  | 0.041800      | 0.066634        |
| 3800  | 0.023700      | 0.067717        |
| 4400  | 0.024500      | 0.068275        |
| 5000  | 0.024500      | 0.068108        |


### Observations
- The training loss decreased steadily over the course of fine-tuning, dropping from roughly 0.069 at step 200 to roughly 0.024 by step 3800.
- The validation loss reached its minimum (0.066185) around step 1400 and then fluctuated only slightly, staying between about 0.066 and 0.068 for the rest of training.
- The model converged well as training progressed, with the sharpest drop in training loss occurring after the first few thousand steps.


## Intended Use
This model is intended for tasks such as:
- **Masked Language Modeling (MLM)**: Fill in the blanks for masked tokens in a Malay sentence (see the usage example after this list).
- **Text Generation**: Generate plausible text given a context.
- **Text Understanding**: Extract contextual meaning from Malay sentences.
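
Once the fine-tuned weights are published, masked-token prediction can be run with the `fill-mask` pipeline. The repository id below is a placeholder (the card does not state the final model id), and the example sentence is only illustrative.

```python
from transformers import pipeline

# Placeholder repository id; replace with the actual fine-tuned checkpoint.
fill = pipeline("fill-mask", model="your-username/roberta-base-bahasa-cased-finetuned-mlm")

# Use the tokenizer's own mask token (RoBERTa-style tokenizers typically use "<mask>").
mask = fill.tokenizer.mask_token
for pred in fill(f"Saya suka makan nasi {mask}."):
    print(pred["token_str"], round(pred["score"], 4))
```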

## Updates
- This model was used in the research paper **"Mitigating Linguistic Bias between Malay and Indonesian Languages using Masked Language Models"**, which has been accepted as a short paper (poster presentation) in the **Research Track** at **DASFAA 2025**.
- **Authors**: Ferdinand Lenchau Bit, Iman Khaleda binti Zamri, Amzine Toushik Wasi, Taki Hasan Rafi, and Dong-Kyu Chae (Department of Computer Science, Hanyang University, Seoul, South Korea)