---
license: cc-by-sa-4.0
base_model: indobenchmark/indobert-large-p2
language:
- min
- ban
- bug
- id
pretty_name: IndoBERTNusa
tags:
- generated_from_trainer
datasets:
- prosa-text/nusa-dialogue
- indonlp/NusaX-MT
pipeline_tag: fill-mask
---

# IndoBERTNusa (IndoBERT Adapted for Balinese, Buginese, and Minangkabau)

This repository contains a language-adapted and fine-tuned version of the Indobenchmark IndoBERT language model for three languages: Balinese, Buginese, and Minangkabau.
The adaptation was performed using the [nusa-translation](https://huggingface.co/datasets/prosa-text/nusa-translation) dataset.
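
As a quick-start sketch, the adapted checkpoint can be queried through the `fill-mask` pipeline. The repository id below is a placeholder; substitute this model's actual Hub path.

```python
from transformers import pipeline

# Placeholder repo id; replace with this model's actual Hub path.
fill_mask = pipeline("fill-mask", model="prosa-text/indobert-nusa")

# Indonesian example: "The capital of Indonesia is [MASK]."
for prediction in fill_mask("Ibu kota Indonesia adalah [MASK]."):
    print(prediction["token_str"], prediction["score"])
```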

## Model Details

- **Base Model**: [indobenchmark/indobert-large-p2](https://huggingface.co/indobenchmark/indobert-large-p2)
- **Adaptation Data**: [nusa-translation](https://huggingface.co/datasets/prosa-text/nusa-translation) 


## Performance Comparison / Benchmark

### Topic Classification

We tested the model after it was fine-tuned for topic classification using the [nusa-dialogue](https://huggingface.co/datasets/prosa-text/nusa-dialogue) dataset; a minimal fine-tuning sketch follows the results table below.


| Language    | indobert-large-p2 (F1) | indobert-nusa (F1)     |
|-------------|------------------------|------------------------|
| Balinese    | 82.37                  | **84.23**              |
| Buginese    | 80.53                  | **82.03**              |
| Minangkabau | 84.49                  | **86.30**              |
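
A minimal, illustrative fine-tuning sketch for this setup is shown below. The Hub repository id, dataset column names, splits, and label count are assumptions for illustration, not the exact configuration behind the numbers above.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Placeholders: repo id, column names, and label count are illustrative;
# adjust them to the actual nusa-dialogue schema.
model_id = "prosa-text/indobert-nusa"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=8)

dataset = load_dataset("prosa-text/nusa-dialogue")

def tokenize(batch):
    # Assumes the text column is named "text".
    return tokenizer(batch["text"], truncation=True, max_length=128)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="topic-classification", num_train_epochs=3),
    train_dataset=dataset["train"],       # assumes a "train" split
    eval_dataset=dataset["validation"],   # assumes a "validation" split
    tokenizer=tokenizer,  # enables dynamic padding via DataCollatorWithPadding
)
trainer.train()
```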


### Language Identification

We also tested the model after it was fine-tuned for language identification using the [NusaX](https://github.com/IndoNLP/nusax) dataset.

| Model                | F1-score     |
|----------------------|--------------|
| indobert-large-p2    | 98.21        |
| **indobert-nusa**    | **98.45**    |


## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3.0
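
These settings map onto `transformers.TrainingArguments` roughly as follows (a sketch; `output_dir` is a placeholder):

```python
from transformers import TrainingArguments

# Minimal sketch of the hyperparameters listed above; output_dir is a placeholder.
training_args = TrainingArguments(
    output_dir="./indobert-nusa",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=3.0,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```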


### Framework versions

- Transformers 4.33.1
- Pytorch 2.1.2+cu121
- Datasets 2.16.1
- Tokenizers 0.13.3


## Additional Information

### Licensing Information
The dataset is released under the terms of **CC-BY-SA 4.0**.
By using this model, you are also bound by the respective Terms of Use and License of the dataset.
For commercial use in small businesses and startups, please contact us ([email protected]) for permission to use the datasets by providing your company profile and a description of the intended usage.

### Acknowledgement
This research work is funded and supported by the Deutsche Gesellschaft für Internationale Zusammenarbeit (GIZ) GmbH and FAIR Forward - Artificial Intelligence for All. We thank Direktorat Jenderal Pendidikan Tinggi, Riset, dan Teknologi Kementerian Pendidikan, Kebudayaan, Riset, dan Teknologi (Ditjen DIKTI) for providing the computing resources for this project.

### Contact Us
If you have any questions, please contact our support team at `[email protected]`.