license: cc-by-sa-4.0
base_model: indobenchmark/indobert-large-p2
tags:
- generated_from_trainer
datasets:
- prosa-text/nusa-dialogue
- indonlp/NusaX-MT
IndoBERT-nusa (IndoBERT Adapted for Balinese, Buginese, and Minangkabau)
This repository contains a language-adapted and fine-tuned version of the Indobenchmark IndoBERT language model for three languages: Balinese, Buginese, and Minangkabau. The adaptation was performed on nusa-st data.
Model Details
- Base Model: indobenchmark/indobert-large-p2
- Adaptation Data: nusa-st
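For quick experimentation, the sketch below shows how a checkpoint like this could be loaded with the Hugging Face transformers library. The repository id is a placeholder, not the published id of this model.

```python
from transformers import AutoModel, AutoTokenizer

# Placeholder Hub id; replace it with the actual repository id of this model.
model_id = "your-org/indobert-nusa"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Encode a sentence and inspect the contextual embeddings from the last layer.
inputs = tokenizer("Apa kabar?", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```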
Performance Comparison / Benchmark
Topic Classification
We evaluated the model after fine-tuning it for topic classification on the nusa-dialogue dataset.
| Language | indobert-large-p2 (F1) | indobert-nusa (F1) |
|---|---|---|
| Balinese | 82.37 | 84.23 |
| Buginese | 80.53 | 82.03 |
| Minangkabau | 84.49 | 86.30 |
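As an illustration only, here is a minimal sketch of how such a topic-classification fine-tune and F1 evaluation could be set up with transformers and datasets. The dataset configuration name, the text/label column names, the test split, and macro averaging are assumptions rather than details taken from the experiments above; the same recipe applies to the adapted checkpoint by swapping the model id.

```python
import numpy as np
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Assumed dataset configuration and column names; check the nusa-dialogue dataset card.
dataset = load_dataset("prosa-text/nusa-dialogue", "bali")
num_labels = dataset["train"].features["label"].num_classes  # assumes a ClassLabel column

tokenizer = AutoTokenizer.from_pretrained("indobenchmark/indobert-large-p2")
model = AutoModelForSequenceClassification.from_pretrained(
    "indobenchmark/indobert-large-p2", num_labels=num_labels)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # Macro-averaged F1, reported as a percentage.
    return {"f1": f1_score(labels, preds, average="macro") * 100}

tokenized = dataset.map(tokenize, batched=True)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="indobert-nusa-topic", num_train_epochs=3,
                           per_device_train_batch_size=8, learning_rate=5e-5),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],  # assumed split name
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())
```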
Language Identification
We also evaluated the model after fine-tuning it for language identification on the NusaX dataset.
| Model | F1-score |
|---|---|
| indobert-large-p2 | 98.21 |
| indobert-nusa | 98.45 |
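For inference, a fine-tuned language-identification checkpoint can be used through the standard text-classification pipeline, as sketched below; the checkpoint path and the example output label are hypothetical.

```python
from transformers import pipeline

# Hypothetical path to a checkpoint fine-tuned on NusaX for language identification.
lang_id = pipeline("text-classification", model="path/to/indobert-nusa-langid")

print(lang_id("Apa kabar?"))
# e.g. [{'label': 'ind', 'score': 0.99}] -- the label set depends on the fine-tuning data
```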
Training procedure
Training hyperparameters
The following hyperparameters were used during training (see the TrainingArguments sketch after this list):
- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3.0
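For convenience, these values map onto a transformers TrainingArguments configuration roughly as sketched below; the output directory is illustrative, and the Adam settings shown are the library defaults.

```python
from transformers import TrainingArguments

# Mirrors the hyperparameters listed above; output_dir is illustrative.
training_args = TrainingArguments(
    output_dir="indobert-nusa-finetune",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    adam_beta1=0.9,          # Adam betas=(0.9, 0.999)
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    num_train_epochs=3.0,
)
```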
Framework versions
- Transformers 4.33.1
- Pytorch 2.1.2+cu121
- Datasets 2.16.1
- Tokenizers 0.13.3
Additional Information
Licensing Information
The dataset is released under the terms of CC-BY-SA 4.0. By using this model, you are also bound by the respective Terms of Use and License of the dataset.
Citation Information
@article{purwarianti2023nusadialogue,
title={NusaDialogue: Dialogue Summarization and Generation for Underrepresented and Extremely Low-Resource Languages},
author={Purwarianti, Ayu and Adhista, Dea and Baptiso, Agung and Mahfuzh, Miftahul and Yusrina Sabila and Cahyawijaya, Samuel and Aji, Alham Fikri},
journal={arXiv preprint arXiv:(coming soon)},
url={https://huggingface.co/datasets/prosa-text/nusa-dialogue},
year={2023}
}
Acknowledgement
This research work is funded and supported by The Deutsche Gesellschaft für Internationale Zusammenarbeit (GIZ) GmbH and FAIR Forward - Artificial Intelligence for all. We thank Direktorat Jenderal Pendidikan Tinggi, Riset, dan Teknologi Kementerian Pendidikan, Kebudayaan, Riset, dan Teknologi (Ditjen DIKTI) for providing the computing resources for this project.
Contact Us
If you have any questions, please contact our support team at [email protected].