metadata

license: mit
language:
  - id
  - en
metrics:
  - accuracy
  - recall
  - precision
  - confusion_matrix
pipeline_tag: text-classification
tags:
  - presidential election
  - indonesia
  - multiclass

Fine-tuned DistilBERT Model for Indonesian Text Classification

Overview

This repository contains a fine-tuned version of DistilBERT (based on cahya/distilbert-base-indonesian) for Indonesian text classification. The model classifies text into eight categories: politics, socio-cultural, defense and security, ideology, economy, natural resources, demography, and geography.

Dataset

The training dataset was augmented and rebalanced to reduce class imbalance. Politik remains the largest class after augmentation (2969 of 5090 samples). The tables below list the per-category counts before and after augmentation:

Before Augmentation

Category	Count
Politik	2972
Sosial Budaya	587
Pertahanan dan Keamanan	400
Ideologi	400
Ekonomi	367
Sumber Daya Alam	192
Demografi	62
Geografi	20

After Augmentation

Category	Count
Politik	2969
Demografi	427
Sosial Budaya	422
Ideologi	343
Pertahanan dan Keamanan	331
Ekonomi	309
Sumber Daya Alam	156
Geografi	133
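
To make the effect of augmentation concrete, the following minimal sketch computes each category's share and the largest-to-smallest class ratio from the two tables above. The variable and function names are illustrative and are not taken from the original training code.

```python
# Per-category counts copied from the "Before Augmentation" and
# "After Augmentation" tables.
before = {
    "Politik": 2972, "Sosial Budaya": 587, "Pertahanan dan Keamanan": 400,
    "Ideologi": 400, "Ekonomi": 367, "Sumber Daya Alam": 192,
    "Demografi": 62, "Geografi": 20,
}
after = {
    "Politik": 2969, "Demografi": 427, "Sosial Budaya": 422,
    "Ideologi": 343, "Pertahanan dan Keamanan": 331, "Ekonomi": 309,
    "Sumber Daya Alam": 156, "Geografi": 133,
}


def imbalance_ratio(counts):
    """Ratio of the largest class to the smallest class."""
    return max(counts.values()) / min(counts.values())


def share(counts, label):
    """Fraction of all samples that belong to `label`."""
    return counts[label] / sum(counts.values())


for name, counts in (("before", before), ("after", after)):
    print(
        f"{name}: total={sum(counts.values())}, "
        f"max/min ratio={imbalance_ratio(counts):.1f}, "
        f"Politik share={share(counts, 'Politik'):.1%}"
    )
```

On these counts the largest-to-smallest ratio drops from roughly 149:1 to roughly 22:1, while Politik still accounts for more than half of the samples.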

Label Encoding

Encoded	Label
0	Demografi
1	Ekonomi
2	Geografi
3	Ideologi
4	Pertahanan dan Keamanan
5	Politik
6	Sosial Budaya
7	Sumber Daya Alam
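
The IDs above follow the alphabetical order of the category names, which is what a sorted-label encoder would produce. Whether the original training code used such an encoder (for example scikit-learn's LabelEncoder) is an assumption. A minimal sketch that rebuilds the mapping in plain Python:

```python
# Category names as they appear in the dataset tables (order does not matter here).
CATEGORIES = [
    "Politik", "Sosial Budaya", "Pertahanan dan Keamanan", "Ideologi",
    "Ekonomi", "Sumber Daya Alam", "Demografi", "Geografi",
]

# Sorting alphabetically and enumerating reproduces the encoding table.
id2label = dict(enumerate(sorted(CATEGORIES)))
label2id = {label: idx for idx, label in id2label.items()}

for idx, label in id2label.items():
    print(idx, label)
```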

Data Split

The dataset was split into training and testing sets with an 85:15 ratio.

Train Size: 4326 samples
Test Size: 764 samples
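
The reported sizes are consistent with a 15% test fraction of the 5090 augmented samples, rounded up to a whole sample. The sketch below reproduces that arithmetic only. The helper name is hypothetical, and the actual splitting code, random seed, and any stratification are not given in this document.

```python
import math


def split_sizes(n_samples, test_fraction):
    """Return (train_size, test_size), rounding the test set up."""
    n_test = math.ceil(n_samples * test_fraction)
    return n_samples - n_test, n_test


total_samples = 5090  # sum of the "After Augmentation" counts
train_size, test_size = split_sizes(total_samples, 0.15)
print(train_size, test_size)
```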

Model Training

The model was trained for 4 epochs, achieving the following results:

Epoch	Train Loss	Train Accuracy
1	1.0240	0.6766
2	0.5615	0.8220
3	0.3270	0.9014
4	0.1759	0.9481

Test Results

Test Loss: 0.7948
Test Accuracy: 0.7687
Test Balanced Accuracy: 0.7001

Model Evaluation

The model was evaluated on the test set using precision, recall, and F1 score. The aggregate values below agree with the weighted averages in the classification report:

Precision Score: 0.7714
Recall Score: 0.7696
F1 Score: 0.7697

Classification Report

Category	Precision	Recall	F1-Score	Support
Demografi	0.94	0.91	0.92	64
Ekonomi	0.67	0.72	0.69	46
Geografi	0.95	0.95	0.95	20
Ideologi	0.71	0.56	0.62	52
Pertahanan dan Keamanan	0.69	0.66	0.67	50
Politik	0.84	0.85	0.84	446
Sosial Budaya	0.38	0.40	0.39	63
Sumber Daya Alam	0.50	0.57	0.53	23

Accuracy: 0.7696
Balanced Accuracy: 0.7001
Macro Avg Precision: 0.71
Macro Avg Recall: 0.70
Macro Avg F1-Score: 0.70
Weighted Avg Precision: 0.77
Weighted Avg Recall: 0.77
Weighted Avg F1-Score: 0.77
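
As a cross-check, the macro and weighted averages can be recomputed from the per-category rows of the classification report. This is a minimal sketch in plain Python; the dictionary layout and function names are illustrative.

```python
# (precision, recall, f1, support) per category, copied from the report above.
report = {
    "Demografi": (0.94, 0.91, 0.92, 64),
    "Ekonomi": (0.67, 0.72, 0.69, 46),
    "Geografi": (0.95, 0.95, 0.95, 20),
    "Ideologi": (0.71, 0.56, 0.62, 52),
    "Pertahanan dan Keamanan": (0.69, 0.66, 0.67, 50),
    "Politik": (0.84, 0.85, 0.84, 446),
    "Sosial Budaya": (0.38, 0.40, 0.39, 63),
    "Sumber Daya Alam": (0.50, 0.57, 0.53, 23),
}

total_support = sum(row[3] for row in report.values())


def macro_avg(index):
    """Unweighted mean of one metric across categories."""
    return sum(row[index] for row in report.values()) / len(report)


def weighted_avg(index):
    """Mean of one metric weighted by each category's support."""
    return sum(row[index] * row[3] for row in report.values()) / total_support


for name, index in (("precision", 0), ("recall", 1), ("f1", 2)):
    print(f"{name}: macro={macro_avg(index):.2f} weighted={weighted_avg(index):.2f}")
```

Because the inputs are rounded to two decimals, the recomputed values only match the reported ones to about two decimal places. Macro-averaged recall is the same quantity as balanced accuracy, which is why both come out close to 0.70.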

Usage

Load the model with the Hugging Face Transformers library, either through a pipeline or directly:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="Rendika/Trained-DistilBERT-Indonesia-Presidential-Election-Balanced-Dataset")

# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("Rendika/Trained-DistilBERT-Indonesia-Presidential-Election-Balanced-Dataset")
model = AutoModelForSequenceClassification.from_pretrained("Rendika/Trained-DistilBERT-Indonesia-Presidential-Election-Balanced-Dataset")
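
A text-classification pipeline returns a list of dictionaries with "label" and "score" keys. If the model configuration does not define category names, the labels come back in the generic form LABEL_0 ... LABEL_7. The sketch below maps such labels to the categories from the Label Encoding table. The helper names decode_label and classify are hypothetical, and whether this particular checkpoint needs the mapping is an assumption.

```python
# Mapping copied from the Label Encoding table.
ID2LABEL = {
    0: "Demografi", 1: "Ekonomi", 2: "Geografi", 3: "Ideologi",
    4: "Pertahanan dan Keamanan", 5: "Politik", 6: "Sosial Budaya",
    7: "Sumber Daya Alam",
}

MODEL_ID = "Rendika/Trained-DistilBERT-Indonesia-Presidential-Election-Balanced-Dataset"


def decode_label(raw_label):
    """Map 'LABEL_5' style names to category names; pass other names through."""
    prefix = "LABEL_"
    suffix = raw_label[len(prefix):]
    if raw_label.startswith(prefix) and suffix.isdigit():
        return ID2LABEL[int(suffix)]
    return raw_label


def classify(texts, model_id=MODEL_ID):
    """Classify a list of strings and return (category, score) pairs."""
    # Imported lazily so decode_label works without transformers installed.
    from transformers import pipeline

    pipe = pipeline("text-classification", model=model_id)
    return [(decode_label(r["label"]), r["score"]) for r in pipe(texts)]
```

Calling classify(["Debat capres membahas kebijakan pertahanan."]) with any Indonesian sentence of your own would download the model on first use and return one (category, score) pair per input.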

Conclusion

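This fine-tuned DistilBERT model reaches about 77% accuracy and a balanced accuracy of 0.70 on the test set. Performance varies considerably by category: Demografi and Geografi reach F1 scores of 0.92 and 0.95, while Sosial Budaya (0.39) and Sumber Daya Alam (0.53) remain weak. The dataset was augmented and rebalanced with the aim of improving generalization on under-represented categories.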

Feel free to use this model for your Indonesian text classification tasks, and don't hesitate to reach out if you have any questions or feedback.