IndoBERTweet Multiclass Classifier Trained on a Balanced, Augmented Dataset

Overview

This repository contains an IndoBERTweet model fine-tuned for multiclass text classification in Indonesian. The model was trained and evaluated on an augmented, rebalanced dataset covering eight labels: Politics, Social Culture, Defense and Security, Ideology, Economy, Natural Resources, Demography, and Geography.

Dataset Information

Before Augmentation/Balancing

Label                 Count
Politics               2972
Social Culture          587
Defense and Security    400
Ideology                400
Economy                 367
Natural Resources       192
Demography               62
Geography                20

After Augmentation/Balancing

Label                 Count
Politics               2969
Demography              427
Social Culture          422
Ideology                343
Defense and Security    331
Economy                 309
Natural Resources       156
Geography               133
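
The effect of rebalancing can be checked with a little arithmetic. The sketch below copies the counts from the two tables above and reports the total, the share of the largest class, and the largest-to-smallest ratio. The helper name `summarize` is illustrative and not part of this repository.

```python
# Class counts copied from the two tables above.
before = {
    "Politics": 2972, "Social Culture": 587, "Defense and Security": 400,
    "Ideology": 400, "Economy": 367, "Natural Resources": 192,
    "Demography": 62, "Geography": 20,
}
after = {
    "Politics": 2969, "Demography": 427, "Social Culture": 422,
    "Ideology": 343, "Defense and Security": 331, "Economy": 309,
    "Natural Resources": 156, "Geography": 133,
}


def summarize(counts):
    """Return (total, share of the largest class, largest/smallest ratio)."""
    total = sum(counts.values())
    largest = max(counts.values())
    smallest = min(counts.values())
    return total, largest / total, largest / smallest


for name, counts in (("before", before), ("after", after)):
    total, majority_share, ratio = summarize(counts)
    print(f"{name}: total={total}, majority share={majority_share:.1%}, "
          f"largest/smallest={ratio:.1f}")
```

The totals are 5000 before and 5090 after. The largest-to-smallest ratio drops from about 149:1 to about 22:1, while Politics still accounts for roughly 58% of all samples.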

Label Encoding

Encoded ID  Label
0           Demography
1           Economy
2           Geography
3           Ideology
4           Defense and Security
5           Politics
6           Social Culture
7           Natural Resources
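
For downstream code it can help to keep this encoding as a pair of lookup tables. The following is a minimal sketch; the names `ID2LABEL`, `LABEL2ID`, `encode`, and `decode` are illustrative and do not come from this repository.

```python
# ID-to-label mapping copied from the Label Encoding table above.
ID2LABEL = {
    0: "Demography",
    1: "Economy",
    2: "Geography",
    3: "Ideology",
    4: "Defense and Security",
    5: "Politics",
    6: "Social Culture",
    7: "Natural Resources",
}

# Reverse mapping, built from the forward one so the two cannot drift apart.
LABEL2ID = {label: class_id for class_id, label in ID2LABEL.items()}


def decode(class_id: int) -> str:
    """Turn a predicted class ID into its category name."""
    return ID2LABEL[class_id]


def encode(label: str) -> int:
    """Turn a category name into its class ID."""
    return LABEL2ID[label]


print(decode(5), encode("Natural Resources"))
```

Whether the published checkpoint already stores these names in its configuration is not stated here, so keeping the mapping next to the inference code is the safer assumption.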

Data Split

  • Train Size: 4326 samples (85%)
  • Test Size: 764 samples (15%)
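
The split sizes follow from the 5090 samples in the rebalanced dataset. The sketch below reproduces them under the assumption that the 15% test share is rounded up. The source does not say which tool produced the split, and `split_sizes` is an illustrative helper.

```python
TOTAL_SAMPLES = 5090  # sum of the counts in the rebalanced dataset
TEST_PERCENT = 15     # test share stated in the Data Split section


def split_sizes(total: int, test_percent: int) -> tuple[int, int]:
    """Return (train_size, test_size), rounding the test share up.

    Rounding up is an assumption; it is the choice that reproduces
    the 4326 / 764 figures reported above.
    """
    # Integer ceiling division avoids floating-point rounding surprises.
    test_size = -(-total * test_percent // 100)
    return total - test_size, test_size


train_size, test_size = split_sizes(TOTAL_SAMPLES, TEST_PERCENT)
print(train_size, test_size)
```

15% of 5090 is 763.5, so rounding up gives 764 test samples and leaves 4326 for training.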

Model Training Log

Epoch 1/4

  • Train Loss: 1.0651 | Train Accuracy: 0.6700
  • Test Loss: 0.8339 | Test Accuracy: 0.7313

Epoch 2/4

  • Train Loss: 0.6496 | Train Accuracy: 0.7879
  • Test Loss: 0.6988 | Test Accuracy: 0.7717

Epoch 3/4

  • Train Loss: 0.4223 | Train Accuracy: 0.8736
  • Test Loss: 0.7308 | Test Accuracy: 0.7704

Epoch 4/4

  • Train Loss: 0.2764 | Train Accuracy: 0.9150
  • Test Loss: 0.7615 | Test Accuracy: 0.7826

Training Completed
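
The log can be read more easily once the per-epoch numbers are side by side. The sketch below copies the values from the log above and reports the epoch with the lowest test loss, the epoch with the highest test accuracy, and the gap between train and test accuracy. The variable names are illustrative.

```python
# (epoch, train_loss, train_acc, test_loss, test_acc) copied from the log above.
LOG = [
    (1, 1.0651, 0.6700, 0.8339, 0.7313),
    (2, 0.6496, 0.7879, 0.6988, 0.7717),
    (3, 0.4223, 0.8736, 0.7308, 0.7704),
    (4, 0.2764, 0.9150, 0.7615, 0.7826),
]

# Epoch with the lowest test loss and epoch with the highest test accuracy.
best_loss_epoch = min(LOG, key=lambda row: row[3])[0]
best_acc_epoch = max(LOG, key=lambda row: row[4])[0]

# Train-minus-test accuracy per epoch; a growing gap suggests overfitting.
gaps = {epoch: train_acc - test_acc for epoch, _, train_acc, _, test_acc in LOG}

print("lowest test loss at epoch", best_loss_epoch)
print("highest test accuracy at epoch", best_acc_epoch)
for epoch, gap in gaps.items():
    print(f"epoch {epoch}: accuracy gap {gap:+.4f}")
```

Test loss is lowest at epoch 2 and rises afterwards while train loss keeps falling, and the accuracy gap widens to about 0.13 by epoch 4. Test accuracy still peaks at epoch 4.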

Model Evaluation

  • Precision Score (weighted): 0.7836
  • Recall Score (weighted): 0.7827
  • F1 Score (weighted): 0.7820

Classification Report

Label                 Precision  Recall  F1-Score  Support
Demography                 0.90    0.94      0.92       64
Economy                    0.70    0.67      0.69       46
Geography                  0.95    0.90      0.92       20
Ideology                   0.72    0.56      0.63       52
Defense and Security       0.73    0.66      0.69       50
Politics                   0.84    0.86      0.85      446
Social Culture             0.43    0.48      0.45       63
Natural Resources          0.61    0.61      0.61       23

  • Accuracy Score: 0.7827
  • Balanced Accuracy Score: 0.7091
  • Macro Average: 0.74 (Precision), 0.71 (Recall), 0.72 (F1-Score)
  • Weighted Average: 0.78 (Precision), 0.78 (Recall), 0.78 (F1-Score)
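
The macro and weighted averages can be recomputed from the per-class rows as a sanity check. The sketch below works from the two-decimal values in the report, so results agree with the reported figures only up to rounding. The helper names are illustrative.

```python
# (label, precision, recall, f1, support) copied from the classification report.
REPORT = [
    ("Demography", 0.90, 0.94, 0.92, 64),
    ("Economy", 0.70, 0.67, 0.69, 46),
    ("Geography", 0.95, 0.90, 0.92, 20),
    ("Ideology", 0.72, 0.56, 0.63, 52),
    ("Defense and Security", 0.73, 0.66, 0.69, 50),
    ("Politics", 0.84, 0.86, 0.85, 446),
    ("Social Culture", 0.43, 0.48, 0.45, 63),
    ("Natural Resources", 0.61, 0.61, 0.61, 23),
]
PRECISION, RECALL, F1, SUPPORT = 1, 2, 3, 4  # column indices


def macro_average(rows, column):
    """Unweighted mean over classes: every class counts equally."""
    return sum(row[column] for row in rows) / len(rows)


def weighted_average(rows, column):
    """Mean over classes weighted by support (number of test samples)."""
    total = sum(row[SUPPORT] for row in rows)
    return sum(row[column] * row[SUPPORT] for row in rows) / total


for name, column in (("precision", PRECISION), ("recall", RECALL), ("f1", F1)):
    print(f"{name}: macro={macro_average(REPORT, column):.3f}, "
          f"weighted={weighted_average(REPORT, column):.3f}")
```

Macro recall is the mean of the per-class recalls, which is how balanced accuracy is defined, so the 0.71 macro recall and the 0.7091 balanced accuracy describe the same quantity. The gap between the macro and weighted figures reflects the 446 Politics samples dominating the test set.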

How to Use the Model

Load the model with the Hugging Face transformers library, either through the high-level pipeline helper or by loading the tokenizer and model directly.

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="Rendika/Trained-indobertweet-balanced-dataset")

# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("Rendika/Trained-indobertweet-balanced-dataset")
model = AutoModelForSequenceClassification.from_pretrained("Rendika/Trained-indobertweet-balanced-dataset")
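
A pipeline call returns a list of dictionaries with `label` and `score` keys. The sketch below wraps the pipeline and converts labels to category names, assuming the checkpoint may report default names such as `LABEL_5`; names that do not match that pattern are passed through unchanged. `readable_label` and `classify` are illustrative helpers, not part of this repository.

```python
# Category names in encoded-ID order, copied from the Label Encoding section.
LABELS = [
    "Demography", "Economy", "Geography", "Ideology",
    "Defense and Security", "Politics", "Social Culture", "Natural Resources",
]


def readable_label(raw_label: str) -> str:
    """Map a default name like 'LABEL_5' to its category; pass others through."""
    prefix = "LABEL_"
    suffix = raw_label[len(prefix):]
    if raw_label.startswith(prefix) and suffix.isdigit():
        class_id = int(suffix)
        if class_id < len(LABELS):
            return LABELS[class_id]
    return raw_label


def classify(texts):
    """Return (category, score) pairs for a list of strings.

    Downloads the model on first use, so it needs network access.
    """
    # Imported lazily so readable_label works without transformers installed.
    from transformers import pipeline

    pipe = pipeline(
        "text-classification",
        model="Rendika/Trained-indobertweet-balanced-dataset",
    )
    return [(readable_label(r["label"]), r["score"]) for r in pipe(texts)]
```

Calling `classify(["..."])` with a list of Indonesian sentences returns one `(category, score)` pair per input.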

Conclusion

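This IndoBERTweet model is fine-tuned on a rebalanced dataset to improve its performance across categories. On the held-out test set it reaches 0.78 accuracy and 0.72 macro F1. Per-class F1 ranges from 0.92 (Demography, Geography) down to 0.45 (Social Culture), so predictions for the weaker classes deserve more caution. The model is intended for classifying Indonesian-language text into the eight categories listed above.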

Feel free to use and contribute to this repository. For any issues or suggestions, please open an issue on the repository or contact the maintainer.


Model size: 111M params (Safetensors, tensor type F32)