	---
	license: mit
	language:
	- id
	- en
	metrics:
	- accuracy
	- recall
	- precision
	- confusion_matrix
	pipeline_tag: text-classification
	tags:
	- presidential election
	- indonesia
	- multiclass
	---



	# Fine-tuned DistilBERT Model for Indonesian Text Classification

	## Overview

	This repository contains a fine-tuned DistilBERT model (based on [cahya/distilbert-base-indonesian](https://huggingface.co/cahya/distilbert-base-indonesian)) for Indonesian text classification. The model classifies text into eight categories: politics, socio-cultural, defense and security, ideology, economy, natural resources, demography, and geography.

	## Dataset

	The training data was augmented and rebalanced to reduce class imbalance. The class counts before and after augmentation are listed below:

	### Before Augmentation
	| Category                | Count |
	|-------------------------|-------|
	| Politik                 | 2972  |
	| Sosial Budaya           | 587   |
	| Pertahanan dan Keamanan | 400   |
	| Ideologi                | 400   |
	| Ekonomi                 | 367   |
	| Sumber Daya Alam        | 192   |
	| Demografi               | 62    |
	| Geografi                | 20    |

	### After Augmentation
	| Category                | Count |
	|-------------------------|-------|
	| Politik                 | 2969  |
	| Demografi               | 427   |
	| Sosial Budaya           | 422   |
	| Ideologi                | 343   |
	| Pertahanan dan Keamanan | 331   |
	| Ekonomi                 | 309   |
	| Sumber Daya Alam        | 156   |
	| Geografi                | 133   |
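
To quantify how much the augmentation changed the class balance, the counts from the two tables above can be compared directly. The following is a minimal sketch; the names `before`, `after`, and `imbalance_ratio` are illustrative and are not taken from the original training code.

```python
# Class counts copied from the "Before Augmentation" and "After Augmentation" tables.
before = {
    "Politik": 2972,
    "Sosial Budaya": 587,
    "Pertahanan dan Keamanan": 400,
    "Ideologi": 400,
    "Ekonomi": 367,
    "Sumber Daya Alam": 192,
    "Demografi": 62,
    "Geografi": 20,
}
after = {
    "Politik": 2969,
    "Demografi": 427,
    "Sosial Budaya": 422,
    "Ideologi": 343,
    "Pertahanan dan Keamanan": 331,
    "Ekonomi": 309,
    "Sumber Daya Alam": 156,
    "Geografi": 133,
}


def imbalance_ratio(counts):
    """Return the size of the largest class divided by the smallest class."""
    return max(counts.values()) / min(counts.values())


print(sum(before.values()), round(imbalance_ratio(before), 1))  # 5000 148.6
print(sum(after.values()), round(imbalance_ratio(after), 1))    # 5090 22.3
```

By this measure the imbalance drops from roughly 149:1 to roughly 22:1, so the augmented dataset is more balanced, but Politik still dominates.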

	## Label Encoding

	| Encoded | Label                   |
	|---------|-------------------------|
	| 0       | Demografi               |
	| 1       | Ekonomi                 |
	| 2       | Geografi                |
	| 3       | Ideologi                |
	| 4       | Pertahanan dan Keamanan |
	| 5       | Politik                 |
	| 6       | Sosial Budaya           |
	| 7       | Sumber Daya Alam        |
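
The encoded values follow the alphabetical order of the label names. This is consistent with what a label encoder such as scikit-learn's `LabelEncoder` produces, although whether the author used that class is an assumption. A minimal sketch that rebuilds the mapping in plain Python (the names `id2label` and `label2id` are illustrative):

```python
# Category names as they appear in the dataset tables, in arbitrary order.
categories = [
    "Politik",
    "Sosial Budaya",
    "Pertahanan dan Keamanan",
    "Ideologi",
    "Ekonomi",
    "Sumber Daya Alam",
    "Demografi",
    "Geografi",
]

# Sorting alphabetically and enumerating reproduces the encoding in the table.
id2label = dict(enumerate(sorted(categories)))
label2id = {label: idx for idx, label in id2label.items()}

for idx, label in id2label.items():
    print(idx, label)
```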

	## Data Split

	The augmented dataset (5090 samples) was split into training and test sets at an 85:15 ratio.

	- Train Size: 4326 samples
	- Test Size: 764 samples

	## Model Training

	The model was trained for 4 epochs, achieving the following results:

	| Epoch | Train Loss | Train Accuracy |
	|-------|------------|----------------|
	| 1     | 1.0240     | 0.6766         |
	| 2     | 0.5615     | 0.8220         |
	| 3     | 0.3270     | 0.9014         |
	| 4     | 0.1759     | 0.9481         |

	### Training Completion
	- Test Loss: 0.7948
	- Test Accuracy: 0.7687
	- Test Balanced Accuracy: 0.7001

	## Model Evaluation

	The model was evaluated on the test set using precision, recall, and F1. The overall scores below agree, to two decimals, with the weighted averages in the classification report:

	- Precision Score: 0.7714
	- Recall Score: 0.7696
	- F1 Score: 0.7697

	### Classification Report

	| Category                | Precision | Recall | F1-Score | Support |
	|-------------------------|-----------|--------|----------|---------|
	| Demografi               | 0.94      | 0.91   | 0.92     | 64      |
	| Ekonomi                 | 0.67      | 0.72   | 0.69     | 46      |
	| Geografi                | 0.95      | 0.95   | 0.95     | 20      |
	| Ideologi                | 0.71      | 0.56   | 0.62     | 52      |
	| Pertahanan dan Keamanan | 0.69      | 0.66   | 0.67     | 50      |
	| Politik                 | 0.84      | 0.85   | 0.84     | 446     |
	| Sosial Budaya           | 0.38      | 0.40   | 0.39     | 63      |
	| Sumber Daya Alam        | 0.50      | 0.57   | 0.53     | 23      |

	- Accuracy: 0.7696
	- Balanced Accuracy: 0.7001
	- Macro Avg Precision: 0.71
	- Macro Avg Recall: 0.70
	- Macro Avg F1-Score: 0.70
	- Weighted Avg Precision: 0.77
	- Weighted Avg Recall: 0.77
	- Weighted Avg F1-Score: 0.77
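
The macro and weighted averages can be recomputed from the per-class rows of the report. Macro averaging gives every class equal weight, weighted averaging weights each class by its support, and balanced accuracy is the mean of the per-class recalls. This is a minimal sketch with illustrative names (`report`, `macro_avg`, `weighted_avg`); because it starts from values rounded to two decimals, it can only reproduce the reported figures approximately.

```python
# (precision, recall, f1, support) per class, copied from the classification report.
report = {
    "Demografi": (0.94, 0.91, 0.92, 64),
    "Ekonomi": (0.67, 0.72, 0.69, 46),
    "Geografi": (0.95, 0.95, 0.95, 20),
    "Ideologi": (0.71, 0.56, 0.62, 52),
    "Pertahanan dan Keamanan": (0.69, 0.66, 0.67, 50),
    "Politik": (0.84, 0.85, 0.84, 446),
    "Sosial Budaya": (0.38, 0.40, 0.39, 63),
    "Sumber Daya Alam": (0.50, 0.57, 0.53, 23),
}

total_support = sum(row[3] for row in report.values())


def macro_avg(col):
    """Unweighted mean of one metric column across classes."""
    return sum(row[col] for row in report.values()) / len(report)


def weighted_avg(col):
    """Mean of one metric column, weighted by class support."""
    return sum(row[col] * row[3] for row in report.values()) / total_support


for name, col in (("precision", 0), ("recall", 1), ("f1", 2)):
    print(f"{name}: macro={macro_avg(col):.2f} weighted={weighted_avg(col):.2f}")
```

The large Politik class (446 of 764 test samples) pulls the weighted averages up to about 0.77, while the macro averages of about 0.70 reflect the weaker Sosial Budaya and Sumber Daya Alam classes.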

	## Usage

	Load the model with the Hugging Face Transformers library, either through a pipeline or directly:

	```python
	from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

	MODEL_ID = "Rendika/Trained-DistilBERT-Indonesia-Presidential-Election-Balanced-Dataset"

	# Option 1: use a pipeline as a high-level helper.
	pipe = pipeline("text-classification", model=MODEL_ID)

	# Option 2: load the tokenizer and model directly.
	tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
	model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
	```
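
A text-classification pipeline returns a list of dictionaries with `label` and `score` keys. If a model config does not define label names, Transformers falls back to generic names of the form `LABEL_<id>`; whether that applies to this model is an assumption. The sketch below maps such labels back to the categories in the Label Encoding table. The helper `decode_label` and the sample output are hypothetical and only illustrate the shape of the data.

```python
# Category names in encoded order, copied from the "Label Encoding" table.
LABELS = [
    "Demografi",
    "Ekonomi",
    "Geografi",
    "Ideologi",
    "Pertahanan dan Keamanan",
    "Politik",
    "Sosial Budaya",
    "Sumber Daya Alam",
]


def decode_label(raw_label):
    """Map a generic label such as "LABEL_5" to its category name.

    Labels that are already category names are returned unchanged.
    """
    if raw_label.startswith("LABEL_"):
        return LABELS[int(raw_label.split("_", 1)[1])]
    return raw_label


# Hypothetical pipeline output; the score is made up for illustration.
example_output = [{"label": "LABEL_5", "score": 0.91}]
for prediction in example_output:
    print(decode_label(prediction["label"]), prediction["score"])
```

With a loaded pipeline, the same helper could be applied to the result of `pipe("some Indonesian text")` in place of `example_output`.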

	## Conclusion

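	This fine-tuned DistilBERT model reaches 0.77 accuracy and 0.70 balanced accuracy on the test set. Performance varies by category: F1 is highest for Geografi (0.95) and Demografi (0.92) and lowest for Sosial Budaya (0.39) and Sumber Daya Alam (0.53). Augmentation and balancing reduced the class imbalance in the training data, although the gap between the final training accuracy (0.95) and the test accuracy (0.77) suggests that some overfitting remains.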

	Feel free to use this model for your Indonesian text classification tasks, and don't hesitate to reach out if you have any questions or feedback.

	---
