---
license: mit
language:
- id
- en
metrics:
- accuracy
- recall
- precision
- confusion_matrix
pipeline_tag: text-classification
tags:
- presidential election
- indonesia
- multiclass
---

# Fine-tuned DistilBERT Model for Indonesian Text Classification

## Overview

This repository contains a fine-tuned version of DistilBERT (based on [cahya/distilbert-base-indonesian](https://huggingface.co/cahya/distilbert-base-indonesian)) for Indonesian text classification. The model classifies text into eight categories: politics, socio-cultural affairs, defense and security, ideology, economy, natural resources, demography, and geography.

## Dataset

The training data was augmented and rebalanced to address class imbalance: as the tables below show, the two smallest categories (Demografi and Geografi) were augmented, while the remaining categories were trimmed. The class distribution before and after this step:

### Before Augmentation

| Category                | Count |
|-------------------------|-------|
| Politik                 | 2972  |
| Sosial Budaya           | 587   |
| Pertahanan dan Keamanan | 400   |
| Ideologi                | 400   |
| Ekonomi                 | 367   |
| Sumber Daya Alam        | 192   |
| Demografi               | 62    |
| Geografi                | 20    |

### After Augmentation

| Category                | Count |
|-------------------------|-------|
| Politik                 | 2969  |
| Demografi               | 427   |
| Sosial Budaya           | 422   |
| Ideologi                | 343   |
| Pertahanan dan Keamanan | 331   |
| Ekonomi                 | 309   |
| Sumber Daya Alam        | 156   |
| Geografi                | 133   |

## Label Encoding

| Encoded | Label                   |
|---------|-------------------------|
| 0       | Demografi               |
| 1       | Ekonomi                 |
| 2       | Geografi                |
| 3       | Ideologi                |
| 4       | Pertahanan dan Keamanan |
| 5       | Politik                 |
| 6       | Sosial Budaya           |
| 7       | Sumber Daya Alam        |

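The same mapping, expressed in code (a minimal sketch that simply mirrors the table above; later snippets in this README reuse these two dictionaries):

```python
# Encoded class ids and their labels, as documented in the table above.
id2label = {
    0: "Demografi",
    1: "Ekonomi",
    2: "Geografi",
    3: "Ideologi",
    4: "Pertahanan dan Keamanan",
    5: "Politik",
    6: "Sosial Budaya",
    7: "Sumber Daya Alam",
}
label2id = {label: idx for idx, label in id2label.items()}
```
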
## Data Split

The 5090 augmented samples were split into training and testing sets with an 85:15 ratio:

- **Train Size:** 4326 samples
- **Test Size:** 764 samples

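The splitting procedure itself is not documented; below is a minimal sketch using scikit-learn's `train_test_split`, where the stratification and random seed are assumptions:

```python
from sklearn.model_selection import train_test_split

# `texts` and `labels` are assumed to hold the 5090 augmented samples
# and their encoded class ids. Stratifying keeps the class ratios
# similar in the train and test splits (an assumption, not documented).
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.15, random_state=42, stratify=labels
)
```
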
## Model Training

The model was trained for 4 epochs, achieving the following results:

| Epoch | Train Loss | Train Accuracy |
|-------|------------|----------------|
| 1     | 1.0240     | 0.6766         |
| 2     | 0.5615     | 0.8220         |
| 3     | 0.3270     | 0.9014         |
| 4     | 0.1759     | 0.9481         |

### Results at Training Completion

- **Test Loss:** 0.7948
- **Test Accuracy:** 0.7687
- **Test Balanced Accuracy:** 0.7001

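The training script is not part of this README; the sketch below shows one plausible way to reproduce the fine-tuning with the Hugging Face `Trainer` API. The base checkpoint, label count, and epoch count come from this document (the label dictionaries are from the Label Encoding sketch above); the batch size, learning rate, and dataset wrapper are illustrative assumptions:

```python
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

base = "cahya/distilbert-base-indonesian"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(
    base, num_labels=8, id2label=id2label, label2id=label2id
)

class TextDataset(Dataset):
    """Pairs tokenized texts with their encoded labels for the Trainer."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

args = TrainingArguments(
    output_dir="checkpoints",
    num_train_epochs=4,              # matches the 4 epochs reported above
    per_device_train_batch_size=16,  # assumption: not documented
    learning_rate=5e-5,              # assumption: not documented
)

Trainer(
    model=model,
    args=args,
    train_dataset=TextDataset(train_texts, train_labels),
).train()
```
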
## Model Evaluation

On the test set, the model achieved the following weighted precision, recall, and F1 scores:

- **Precision Score:** 0.7714
- **Recall Score:** 0.7696
- **F1 Score:** 0.7697

### Classification Report

| Category                | Precision | Recall | F1-Score | Support |
|-------------------------|-----------|--------|----------|---------|
| Demografi               | 0.94      | 0.91   | 0.92     | 64      |
| Ekonomi                 | 0.67      | 0.72   | 0.69     | 46      |
| Geografi                | 0.95      | 0.95   | 0.95     | 20      |
| Ideologi                | 0.71      | 0.56   | 0.62     | 52      |
| Pertahanan dan Keamanan | 0.69      | 0.66   | 0.67     | 50      |
| Politik                 | 0.84      | 0.85   | 0.84     | 446     |
| Sosial Budaya           | 0.38      | 0.40   | 0.39     | 63      |
| Sumber Daya Alam        | 0.50      | 0.57   | 0.53     | 23      |

- **Accuracy:** 0.7696
- **Balanced Accuracy:** 0.7001
- **Macro Avg Precision:** 0.71
- **Macro Avg Recall:** 0.70
- **Macro Avg F1-Score:** 0.70
- **Weighted Avg Precision:** 0.77
- **Weighted Avg Recall:** 0.77
- **Weighted Avg F1-Score:** 0.77

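These figures can be reproduced with scikit-learn, assuming `test_labels` and `pred_ids` hold the gold and predicted class ids for the 764 test samples (a minimal sketch, reusing `id2label` from the Label Encoding section):

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             classification_report,
                             precision_recall_fscore_support)

# Weighted averages, matching the summary scores reported above.
precision, recall, f1, _ = precision_recall_fscore_support(
    test_labels, pred_ids, average="weighted"
)
print(f"Precision: {precision:.4f}  Recall: {recall:.4f}  F1: {f1:.4f}")
print("Accuracy:", accuracy_score(test_labels, pred_ids))
print("Balanced accuracy:", balanced_accuracy_score(test_labels, pred_ids))

# Per-category breakdown, as in the Classification Report table.
print(classification_report(test_labels, pred_ids,
                            target_names=list(id2label.values())))
```
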
## Usage

To use this model, load it with the Hugging Face Transformers library:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_path = "path_to_your_model_directory_or_hub_repo"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

# Example usage: tokenize a text, run the model, and pick the
# highest-scoring class id (decode it with the Label Encoding table above).
inputs = tokenizer("Your text here", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
predicted_id = outputs.logits.argmax(dim=-1).item()
```

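For quick experiments, the `pipeline` API handles tokenization and label decoding in one call (the example sentence is illustrative; the returned label names depend on the model's stored config):

```python
from transformers import pipeline

classifier = pipeline("text-classification", model=model_path)
print(classifier("Pemerintah menaikkan anggaran pertahanan tahun depan."))
```
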
## Conclusion

This fine-tuned DistilBERT model for Indonesian text classification reaches about 77% accuracy (0.70 balanced accuracy) on the test set. Performance is strong for most categories, though Sosial Budaya remains challenging (F1 0.39). Augmenting and balancing the dataset helped the model handle formerly under-represented categories such as Demografi and Geografi.

Feel free to use this model for your Indonesian text classification tasks, and don't hesitate to reach out if you have any questions or feedback.

---