File size: 3,706 Bytes
7114794
 
 
 
 
 
 
 
 
 
11c3172
 
89c1475
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
---
license: apache-2.0
datasets:
- KDAI-NLP/traffy-fondue-type-only
language:
- th
metrics:
- f1
tags:
- roberta
widget:
- text: "แยกอโศกฝนตกน้ำท่วมหนักมากครับ ต้นไม้ก็ล้มขวางทางรถติดชห"
---

# Traffy Complaint Classification

This model is trained to automatically classify types of traffic complaints in Thai text, aiming to reduce the need for manual classification by humans.

### Model Details

    Model Name: KDAI-NLP/wangchanberta-traffy-multi
    Tokenizer: airesearch/wangchanberta-base-att-spm-uncased
    License: Apache License 2.0

### How to Use

```python

!pip install sentencepiece

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from torch.nn.functional import sigmoid
import json

# Target lists
target_list = [
    'ความสะอาด', 'สายไฟ', 'สะพาน', 'ถนน', 'น้ำท่วม',
    'ร้องเรียน', 'ท่อระบายน้ำ', 'ความปลอดภัย', 'คลอง', 'แสงสว่าง',
    'ทางเท้า', 'จราจร', 'กีดขวาง', 'การเดินทาง', 'เสียงรบกวน',
    'ต้นไม้', 'สัตว์จรจัด', 'เสนอแนะ', 'คนจรจัด', 'ห้องน้ำ',
    'ป้ายจราจร', 'สอบถาม', 'ป้าย', 'PM2.5'
]

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("airesearch/wangchanberta-base-att-spm-uncased")
model = AutoModelForSequenceClassification.from_pretrained("KDAI-NLP/wangchanberta-traffy-multi")

# Example text to classify
text = "ช่วยด้วยครับถนนน้ำท่วมอีกแล้ว ต้นไม้ก็ล้มขวางทาง กลับบ้านไม่ได้"

# Encode the text using the tokenizer
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=256)

# Get model predictions (logits)
with torch.no_grad():
    logits = model(**inputs).logits

# Apply sigmoid function to convert logits to probabilities
probabilities = sigmoid(logits)

# Map probabilities to corresponding labels
probabilities = probabilities.squeeze().tolist()
label_probabilities = zip(target_list, probabilities)

# Print labels with probabilities
for label, probability in label_probabilities:
    print(f"{label}: {probability:.4f}")

# Or JSON
# Create a dictionary for labels and probabilities
results_dict = {label: probability for label, probability in label_probabilities}

# Convert dictionary to JSON string
results_json = json.dumps(results_dict, ensure_ascii=False, indent=4)

# Print the JSON string
print(results_json)
```

## Training Details

The model was trained on traffic complaint data API (included stopwords) using the airesearch/wangchanberta-base-att-spm-uncased base model. This is a multi-label classification task with a total of 24 classes.

## Training Scores

| Model                              | Stopword | Epoch | Training Loss | Validation Loss | F1      | Accuracy |
| ---------------------------------- | -------- | ----- | ------------- | --------------- | ------- | -------- |
| wangchanberta-base-att-spm-uncased | Included | 0     | 0.0322        | 0.034822        | 0.7015  | 0.7569   |
| wangchanberta-base-att-spm-uncased | Included | 2     | 0.0207        | 0.026364        | 0.8405  | 0.7821   |
| wangchanberta-base-att-spm-uncased | Included | 4     | 0.0165        | 0.025142        | 0.8458  | 0.7934   |


Feel free to customize the README further if needed.