---
license: apache-2.0
datasets:
- KDAI-NLP/traffy-fondue-type-only
language:
- th
metrics:
- f1
tags:
- roberta
widget:
- text: "แยกอโศกฝนตกน้ำท่วมหนักมากครับ ต้นไม้ก็ล้มขวางทางรถติดชห"
---
# Traffy Complaint Classification
This multi-label model classifies traffic complaints written in Thai into one or more complaint types, with the goal of reducing the need for manual classification.
Note that the Hugging Face inference widget (right-side UI) does not yet support multi-label classification.
For multi-label output, use the code provided below.
## Model Details
- Model Name: KDAI-NLP/wangchanberta-traffy-multi
- Tokenizer: airesearch/wangchanberta-base-att-spm-uncased
- License: Apache License 2.0
## How to Use
```python
# Install the tokenizer dependency first (shell command; prefix with "!" in a notebook):
#   pip install sentencepiece
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from torch.nn.functional import sigmoid
import json
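# Target labels (English glosses in the trailing comments)
target_list = [
    'ความสะอาด', 'สายไฟ', 'สะพาน', 'ถนน', 'น้ำท่วม',  # cleanliness, power lines, bridge, road, flooding
    'ร้องเรียน', 'ท่อระบายน้ำ', 'ความปลอดภัย', 'คลอง', 'แสงสว่าง',  # complaint, drainage pipe, safety, canal, lighting
    'ทางเท้า', 'จราจร', 'กีดขวาง', 'การเดินทาง', 'เสียงรบกวน',  # sidewalk, traffic, obstruction, travel, noise
    'ต้นไม้', 'สัตว์จรจัด', 'เสนอแนะ', 'คนจรจัด', 'ห้องน้ำ',  # trees, stray animals, suggestion, homeless people, toilets
    'ป้ายจราจร', 'สอบถาม', 'ป้าย', 'PM2.5'  # traffic signs, inquiry, signs, PM2.5
]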
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("airesearch/wangchanberta-base-att-spm-uncased")
model = AutoModelForSequenceClassification.from_pretrained("KDAI-NLP/wangchanberta-traffy-multi")
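# Example text to classify. English: "Help, the road is flooded again,
# and a tree has fallen across the way. I can't get home."
text = "ช่วยด้วยครับถนนน้ำท่วมอีกแล้ว ต้นไม้ก็ล้มขวางทาง กลับบ้านไม่ได้"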
# Encode the text using the tokenizer
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=256)
# Get model predictions (logits)
with torch.no_grad():
    logits = model(**inputs).logits
# Apply sigmoid function to convert logits to probabilities
probabilities = sigmoid(logits)
# Map probabilities to their labels; materialize the pairs as a list,
# because a bare zip() iterator would be exhausted after the first loop
probabilities = probabilities.squeeze().tolist()
label_probabilities = list(zip(target_list, probabilities))
# Print labels with probabilities
for label, probability in label_probabilities:
    print(f"{label}: {probability:.4f}")
# Alternatively, output the results as JSON
# Create a dictionary of labels and probabilities
results_dict = dict(label_probabilities)
# Convert the dictionary to a JSON string (ensure_ascii=False keeps Thai text readable)
results_json = json.dumps(results_dict, ensure_ascii=False, indent=4)
# Print the JSON string
print(results_json)
```
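The snippet above prints a probability for every label but does not decide which labels actually apply to the text. A minimal sketch of that last step follows, assuming a cutoff of 0.5. The `threshold` value and the `select_labels` helper are illustrative and not part of this model card; the cutoff is best tuned on validation data.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def select_labels(logits, labels, threshold=0.5):
    """Return (label, probability) pairs whose probability meets the threshold.

    Each label is scored independently, so zero, one, or several labels
    may be returned for a single text.
    """
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    scored = [(label, sigmoid(logit)) for label, logit in zip(labels, logits)]
    selected = [(label, p) for label, p in scored if p >= threshold]
    # Highest-probability labels first
    return sorted(selected, key=lambda pair: pair[1], reverse=True)


# Hypothetical logits for three of the 24 labels, for illustration only
demo_labels = ["ถนน", "น้ำท่วม", "ต้นไม้"]
demo_logits = [-1.2, 3.1, 0.4]
predicted = select_labels(demo_logits, demo_labels)
print([label for label, _ in predicted])
```

With real model output, `logits.squeeze().tolist()` from the usage code can be passed in place of `demo_logits`, together with the full `target_list`.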
## Training Details
The model was fine-tuned from the airesearch/wangchanberta-base-att-spm-uncased base model on data from the traffic complaint data API, with stopwords included. The task is multi-label classification over 24 classes.
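
In a multi-label setup, each training example carries a multi-hot target vector (one slot per class) instead of a single class index, and every class is scored independently. The sketch below illustrates that encoding together with the per-class binary cross-entropy commonly used for such tasks. This card does not state the exact loss or preprocessing that was used, so the helpers `encode_targets` and `bce_with_logits` and the shortened label list are hypothetical.

```python
import math


def encode_targets(example_labels, all_labels):
    """Turn a set of label names into a multi-hot vector aligned with all_labels."""
    unknown = set(example_labels) - set(all_labels)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1.0 if label in example_labels else 0.0 for label in all_labels]


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over classes, computed from raw logits.

    Uses the numerically stable form max(x, 0) - x*y + log(1 + exp(-|x|)).
    """
    losses = [
        max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
        for x, y in zip(logits, targets)
    ]
    return sum(losses) / len(losses)


# Shortened label list for illustration; the real model uses 24 classes
labels = ["ถนน", "น้ำท่วม", "ต้นไม้"]
target = encode_targets({"น้ำท่วม", "ต้นไม้"}, labels)
print(target)  # [0.0, 1.0, 1.0]

# A confident, correct prediction yields a lower loss than an uninformed one
good = bce_with_logits([-4.0, 4.0, 4.0], target)
flat = bce_with_logits([0.0, 0.0, 0.0], target)
print(good < flat)  # True
```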
## Training Scores
| Model | Stopword | Epoch | Training Loss | Validation Loss | F1 | Accuracy |
| ---------------------------------- | -------- | ----- | ------------- | --------------- | ------- | -------- |
| wangchanberta-base-att-spm-uncased | Included | 0 | 0.0322 | 0.034822 | 0.7015 | 0.7569 |
| wangchanberta-base-att-spm-uncased | Included | 2 | 0.0207 | 0.026364 | 0.8405 | 0.7821 |
| wangchanberta-base-att-spm-uncased | Included | 4 | 0.0165 | 0.025142 | 0.8458 | 0.7934 |
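
The F1 and Accuracy columns can diverge in a multi-label task, because accuracy is often computed as an exact match over the full label set while F1 gives credit for partially correct predictions. This card does not say how either metric was computed, so the sketch below assumes micro-averaged F1 and exact-match (subset) accuracy, applied to made-up predictions purely to illustrate the difference.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over all (example, class) pairs of multi-hot vectors."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t and p:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def subset_accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label set matches exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up multi-hot labels for four examples and three classes
y_true = [[1, 1, 0], [0, 1, 0], [1, 0, 1], [0, 0, 1]]
y_pred = [[1, 1, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]]

print(f"micro F1:        {micro_f1(y_true, y_pred):.4f}")
print(f"subset accuracy: {subset_accuracy(y_true, y_pred):.4f}")
```

Here two of the four examples are only partially correct: they count as misses for subset accuracy but still contribute true positives to F1.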