---
license: apache-2.0
datasets:
- KDAI-NLP/traffy-fondue-type-only
language:
- th
metrics:
- f1
tags:
- roberta
widget:
- text: "แยกอโศกฝนตกน้ำท่วมหนักมากครับ ต้นไม้ก็ล้มขวางทางรถติดชห"
---
# Traffy Complaint Classification
This model automatically classifies the types of traffic complaints written in Thai, with the aim of reducing the need for manual classification.
## Model Details
- Model Name: KDAI-NLP/wangchanberta-traffy-multi
- Tokenizer: airesearch/wangchanberta-base-att-spm-uncased
- License: Apache License 2.0
## How to Use
```python
# Requires sentencepiece: run `pip install sentencepiece` in a shell
# (or `!pip install sentencepiece` in a notebook cell)
import json

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Target labels
target_list = [
    'ความสะอาด', 'สายไฟ', 'สะพาน', 'ถนน', 'น้ำท่วม',
    'ร้องเรียน', 'ท่อระบายน้ำ', 'ความปลอดภัย', 'คลอง', 'แสงสว่าง',
    'ทางเท้า', 'จราจร', 'กีดขวาง', 'การเดินทาง', 'เสียงรบกวน',
    'ต้นไม้', 'สัตว์จรจัด', 'เสนอแนะ', 'คนจรจัด', 'ห้องน้ำ',
    'ป้ายจราจร', 'สอบถาม', 'ป้าย', 'PM2.5'
]

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("airesearch/wangchanberta-base-att-spm-uncased")
model = AutoModelForSequenceClassification.from_pretrained("KDAI-NLP/wangchanberta-traffy-multi")

# Example text to classify
# (English: "Please help, the road is flooded again and a fallen tree is blocking the way. I can't get home.")
text = "ช่วยด้วยครับถนนน้ำท่วมอีกแล้ว ต้นไม้ก็ล้มขวางทาง กลับบ้านไม่ได้"

# Encode the text using the tokenizer
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=256)

# Get model predictions (logits) without tracking gradients
with torch.no_grad():
    logits = model(**inputs).logits

# Apply the sigmoid function to convert logits to per-label probabilities
probabilities = torch.sigmoid(logits).squeeze(0).tolist()

# Pair labels with probabilities; a list (not a bare zip iterator) can be reused below
label_probabilities = list(zip(target_list, probabilities))

# Print labels with probabilities
for label, probability in label_probabilities:
    print(f"{label}: {probability:.4f}")

# Or, as JSON: build a label-to-probability dictionary and serialize it
results_dict = dict(label_probabilities)
results_json = json.dumps(results_dict, ensure_ascii=False, indent=4)
print(results_json)
```
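
The snippet above prints a probability for every label, but a multi-label classifier usually needs a final step that decides which labels to report. The following is a minimal sketch of that step, assuming a label-to-probability dictionary like `results_dict`. The helper name `select_labels`, the 0.5 threshold, and the example probabilities are illustrative assumptions; the model card does not specify a decision threshold.

```python
from typing import Dict, List, Tuple


def select_labels(
    label_probabilities: Dict[str, float],
    threshold: float = 0.5,  # assumed cutoff, not taken from the model card
) -> List[Tuple[str, float]]:
    """Return (label, probability) pairs at or above the threshold, highest first."""
    selected = [
        (label, probability)
        for label, probability in label_probabilities.items()
        if probability >= threshold
    ]
    return sorted(selected, key=lambda pair: pair[1], reverse=True)


# Hypothetical probabilities for illustration only (not real model output)
example = {"น้ำท่วม": 0.91, "ต้นไม้": 0.78, "ถนน": 0.42, "PM2.5": 0.01}
predicted = select_labels(example, threshold=0.5)
print(predicted)
```

A lower threshold reports more labels per complaint at the cost of more false positives, so the value may be worth tuning on held-out data.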
## Training Details
The model was fine-tuned from the airesearch/wangchanberta-base-att-spm-uncased base model on data from the traffic complaint data API, with stopwords included. This is a multi-label classification task with 24 classes, so a single complaint can belong to several classes at once.
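
To make the multi-label setup concrete, the sketch below shows one common way to represent a complaint's labels as a 24-dimensional multi-hot vector, one position per class. This is an assumption about the target format, since the training pipeline is not described here; the helper name `encode_labels` is hypothetical, and the label order is copied from the usage example.

```python
from typing import List

# Label order copied from the usage example in this README
TARGET_LIST = [
    'ความสะอาด', 'สายไฟ', 'สะพาน', 'ถนน', 'น้ำท่วม',
    'ร้องเรียน', 'ท่อระบายน้ำ', 'ความปลอดภัย', 'คลอง', 'แสงสว่าง',
    'ทางเท้า', 'จราจร', 'กีดขวาง', 'การเดินทาง', 'เสียงรบกวน',
    'ต้นไม้', 'สัตว์จรจัด', 'เสนอแนะ', 'คนจรจัด', 'ห้องน้ำ',
    'ป้ายจราจร', 'สอบถาม', 'ป้าย', 'PM2.5'
]


def encode_labels(labels: List[str], target_list: List[str] = TARGET_LIST) -> List[float]:
    """Return a multi-hot vector with 1.0 at each position whose class is present."""
    unknown = set(labels) - set(target_list)
    if unknown:
        raise ValueError(f"Unknown labels: {sorted(unknown)}")
    present = set(labels)
    return [1.0 if label in present else 0.0 for label in target_list]


# A complaint about flooding and a fallen tree carries two labels at once
vector = encode_labels(['น้ำท่วม', 'ต้นไม้'])
print(len(vector), sum(vector))
```

With targets in this form, each output is treated as an independent yes/no decision, which is why the usage example applies a sigmoid to each logit rather than a softmax across classes.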
## Training Scores
| Model | Stopword | Epoch | Training Loss | Validation Loss | F1 | Accuracy |
| ---------------------------------- | -------- | ----- | ------------- | --------------- | ------- | -------- |
| wangchanberta-base-att-spm-uncased | Included | 0 | 0.0322 | 0.034822 | 0.7015 | 0.7569 |
| wangchanberta-base-att-spm-uncased | Included | 2 | 0.0207 | 0.026364 | 0.8405 | 0.7821 |
| wangchanberta-base-att-spm-uncased | Included | 4 | 0.0165 | 0.025142 | 0.8458 | 0.7934 |
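
The table does not state how F1 and accuracy were computed. As an illustration of how such scores can be derived for multi-label predictions, the sketch below computes micro-averaged F1 and exact-match (subset) accuracy in plain Python. Both choices are assumptions rather than the documented evaluation method, and the function name and toy data are hypothetical.

```python
from typing import List, Tuple


def micro_f1_and_subset_accuracy(
    y_true: List[List[int]], y_pred: List[List[int]]
) -> Tuple[float, float]:
    """Compute micro-averaged F1 and exact-match accuracy for multi-hot labels."""
    tp = fp = fn = exact = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1  # label correctly predicted
            elif t == 0 and p == 1:
                fp += 1  # label predicted but not present
            elif t == 1 and p == 0:
                fn += 1  # label present but missed
        if true_row == pred_row:
            exact += 1  # every label in the row matches
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    accuracy = exact / len(y_true) if y_true else 0.0
    return f1, accuracy


# Toy data with three samples and three classes (illustration only)
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 1], [0, 1, 1], [1, 0, 0]]
f1, accuracy = micro_f1_and_subset_accuracy(y_true, y_pred)
print(f"micro-F1: {f1:.4f}, subset accuracy: {accuracy:.4f}")
```

Exact-match accuracy only credits a sample when every label is right, so under these definitions it can sit below F1 for the same predictions.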