---
library_name: transformers
base_model:
- monologg/kobert
---
# KoBERT-based Korean Emotion Classification Model
This project contains code for training and using a KoBERT-based model that **classifies the emotion of Korean text**. Given an input text, the model predicts which of five emotions it expresses: **Anger, Fear, Happy, Tender, or Sad**.
## 1. Model Training Process
### Colab Environment Setup and Data Preparation
1. **Install the required libraries**:
   Install the `transformers`, `datasets`, `torch`, `pandas`, and `scikit-learn` libraries.
2. **Load the data**:
   Load the emotion-classification CSV file built from the Korean emotional dialogue dataset published on AI Hub.
3. **Prepare the dataset**:
   - **Train/validation split**: Use 80% of the data for training and 20% for validation.
   - **Convert to a HuggingFace Dataset**: Convert the Pandas DataFrame into a HuggingFace `Dataset`.
   - **Rename the label column**: Rename the `label_int` column, which holds the emotion labels, to `labels`.
   - **Tokenize the data**: Tokenize the input text with the `monologg/kobert` tokenizer.
   - **Select the final columns**: Keep only `input_ids`, `attention_mask`, and `labels`, completing the training preparation.
4. **Model and training setup**:
   - **Model**: Load `monologg/kobert` and configure it to classify the five emotion labels.
   - **Training hyperparameters**:
     - `learning_rate=2e-5`, `num_train_epochs=10`, `batch_size=16`.
     - The best checkpoint is kept based on F1 score.
     - Early stopping is applied.
5. **Run training and save the model**:
   - After training completes, the model is saved to Google Drive.
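The split-and-rename portion of the preparation above can be sketched as follows. This is a minimal illustration, not the released training code: it assumes a DataFrame with `text` and `label_int` columns, and the `prepare_splits` helper name is hypothetical. Tokenization and conversion to a HuggingFace `Dataset` would follow.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def prepare_splits(df: pd.DataFrame, seed: int = 42):
    """80/20 train/validation split with the label column renamed to `labels`.

    Renaming `label_int` -> `labels` matches what the HuggingFace Trainer
    expects as the target column name.
    """
    df = df.rename(columns={"label_int": "labels"})
    # Stratify on the labels so the 5 emotion classes keep their ratios
    train_df, val_df = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df["labels"]
    )
    return train_df.reset_index(drop=True), val_df.reset_index(drop=True)
```

The stratified split keeps the class balance of the five emotions identical in both partitions, which matters when macro F1 is the model-selection metric.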
### Performance Evaluation and Testing
- **Evaluation metrics**: Accuracy and F1 score (macro and weighted) are computed.
- **Test-set evaluation**: The trained model is evaluated on the test dataset.
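The metrics above can be computed with a `compute_metrics` callback in the `(logits, labels)` shape the HuggingFace `Trainer` passes to it. This is a sketch using `scikit-learn`, not the exact function from the training notebook:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    """Accuracy plus macro and weighted F1, as reported in evaluation."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit per example
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1_macro": f1_score(labels, preds, average="macro"),
        "f1_weighted": f1_score(labels, preds, average="weighted"),
    }
```

Macro F1 averages per-class scores equally, while weighted F1 weights them by class frequency; reporting both shows whether rare emotions drag performance down.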
## 2. How to Use the Model
### Prerequisites
- The trained model can be loaded directly from the HuggingFace Hub.
- The model and tokenizer are based on `monologg/kobert`, and the classification labels are as follows:
  - **Anger**: 😡
  - **Fear**: 😨
  - **Happy**: 😊
  - **Tender**: 🥰
  - **Sad**: 😢
### Usage Examples
1. **Single-sentence emotion analysis**:
   - The model predicts the emotion of a user-supplied text and prints the probability of each emotion alongside the prediction.
2. **Emotion analysis from an Excel file**:
   - Reads the specified text column and row range from an Excel file, classifies the emotion of each text, and prints the results.
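The Excel workflow in example 2 could be structured roughly as follows. The `classify_rows` helper and its parameters are illustrative (not part of the released code), and `classify` stands in for a call into the model shown in the code example below:

```python
import pandas as pd

def classify_rows(df, text_column, start_row, end_row, classify):
    """Run an emotion classifier over a row range of a DataFrame.

    `classify` is any callable mapping a string to a label. Note that
    `.loc` slicing is inclusive on both ends, so rows start_row..end_row
    are all processed.
    """
    texts = df.loc[start_row:end_row, text_column]
    return [(text, classify(text)) for text in texts]

# In practice the DataFrame would come from an Excel file, e.g.:
# df = pd.read_excel("reviews.xlsx")
```

Separating the row-selection logic from the model call keeps the same helper usable for single files, batches, or unit tests with a stub classifier.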
### Code Example
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the KoBERT tokenizer and the fine-tuned classification model
tokenizer = AutoTokenizer.from_pretrained("monologg/kobert", trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained("rkdaldus/ko-sent5-classification")

# Classify the emotion of a user-supplied sentence
text = "오늘 정말 행복해!"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
predicted_label = torch.argmax(outputs.logits, dim=1).item()

# Emotion label definitions
emotion_labels = {
    0: ("Angry", "😡"),
    1: ("Fear", "😨"),
    2: ("Happy", "😊"),
    3: ("Tender", "🥰"),
    4: ("Sad", "😢"),
}

# Print the predicted emotion
name, emoji = emotion_labels[predicted_label]
print(f"Predicted emotion: {name} {emoji}")
```