---
language:
- vi
metrics:
- accuracy
- f1
tags:
- sentiment-analysis
- social-listening
---

# 5CD-ViSoBERT for Vietnamese Sentiment Analysis

This model is based on our pretrained [5CD-AI/visobert-14gb-corpus](https://huggingface.co/5CD-AI/visobert-14gb-corpus), which was continually pretrained on a 14GB corpus of Vietnamese social-media content.

We fine-tuned it on <b>120K</b> Vietnamese sentiment-annotated samples, comprising comments and reviews from e-commerce platforms, social media, and forums.

The model produces softmax scores over three labels.

<b>Labels</b>:

```
0 -> Negative
1 -> Positive
2 -> Neutral
```
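
The label names shown in the outputs below (`NEG`, `POS`, `NEU`) come from the model config's `id2label` mapping. A minimal sketch for inspecting it (the expected mapping is inferred from the label table and example outputs in this card):

```python
from transformers import AutoConfig

# Inspect the id -> label-name mapping shipped with the model
config = AutoConfig.from_pretrained("5CD-AI/vietnamese-sentiment-visobert")
print(config.id2label)  # expected: {0: 'NEG', 1: 'POS', 2: 'NEU'}
```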

## Usage (HuggingFace Transformers)

Install the `transformers` package:

```bash
pip install transformers
```

### Pipeline

```python
from transformers import pipeline

model_path = '5CD-AI/vietnamese-sentiment-visobert'
sentiment_task = pipeline("sentiment-analysis", model=model_path, tokenizer=model_path)
sentiment_task("Miếng dán dễ xước , ko khít với dt 11 prm")
```
```
[{'label': 'NEG', 'score': 0.998149037361145}]
```
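
To get scores for all three labels rather than only the top one, recent `transformers` versions accept a `top_k` argument on the pipeline call; a minimal sketch:

```python
# top_k=None returns one entry per label instead of only the best label
sentiment_task("Miếng dán dễ xước , ko khít với dt 11 prm", top_k=None)
```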

### Full classification

```python
import numpy as np
import torch
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer

# Load model (runs on GPU if available, otherwise CPU)
model_path = '5CD-AI/vietnamese-sentiment-visobert'
tokenizer = AutoTokenizer.from_pretrained(model_path)
config = AutoConfig.from_pretrained(model_path)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForSequenceClassification.from_pretrained(model_path).to(device)

# Other example sentences to try:
# sentence = 'Đây là mô hình rất hay, đáp ứng tốt nhu cầu của nhiều doanh nghiệp Việt.'
# sentence = 'Cà phê dở ko ngon, ai chưa mua thì đừng mua'
# sentence = 'Cũng tạm. Ko gì đb'
# sentence = 'Nịnh bợ vừa thui báo ơi.nhà từ trong trứng ra mà sao sáng đc.'
# sentence = '''Dm mới kéo cái tuột luôn cái kính cường lực🙂
# R phải cầm cái kính tự dán🙂 để lâu quá nó dính hai cục bụi lên nữa chứ má bực thiệt chứ
# Hình như tại hai cái cục nam châm nó xúc ra 😑'''
# sentence = 'Mấy cái khóa kiểu này ông lên youtube tự học còn ngon hơn.'
sentence = 'Cũng giống mấy khoá Youtube học cũng được'
print('Sentence:', sentence)

# Encode the sentence and run a forward pass
input_ids = torch.tensor([tokenizer.encode(sentence)]).to(device)
with torch.no_grad():
    out = model(input_ids)
    scores = out.logits.softmax(dim=-1).cpu().numpy()[0]

# Print labels and scores, ranked from highest to lowest
ranking = np.argsort(scores)[::-1]
print("### Sentiment score ####")
for i in range(scores.shape[0]):
    label = config.id2label[ranking[i]]
    score = scores[ranking[i]]
    print(f"{i+1}) {label}: {np.round(float(score), 4)}")
```

Output:

```
Sentence: Cũng giống mấy khoá Youtube học cũng được
### Sentiment score ####
1) NEU: 0.8928
2) NEG: 0.0586
3) POS: 0.0486
```
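
For many sentences at once, the tokenizer can pad and truncate a whole batch in one call. A minimal batched sketch reusing `tokenizer`, `model`, `config`, and `device` from the block above (the sentence list is illustrative):

```python
sentences = [
    'Cà phê dở ko ngon, ai chưa mua thì đừng mua',
    'Đây là mô hình rất hay, đáp ứng tốt nhu cầu của nhiều doanh nghiệp Việt.',
]
# Pad/truncate the batch to the model's 256-token limit and move it to the device
batch = tokenizer(sentences, padding=True, truncation=True,
                  max_length=256, return_tensors="pt").to(device)
with torch.no_grad():
    probs = model(**batch).logits.softmax(dim=-1)
for sent, p in zip(sentences, probs):
    pred = p.argmax().item()
    print(f"{config.id2label[pred]} ({p[pred].item():.4f}) <- {sent}")
```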

## Fine-tuning Configuration

We fine-tuned `5CD-AI/visobert-14gb-corpus` on the downstream task with the `transformers` library, using the following configuration (an illustrative `TrainingArguments` sketch follows the list):
- seed: 42
- gradient_accumulation_steps: 1
- weight_decay: 0.01
- optimizer: AdamW with betas=(0.9, 0.999) and epsilon=1e-08
- training_epochs: 5
- model_max_length: 256
- learning_rate: 2e-5
- metric_for_best_model: wf1
- evaluation and save strategy: epoch
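
These settings correspond roughly to the `TrainingArguments` below. This is a sketch, not the authors' training script: the output path is hypothetical, and `wf1` is assumed to be a weighted F1 score returned by a custom `compute_metrics` function.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="visobert-sentiment",  # hypothetical output path
    seed=42,
    num_train_epochs=5,
    learning_rate=2e-5,
    weight_decay=0.01,
    gradient_accumulation_steps=1,
    # AdamW with betas=(0.9, 0.999) and epsilon=1e-08 is the Trainer default;
    # model_max_length=256 is set on the tokenizer, not here
    eval_strategy="epoch",  # named `evaluation_strategy` in older transformers releases
    save_strategy="epoch",
    metric_for_best_model="wf1",  # assumes compute_metrics returns a "wf1" entry
    load_best_model_at_end=True,  # assumption: required for metric_for_best_model to apply
)
```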