baohuynhbk14 committed on
Commit e148f23 · verified · 1 Parent(s): 160cca5

Create README.md

  1. README.md +111 -0
README.md ADDED
@@ -0,0 +1,111 @@
+ ---
+ language:
+ - vi
+ metrics:
+ - accuracy
+ - f1
+ tags:
+ - sentiment-analysis
+ - social-listening
+ ---
+
+ # 5CD-ViSoBERT for Vietnamese Sentiment Analysis
+
+ This model is based on our pretrained [5CD-AI/visobert-14gb-corpus](https://huggingface.co/5CD-AI/visobert-14gb-corpus), which was continually pretrained on a 14 GB corpus of Vietnamese social content.
+
+ The model is fine-tuned on <b>120K</b> samples from Vietnamese sentiment datasets, including comments and reviews from e-commerce platforms, social media, and forums.
+
+ It outputs softmax probabilities over three labels.
+
+ <b>Labels</b>:
+
+ ```
+ 0 -> Negative
+ 1 -> Positive
+ 2 -> Neutral
+ ```
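
As a rough illustration of how softmax outputs relate to the label table above, the following minimal sketch converts raw logits into probabilities and picks the most likely label. The logits are hypothetical values chosen for the example, not real model output, and the names `ID2NAME`, `softmax`, and `predict_label` are illustrative helpers rather than part of the model or of `transformers`.

```python
import math

# Label table copied from the README; the dict name is illustrative.
ID2NAME = {0: "Negative", 1: "Positive", 2: "Neutral"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits):
    """Return (label name, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2NAME[best], probs[best]


# Hypothetical logits, not produced by the real model.
name, prob = predict_label([-1.2, 0.3, 2.5])
print(name, round(prob, 4))
```

The probabilities always sum to 1, so the top score can be read as the model's confidence in that label.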
+
+ ## Usage (HuggingFace Transformers)
+
+ Install the `transformers` package:
+
+     pip install transformers
+
+ ### Pipeline
+
+ ```python
+ from transformers import pipeline
+
+ # Load the fine-tuned checkpoint from the Hugging Face Hub
+ model_path = '5CD-AI/vietnamese-sentiment-visobert'
+ sentiment_task = pipeline("sentiment-analysis", model=model_path, tokenizer=model_path)
+
+ # Example review, roughly: "The screen protector scratches easily and doesn't fit the 11 Pro Max phone"
+ sentiment_task("Miếng dán dễ xước , ko khít với dt 11 prm")
+ ```
+ ```
+ [{'label': 'NEG', 'score': 0.998149037361145}]
+ ```
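
The pipeline returns a list of dictionaries with a short label and a score. One possible way to turn that into readable text is sketched below. It assumes that the short labels `NEG`, `POS`, and `NEU` correspond to Negative, Positive, and Neutral from the label table; `SHORT_TO_NAME` and `describe` are hypothetical helpers, not part of `transformers`.

```python
# Assumed correspondence between the pipeline's short labels and the
# README's label names; this mapping is not read from the model config.
SHORT_TO_NAME = {"NEG": "Negative", "POS": "Positive", "NEU": "Neutral"}


def describe(results):
    """Turn pipeline-style results into readable strings."""
    return [
        f"{SHORT_TO_NAME.get(r['label'], r['label'])} ({r['score']:.1%})"
        for r in results
    ]


# Result copied from the example output above.
sample = [{'label': 'NEG', 'score': 0.998149037361145}]
print(describe(sample))  # ['Negative (99.8%)']
```

Unknown labels fall through unchanged, so the helper stays usable if the model config uses different label strings.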
+
+ ### Full classification
+
+ ```python
+ from transformers import AutoModelForSequenceClassification
+ from transformers import AutoTokenizer, AutoConfig
+ import numpy as np
+ import torch
+
+ # Load the model, tokenizer, and config from the Hugging Face Hub
+ model_path = '5CD-AI/vietnamese-sentiment-visobert'
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
+ config = AutoConfig.from_pretrained(model_path)
+ model = AutoModelForSequenceClassification.from_pretrained(model_path).to(device)
+
+ # Other example sentences to try:
+ # sentence = 'Đây là mô hình rất hay, đáp ứng tốt nhu cầu của nhiều doanh nghiệp Việt.'
+ # sentence = 'Cà phê dở ko ngon, ai chưa mua thì đừng mua'
+ # sentence = 'Cũng tạm. Ko gì đb'
+ # sentence = 'Nịnh bợ vừa thui báo ơi.nhà từ trong trứng ra mà sao sáng đc.'
+
+ # sentence = '''Dm mới kéo cái tuột luôn cái kính cường lực🙂
+ # R phải cầm cái kính tự dán🙂 để lâu quá nó dính hai cục bụi lên nữa chứ má bực thiệt chứ
+ # Hình như tại hai cái cục nam châm nó xúc ra 😑'''
+
+ # sentence = 'Mấy cái khóa kiểu này ông lên youtube tự học còn ngon hơn.'
+ sentence = 'Cũng giống mấy khoá Youtube học cũng được'
+ print('Sentence:', sentence)
+
+ # Tokenize the sentence into a batch of one
+ input_ids = torch.tensor([tokenizer.encode(sentence)]).to(device)
+
+ # Run inference without tracking gradients and convert logits to probabilities
+ with torch.no_grad():
+     out = model(input_ids)
+     scores = out.logits.softmax(dim=-1).cpu().numpy()[0]
+
+ # Print labels and scores, highest probability first
+ ranking = np.argsort(scores)[::-1]
+ print("### Sentiment score ####")
+ for i, idx in enumerate(ranking):
+     label = config.id2label[int(idx)]
+     score = scores[idx]
+     print(f"{i+1}) {label}: {np.round(float(score), 4)}")
+ ```
+
+ Output:
+
+ ```
+ Sentence: Cũng giống mấy khoá Youtube học cũng được
+ ### Sentiment score ####
+ 1) NEU: 0.8928
+ 2) NEG: 0.0586
+ 3) POS: 0.0486
+ ```
+
+ ## Fine-tune Configuration
+
+ We fine-tuned `5CD-AI/visobert-14gb-corpus` on the downstream sentiment analysis task using the `transformers` library with the following configuration:
+
+ - seed: 42
+ - gradient_accumulation_steps: 1
+ - weight_decay: 0.01
+ - optimizer: AdamW with betas=(0.9, 0.999) and epsilon=1e-08
+ - training_epochs: 5
+ - model_max_length: 256
+ - learning_rate: 2e-5
+ - metric_for_best_model: wf1 (weighted F1)
+ - strategy: epoch
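
The configuration above could be expressed roughly as follows. This is a sketch under stated assumptions, not the training script used for this model. The dictionary keys are chosen to match `transformers.TrainingArguments` parameter names, "strategy: epoch" is assumed to refer to the checkpoint-saving schedule, and `weighted_f1` and `compute_metrics_from_ids` are illustrative helpers showing how a `wf1` metric might be computed.

```python
from collections import Counter

# Hyperparameters from the list above, keyed by the matching
# transformers.TrainingArguments parameter names.
training_kwargs = {
    "seed": 42,
    "gradient_accumulation_steps": 1,
    "weight_decay": 0.01,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "num_train_epochs": 5,
    "learning_rate": 2e-5,
    "metric_for_best_model": "wf1",
    # Assumption: "strategy: epoch" refers to checkpoint saving.
    "save_strategy": "epoch",
}

# model_max_length is a tokenizer setting, not a TrainingArguments field.
MODEL_MAX_LENGTH = 256


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)
    total = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)


def compute_metrics_from_ids(y_true, y_pred):
    """Return metrics under the key named by metric_for_best_model."""
    return {"wf1": weighted_f1(y_true, y_pred)}


# Toy labels for illustration only (0 = Negative, 1 = Positive).
print(compute_metrics_from_ids([0, 0, 1, 1], [0, 1, 1, 1]))
```

With `transformers` installed, the dictionary could be passed as `TrainingArguments(output_dir=..., **training_kwargs)`. The evaluation schedule would need to be set to match, and the name of that parameter differs between library versions.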