Update README.md
README.md
CHANGED
@@ -12,23 +12,13 @@ library_name: transformers
|
|
12 |
|
13 |
# 5CD-ViSoBERT for Vietnamese Sentiment Analysis
|
14 |
|
15 |
-
<b>YOU ARE TOO BORED AND TIRED OF HAVING TO BUILD A VIETNAMESE
|
16 |
|
17 |
<b> BOOM! 🤯 NO WORRIES, WE'RE HERE FOR YOU =)) 🔥!</b>
|
18 |
|
19 |
-
This model is based on our pretrained [5CD-AI/visobert-14gb-corpus](https://huggingface.co/5CD-AI/visobert-14gb-corpus), which has been continuously trained on a 14GB dataset of Vietnamese social content.
|
20 |
|
21 |
-
Our model is fine-tuned on <b>120K Vietnamese sentiment datasets </b>, including comments and reviews from e-commerce platforms, social media, and forums
|
22 |
-
|
23 |
-
Our model outperforms on the datasets:
|
24 |
-
- SA-VLSP2016
|
25 |
-
- AIVIVN-2019
|
26 |
-
- UIT-VSFC
|
27 |
-
- UIT-VSMEC
|
28 |
-
- UIT-ViCTSD
|
29 |
-
- UIT-ViOCD
|
30 |
-
- UIT-ViSFD
|
31 |
-
- Vi-amazon-polar
|
32 |
|
33 |
The model will give softmax outputs for three labels.
|
34 |
|
@@ -41,7 +31,7 @@ The model will give softmax outputs for three labels.
|
|
41 |
```
|
42 |
|
43 |
## Dataset
|
44 |
-
Our training dataset.
|
45 |
<table border="2">
|
46 |
<tr align="center">
|
47 |
<th rowspan="2">Dataset</th>
|
@@ -365,63 +355,7 @@ Output:
|
|
365 |
[{'label': 'NEG', 'score': 0.998149037361145}]
|
366 |
```
|
367 |
|
368 |
-
###
|
369 |
-
|
370 |
-
```python
|
371 |
-
from transformers import AutoModelForSequenceClassification
|
372 |
-
from transformers import TFAutoModelForSequenceClassification
|
373 |
-
from transformers import AutoTokenizer, AutoConfig
|
374 |
-
import numpy as np
|
375 |
-
import torch
|
376 |
-
|
377 |
-
#### Load model
|
378 |
-
model_path = '/content/drive/MyDrive/AI/LLMs/Social LLMs/Social Bert/models/checkpoint-9040'
|
379 |
-
tokenizer = AutoTokenizer.from_pretrained(model_path)
|
380 |
-
config = AutoConfig.from_pretrained(model_path)
|
381 |
-
model = AutoModelForSequenceClassification.from_pretrained(model_path).to("cuda")
|
382 |
-
|
383 |
-
# sentence = 'Đây là mô hình rất hay, đáp ứng tốt nhu cầu của nhiều doanh nghiệp Việt.'
|
384 |
-
# sentence = 'Cà phê dở ko ngon, ai chưa mua thì đừng mua'
|
385 |
-
# sentence = 'Cũng tạm. Ko gì đb'
|
386 |
-
# sentence = 'Nịnh bợ vừa thui báo ơi.nhà từ trong trứng ra mà sao sáng đc.'
|
387 |
-
|
388 |
-
# sentence = '''Dm mới kéo cái tuột luôn cái kính cường lực🙂
|
389 |
-
# R phải cầm cái kính tự dán🙂 để lâu quá nó dính hai cục bụi lên nữa chứ má bực thiệt chứ
|
390 |
-
# Hình như tại hai cái cục nam châm nó xúc ra 😑'''
|
391 |
-
|
392 |
-
# sentence = 'Mấy cái khóa kiểu này ông lên youtube tự học còn ngon hơn.'
|
393 |
-
sentence = 'Cũng giống mấy khoá Youtube học cũng được'
|
394 |
-
print('Sentence: ', sentence)
|
395 |
-
|
396 |
-
input_ids = torch.tensor([tokenizer.encode(sentence)]).to("cuda")
|
397 |
-
|
398 |
-
with torch.no_grad():
|
399 |
-
out = model(input_ids)
|
400 |
-
scores = out.logits.softmax(dim=-1).cpu().numpy()[0]
|
401 |
-
# print(out.logits.softmax(dim=-1).tolist())
|
402 |
-
|
403 |
-
# Print labels and scores
|
404 |
-
ranking = np.argsort(scores)
|
405 |
-
ranking = ranking[::-1]
|
406 |
-
print("### Sentiment score ####")
|
407 |
-
for i in range(scores.shape[0]):
|
408 |
-
l = config.id2label[ranking[i]]
|
409 |
-
s = scores[ranking[i]]
|
410 |
-
print(f"{i+1}) {l}: {np.round(float(s), 4)}")
|
411 |
-
```
|
412 |
-
|
413 |
-
Output:
|
414 |
-
|
415 |
-
```
|
416 |
-
Sentence: Cũng giống mấy khoá Youtube học cũng được
|
417 |
-
### Sentiment score ####
|
418 |
-
1) NEU: 0.8928
|
419 |
-
2) NEG: 0.0586
|
420 |
-
3) POS: 0.0486
|
421 |
-
|
422 |
-
```
|
423 |
-
|
424 |
-
### Other example
|
425 |
```
|
426 |
Sentence: Đây là mô hình rất hay, đáp ứng tốt nhu cầu của nhiều doanh nghiệp Việt.
|
427 |
### Sentiment score ####
|
@@ -506,7 +440,49 @@ Sentence: Mấy cái khóa kiểu này ông lên youtube tự học còn ngon h
|
|
506 |
3) NEU: 0.0024
|
507 |
```
|
508 |
|
509 |
|
|
|
510 |
|
511 |
|
512 |
## Fine-tune Configuration
|
|
|
12 |
|
13 |
# 5CD-ViSoBERT for Vietnamese Sentiment Analysis
|
14 |
|
15 |
+
<b>ARE YOU BORED AND TIRED OF HAVING TO BUILD A 🇻🇳 VIETNAMESE SENTIMENT ANALYSIS MODEL OVER AND OVER AGAIN?</b>
|
16 |
|
17 |
<b> BOOM! 🤯 NO WORRIES, WE'RE HERE FOR YOU =)) 🔥!</b>
|
18 |
|
19 |
+
This model is based on our pretrained [5CD-AI/visobert-14gb-corpus](https://huggingface.co/5CD-AI/visobert-14gb-corpus), which was continually pretrained on a 14GB corpus of Vietnamese social content, so it performs well on comments that mix sentiment with emojis 😂👍💬🔥
|
20 |
|
21 |
+
Our model is fine-tuned on <b>120K samples from Vietnamese sentiment datasets</b>, including comments and reviews from e-commerce platforms, social media, and forums. The training data draws on a diverse range of datasets: SA-VLSP2016, AIVIVN-2019, UIT-VSFC, UIT-VSMEC, UIT-ViCTSD, UIT-ViOCD, UIT-ViSFD, Vi-amazon-reviews, Tiki-reviews.
|
22 |
|
23 |
The model will give softmax outputs for three labels.
|
24 |
|
|
|
31 |
```
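
As a rough illustration of what "softmax outputs for three labels" means, the sketch below turns three raw logits into probabilities that sum to 1. The logit values are made up for this example, and the `softmax` helper is written here with NumPy; it is not part of the model's own code.

```python
import numpy as np

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical logits for one sentence, one value per label (not real model output).
example_logits = [2.0, -1.0, 0.5]
probs = softmax(example_logits)
print(probs, probs.sum())
```

The largest logit always receives the largest probability, so the predicted label is simply the index of the maximum.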
|
32 |
|
33 |
## Dataset
|
34 |
+
Our training dataset. Because the original labels of UIT-VSMEC, UIT-ViCTSD, and VOZ-HSD are ambiguous, we re-labeled these datasets with the Gemini 1.5 Flash API to match the three sentiment labels. The specific number of samples for each dataset is shown below:
|
35 |
<table border="2">
|
36 |
<tr align="center">
|
37 |
<th rowspan="2">Dataset</th>
|
|
|
355 |
[{'label': 'NEG', 'score': 0.998149037361145}]
|
356 |
```
|
357 |
|
358 |
+
### Other examples
|
359 |
```
|
360 |
Sentence: Đây là mô hình rất hay, đáp ứng tốt nhu cầu của nhiều doanh nghiệp Việt.
|
361 |
### Sentiment score ####
|
|
|
440 |
3) NEU: 0.0024
|
441 |
```
|
442 |
|
443 |
+
### Full classification
|
444 |
+
|
445 |
+
```python
|
446 |
+
from transformers import AutoModelForSequenceClassification
|
447 |
+
from transformers import TFAutoModelForSequenceClassification
|
448 |
+
from transformers import AutoTokenizer, AutoConfig
|
449 |
+
import numpy as np
|
450 |
+
import torch
|
451 |
+
|
452 |
+
#### Load model
|
453 |
+
model_path = '5CD-AI/Vietnamese-sentiment-visobert-base'
|
454 |
+
tokenizer = AutoTokenizer.from_pretrained(model_path)
|
455 |
+
config = AutoConfig.from_pretrained(model_path)
|
456 |
+
model = AutoModelForSequenceClassification.from_pretrained(model_path).to("cuda" if torch.cuda.is_available() else "cpu")
|
457 |
+
|
458 |
+
sentence = 'Cũng giống mấy khoá Youtube học cũng được'
|
459 |
+
print('Sentence: ', sentence)
|
460 |
+
|
461 |
+
input_ids = torch.tensor([tokenizer.encode(sentence)]).to(model.device)
|
462 |
+
|
463 |
+
with torch.no_grad():
|
464 |
+
    out = model(input_ids)
|
465 |
+
    scores = out.logits.softmax(dim=-1).cpu().numpy()[0]
|
466 |
+
|
467 |
+
# Print labels and scores
|
468 |
+
ranking = np.argsort(scores)
|
469 |
+
ranking = ranking[::-1]
|
470 |
+
print("### Sentiment score ####")
|
471 |
+
for i in range(scores.shape[0]):
|
472 |
+
    label = config.id2label[ranking[i]]
|
473 |
+
    score = scores[ranking[i]]
|
474 |
+
    print(f"{i+1}) {label}: {np.round(float(score), 4)}")
|
475 |
+
```
|
476 |
+
Output:
|
477 |
+
|
478 |
+
```
|
479 |
+
Sentence: Cũng giống mấy khoá Youtube học cũng được
|
480 |
+
### Sentiment score ####
|
481 |
+
1) NEU: 0.8928
|
482 |
+
2) NEG: 0.0586
|
483 |
+
3) POS: 0.0486
|
484 |
|
485 |
+
```
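
The ranking step in the example above can also be wrapped in a small helper, which makes it easier to reuse for many sentences. This is a minimal sketch: `rank_labels` and the `ID2LABEL` mapping are hypothetical names introduced here, and the index-to-label order is an assumption, so in practice read the mapping from `config.id2label`. The scores are the ones printed in the output above.

```python
import numpy as np

def rank_labels(scores, id2label):
    """Return (label, score) pairs sorted from most to least likely."""
    scores = np.asarray(scores, dtype=np.float64)
    order = np.argsort(scores)[::-1]  # indices from highest to lowest score
    return [(id2label[int(i)], round(float(scores[i]), 4)) for i in order]

# Hypothetical mapping for illustration; use config.id2label with the real model.
ID2LABEL = {0: "NEG", 1: "POS", 2: "NEU"}

ranked = rank_labels([0.0586, 0.0486, 0.8928], ID2LABEL)
for position, (label, score) in enumerate(ranked, start=1):
    print(f"{position}) {label}: {score}")
```

Returning a list of pairs instead of printing directly keeps the helper usable in batch jobs, where the top label is usually just `ranked[0][0]`.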
|
486 |
|
487 |
|
488 |
## Fine-tune Configuration
|