khang119966 committed
Commit a7e76de · verified · 1 parent: 29ed3a1

Update README.md

Files changed (1):
  1. README.md +47 -71
README.md CHANGED
@@ -12,23 +12,13 @@ library_name: transformers
 
 # 5CD-ViSoBERT for Vietnamese Sentiment Analysis
 
- <b>YOU ARE TOO BORED AND TIRED OF HAVING TO BUILD A VIETNAMESE 🇻🇳 SENTIMENT ANALYSIS MODEL OVER AND OVER AGAIN?</b>
+ <b>ARE YOU BORED AND TIRED OF HAVING TO BUILD A 🇻🇳 VIETNAMESE SENTIMENT ANALYSIS MODEL OVER AND OVER AGAIN?</b>
 
 <b> BOOM! 🤯 NO WORRIES, WE'RE HERE FOR YOU =)) 🔥!</b>
 
- This model is based on our pretrained [5CD-AI/visobert-14gb-corpus](https://huggingface.co/5CD-AI/visobert-14gb-corpus), which has been continuously trained on a 14GB dataset of Vietnamese social content.
+ This model is based on our pretrained [5CD-AI/visobert-14gb-corpus](https://huggingface.co/5CD-AI/visobert-14gb-corpus), which has been continuously trained on a 14GB dataset of Vietnamese social content. As a result, it performs well on comments that express sentiment with emojis 😂👍💬🔥.
 
- Our model is fine-tuned on <b>120K Vietnamese sentiment datasets </b>, including comments and reviews from e-commerce platforms, social media, and forums
-
- Our model outperforms on the datasets:
- - SA-VLSP2016
- - AIVIVN-2019
- - UIT-VSFC
- - UIT-VSMEC
- - UIT-ViCTSD
- - UIT-ViOCD
- - UIT-ViSFD
- - Vi-amazon-polar
+ Our model is fine-tuned on <b>120K Vietnamese sentiment samples</b>, including comments and reviews from e-commerce platforms, social media, and forums, drawn from a diverse range of datasets: SA-VLSP2016, AIVIVN-2019, UIT-VSFC, UIT-VSMEC, UIT-ViCTSD, UIT-ViOCD, UIT-ViSFD, Vi-amazon-reviews, and Tiki-reviews.
 
 The model will give softmax outputs for three labels.
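The three-label softmax output described above maps onto the standard `transformers` text-classification pipeline. The snippet below is a minimal usage sketch rather than part of the committed README; it assumes the `5CD-AI/Vietnamese-sentiment-visobert-base` checkpoint id that appears in the "Full classification" code further down this diff, and the printed scores are illustrative only.

```python
from transformers import pipeline

# Minimal sketch; the checkpoint id is assumed from the "Full classification" snippet below.
classifier = pipeline(
    "text-classification",
    model="5CD-AI/Vietnamese-sentiment-visobert-base",
    top_k=None,  # return scores for all three labels (POS / NEG / NEU)
)

print(classifier("Cà phê dở ko ngon, ai chưa mua thì đừng mua"))
# Illustrative output shape: [[{'label': 'NEG', 'score': ...}, {'label': 'NEU', ...}, {'label': 'POS', ...}]]
```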
 
@@ -41,7 +31,7 @@ The model will give softmax outputs for three labels.
 ```
 
 ## Dataset
- Our training dataset. With UIT-VSMEC, UIT-ViCTSD, VOZ-HSD, we re-label the dataset with Gemini 1.5 Flash API follow the 3 labels.
+ Our training dataset: because of label ambiguity, we re-labelled UIT-VSMEC, UIT-ViCTSD, and VOZ-HSD with the Gemini 1.5 Flash API, following the same 3 labels. The specific number of samples for each dataset can be seen below:
 <table border="2">
 <tr align="center">
 <th rowspan="2">Dataset</th>
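The Gemini 1.5 Flash re-labelling mentioned in this hunk could look roughly like the sketch below. The `google-generativeai` client calls are standard, but the prompt wording, label mapping, and helper function are assumptions for illustration, not the authors' actual re-labelling script.

```python
import google.generativeai as genai

# Hypothetical re-labelling sketch; prompt and helper are illustrative assumptions.
genai.configure(api_key="YOUR_API_KEY")  # placeholder key
gemini = genai.GenerativeModel("gemini-1.5-flash")

def relabel(comment: str) -> str:
    """Map one Vietnamese comment onto the 3-label scheme (POS / NEG / NEU)."""
    prompt = (
        "Classify the sentiment of the following Vietnamese comment as exactly one of "
        "POS, NEG, or NEU. Reply with the label only.\n\n" + comment
    )
    response = gemini.generate_content(prompt)
    return response.text.strip()

print(relabel("Cũng tạm. Ko gì đb"))  # e.g. NEU
```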
@@ -365,63 +355,7 @@ Output:
 [{'label': 'NEG', 'score': 0.998149037361145}]
 ```
 
- ### Full classification
-
- ```python
- from transformers import AutoModelForSequenceClassification
- from transformers import TFAutoModelForSequenceClassification
- from transformers import AutoTokenizer, AutoConfig
- import numpy as np
- import torch
-
- #### Load model
- model_path = '/content/drive/MyDrive/AI/LLMs/Social LLMs/Social Bert/models/checkpoint-9040'
- tokenizer = AutoTokenizer.from_pretrained(model_path)
- config = AutoConfig.from_pretrained(model_path)
- model = AutoModelForSequenceClassification.from_pretrained(model_path).to("cuda")
-
- # sentence = 'Đây là mô hình rất hay, đáp ứng tốt nhu cầu của nhiều doanh nghiệp Việt.'
- # sentence = 'Cà phê dở ko ngon, ai chưa mua thì đừng mua'
- # sentence = 'Cũng tạm. Ko gì đb'
- # sentence = 'Nịnh bợ vừa thui báo ơi.nhà từ trong trứng ra mà sao sáng đc.'
-
- # sentence = '''Dm mới kéo cái tuột luôn cái kính cường lực🙂
- # R phải cầm cái kính tự dán🙂 để lâu quá nó dính hai cục bụi lên nữa chứ má bực thiệt chứ
- # Hình như tại hai cái cục nam châm nó xúc ra 😑'''
-
- # sentence = 'Mấy cái khóa kiểu này ông lên youtube tự học còn ngon hơn.'
- sentence = 'Cũng giống mấy khoá Youtube học cũng được'
- print('Sentence: ', sentence)
-
- input_ids = torch.tensor([tokenizer.encode(sentence)]).to("cuda")
-
- with torch.no_grad():
-     out = model(input_ids)
-     scores = out.logits.softmax(dim=-1).cpu().numpy()[0]
-     # print(out.logits.softmax(dim=-1).tolist())
-
- # Print labels and scores
- ranking = np.argsort(scores)
- ranking = ranking[::-1]
- print("### Sentiment score ####")
- for i in range(scores.shape[0]):
-     l = config.id2label[ranking[i]]
-     s = scores[ranking[i]]
-     print(f"{i+1}) {l}: {np.round(float(s), 4)}")
- ```
-
- Output:
-
- ```
- Sentence: Cũng giống mấy khoá Youtube học cũng được
- ### Sentiment score ####
- 1) NEU: 0.8928
- 2) NEG: 0.0586
- 3) POS: 0.0486
-
- ```
-
- ### Other example
+ ### Other examples
 ```
 Sentence: Đây là mô hình rất hay, đáp ứng tốt nhu cầu của nhiều doanh nghiệp Việt.
 ### Sentiment score ####
@@ -506,7 +440,49 @@ Sentence: Mấy cái khóa kiểu này ông lên youtube tự học còn ngon h
 3) NEU: 0.0024
 ```
 
+ ### Full classification
+
+ ```python
+ from transformers import AutoModelForSequenceClassification
+ from transformers import TFAutoModelForSequenceClassification
+ from transformers import AutoTokenizer, AutoConfig
+ import numpy as np
+ import torch
+
+ #### Load model
+ model_path = '5CD-AI/Vietnamese-sentiment-visobert-base'
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
+ config = AutoConfig.from_pretrained(model_path)
+ model = AutoModelForSequenceClassification.from_pretrained(model_path).to("cuda")
+
+ sentence = 'Cũng giống mấy khoá Youtube học cũng được'
+ print('Sentence: ', sentence)
+
+ input_ids = torch.tensor([tokenizer.encode(sentence)]).to("cuda")
+
+ with torch.no_grad():
+     out = model(input_ids)
+     scores = out.logits.softmax(dim=-1).cpu().numpy()[0]
+
+ # Print labels and scores
+ ranking = np.argsort(scores)
+ ranking = ranking[::-1]
+ print("### Sentiment score ####")
+ for i in range(scores.shape[0]):
+     l = config.id2label[ranking[i]]
+     s = scores[ranking[i]]
+     print(f"{i+1}) {l}: {np.round(float(s), 4)}")
+ ```
+ Output:
+
+ ```
+ Sentence: Cũng giống mấy khoá Youtube học cũng được
+ ### Sentiment score ####
+ 1) NEU: 0.8928
+ 2) NEG: 0.0586
+ 3) POS: 0.0486
+
+ ```
 
 
 ## Fine-tune Configuration
 