5CD-AI/Vietnamese-Sentiment-visobert

@@ -12,23 +12,13 @@ library_name: transformers
 # 5CD-ViSoBERT for Vietnamese Sentiment Analysis
-<b>YOU ARE TOO BORED AND TIRED OF HAVING TO BUILD A VIETNAMESE 🇻🇳 SENTIMENT ANALYSIS MODEL OVER AND OVER AGAIN?</b>
 <b> BOOM! 🤯 NO WORRIES, WE'RE HERE FOR YOU =)) 🔥!</b>
-This model is based on our pretrained [5CD-AI/visobert-14gb-corpus](https://huggingface.co/5CD-AI/visobert-14gb-corpus), which has been continuously trained on a 14GB dataset of Vietnamese social content.
-Our model is fine-tuned on <b>120K Vietnamese sentiment datasets </b>, including comments and reviews from e-commerce platforms, social media, and forums
-Our model outperforms on the datasets:
-- SA-VLSP2016
-- AIVIVN-2019
-- UIT-VSFC
-- UIT-VSMEC
-- UIT-ViCTSD
-- UIT-ViOCD
-- UIT-ViSFD
-- Vi-amazon-polar
 The model will give softmax outputs for three labels.
@@ -41,7 +31,7 @@ The model will give softmax outputs for three labels.
 ```
 ## Dataset
-Our training dataset. With UIT-VSMEC, UIT-ViCTSD, VOZ-HSD, we re-label the dataset with Gemini 1.5 Flash API follow the 3 labels.
 <table border="2">
     <tr align="center">
         <th rowspan="2">Dataset</th>
@@ -365,63 +355,7 @@ Output:
 [{'label': 'NEG', 'score': 0.998149037361145}]
 ```
-### Full classification
-```python
-from transformers import AutoModelForSequenceClassification
-from transformers import TFAutoModelForSequenceClassification
-from transformers import AutoTokenizer, AutoConfig
-import numpy as np
-import torch
-#### Load model
-model_path = '/content/drive/MyDrive/AI/LLMs/Social LLMs/Social Bert/models/checkpoint-9040'
-tokenizer = AutoTokenizer.from_pretrained(model_path)
-config = AutoConfig.from_pretrained(model_path)
-model = AutoModelForSequenceClassification.from_pretrained(model_path).to("cuda")
-# sentence = 'Đây là mô hình rất hay, đáp ứng tốt nhu cầu của nhiều doanh nghiệp Việt.'
-# sentence = 'Cà phê dở ko ngon, ai chưa mua thì đừng mua'
-# sentence = 'Cũng tạm. Ko gì đb'
-# sentence = 'Nịnh bợ vừa thui báo ơi.nhà từ trong trứng ra mà sao sáng đc.'
-# sentence = '''Dm mới kéo cái tuột luôn cái kính cường lực🙂
-# R phải cầm cái kính tự dán🙂 để lâu quá nó dính hai cục bụi lên nữa chứ má bực thiệt chứ
-# Hình như tại hai cái cục nam châm nó xúc ra 😑'''
-# sentence = 'Mấy cái khóa kiểu này ông lên youtube tự học còn ngon hơn.'
-sentence = 'Cũng giống mấy khoá Youtube học cũng được'
-print('Sentence: ', sentence)
-input_ids = torch.tensor([tokenizer.encode(sentence)]).to("cuda")
-with torch.no_grad():
-    out = model(input_ids)
-    scores = out.logits.softmax(dim=-1).cpu().numpy()[0]
-    # print(out.logits.softmax(dim=-1).tolist())
-# Print labels and scores
-ranking = np.argsort(scores)
-ranking = ranking[::-1]
-print("### Sentiment score ####")
-for i in range(scores.shape[0]):
-    l = config.id2label[ranking[i]]
-    s = scores[ranking[i]]
-    print(f"{i+1}) {l}: {np.round(float(s), 4)}")
-```
-Output:
-```
-Sentence:  Cũng giống mấy khoá Youtube học cũng được
-### Sentiment score ####
-1) NEU: 0.8928
-2) NEG: 0.0586
-3) POS: 0.0486
-```
-### Other example
 ```
 Sentence:  Đây là mô hình rất hay, đáp ứng tốt nhu cầu của nhiều doanh nghiệp Việt.
 ### Sentiment score ####
@@ -506,7 +440,49 @@ Sentence:  Mấy cái khóa kiểu này ông lên youtube tự học còn ngon h
 3) NEU: 0.0024
 ```
 ## Fine-tune Configuration

 # 5CD-ViSoBERT for Vietnamese Sentiment Analysis
+<b>ARE YOU BORED AND TIRED OF HAVING TO BUILD A 🇻🇳 VIETNAMESE SENTIMENT ANALYSIS MODEL OVER AND OVER AGAIN?</b>
 <b> BOOM! 🤯 NO WORRIES, WE'RE HERE FOR YOU =)) 🔥!</b>
+This model is based on our pretrained [5CD-AI/visobert-14gb-corpus](https://huggingface.co/5CD-AI/visobert-14gb-corpus), which underwent continued training on a 14GB dataset of Vietnamese social content. As a result, it performs well on comments whose sentiment is accompanied by emojis 😂👍💬🔥
+Our model is fine-tuned on <b>120K samples from Vietnamese sentiment datasets</b>, including comments and reviews from e-commerce platforms, social media, and forums. The training data covers a diverse range of datasets: SA-VLSP2016, AIVIVN-2019, UIT-VSFC, UIT-VSMEC, UIT-ViCTSD, UIT-ViOCD, UIT-ViSFD, Vi-amazon-reviews, Tiki-reviews.
 The model will give softmax outputs for three labels.
 ```
 ## Dataset
+Our training dataset. Because of label ambiguity, we re-labeled UIT-VSMEC, UIT-ViCTSD, and VOZ-HSD with the Gemini 1.5 Flash API to follow the three sentiment labels. The number of samples for each dataset is shown below:
 <table border="2">
     <tr align="center">
         <th rowspan="2">Dataset</th>
 [{'label': 'NEG', 'score': 0.998149037361145}]
 ```
+### Other examples
 ```
 Sentence:  Đây là mô hình rất hay, đáp ứng tốt nhu cầu của nhiều doanh nghiệp Việt.
 ### Sentiment score ####
 3) NEU: 0.0024
 ```
+### Full classification
+```python
+from transformers import AutoModelForSequenceClassification
+from transformers import AutoTokenizer, AutoConfig
+import numpy as np
+import torch
+# Load the tokenizer, config, and model; fall back to CPU when no GPU is available
+model_path = '5CD-AI/Vietnamese-sentiment-visobert-base'
+device = "cuda" if torch.cuda.is_available() else "cpu"
+tokenizer = AutoTokenizer.from_pretrained(model_path)
+config = AutoConfig.from_pretrained(model_path)
+model = AutoModelForSequenceClassification.from_pretrained(model_path).to(device)
+sentence = 'Cũng giống mấy khoá Youtube học cũng được'
+print('Sentence: ', sentence)
+# Encode the sentence and run a forward pass without tracking gradients
+input_ids = torch.tensor([tokenizer.encode(sentence)]).to(device)
+with torch.no_grad():
+    out = model(input_ids)
+    scores = out.logits.softmax(dim=-1).cpu().numpy()[0]
+# Print labels and scores, highest score first
+ranking = np.argsort(scores)[::-1]
+print("### Sentiment score ####")
+for i in range(scores.shape[0]):
+    label = config.id2label[int(ranking[i])]
+    score = scores[ranking[i]]
+    print(f"{i+1}) {label}: {np.round(float(score), 4)}")
+```
+Output:
+```
+Sentence:  Cũng giống mấy khoá Youtube học cũng được
+### Sentiment score ####
+1) NEU: 0.8928
+2) NEG: 0.0586
+3) POS: 0.0486
+```
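The softmax-and-ranking step of the full classification example can be tried without downloading the model. The sketch below is only an illustration under stated assumptions: the logits are made up, the `ID2LABEL` order is hypothetical (the real mapping should be read from `config.id2label`), and `softmax` and `rank_labels` are helper names introduced here rather than part of the model card.

```python
import math

# Hypothetical label order, for illustration only; read the real
# mapping from the model's config.id2label.
ID2LABEL = {0: "NEG", 1: "POS", 2: "NEU"}


def softmax(logits):
    """Numerically stable softmax over a list of raw logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rank_labels(logits, id2label):
    """Return (label, probability) pairs, most likely label first."""
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in order]


# Made-up logits, not real model output
ranked = rank_labels([0.1, -0.2, 2.8], ID2LABEL)
print("### Sentiment score ####")
for position, (label, prob) in enumerate(ranked, start=1):
    print(f"{position}) {label}: {prob:.4f}")
```

Subtracting the largest logit before exponentiating leaves the probabilities unchanged and avoids overflow on large logits.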
 ## Fine-tune Configuration