---
language:
- vi
metrics:
- accuracy
- f1
tags:
- sentiment-analysis
- social-listening
---

# 5CD-ViSoBERT for Vietnamese Sentiment Analysis

This model is based on our pretrained [5CD-AI/visobert-14gb-corpus](https://huggingface.co/5CD-AI/visobert-14gb-corpus), which was continually pretrained on a 14GB corpus of Vietnamese social-media content.

We fine-tuned it on <b>120K</b> Vietnamese sentiment-annotated samples, comprising comments and reviews from e-commerce platforms, social media, and forums.

The model produces softmax scores over three labels.

<b>Labels</b>:

```
0 -> Negative
1 -> Positive
2 -> Neutral
```
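
The label names shown in the outputs below (`NEG`, `POS`, `NEU`) come from the model config's `id2label` mapping. A minimal sketch for inspecting it (the expected mapping is inferred from the label table and example outputs in this card):

```python
from transformers import AutoConfig

# Inspect the id -> label-name mapping shipped with the model
config = AutoConfig.from_pretrained("5CD-AI/vietnamese-sentiment-visobert")
print(config.id2label)  # expected: {0: 'NEG', 1: 'POS', 2: 'NEU'}
```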

## Usage (HuggingFace Transformers)

Install the `transformers` package:

```bash
pip install transformers
```

### Pipeline

```python
from transformers import pipeline

model_path = '5CD-AI/vietnamese-sentiment-visobert'
sentiment_task = pipeline("sentiment-analysis", model=model_path, tokenizer=model_path)
sentiment_task("Miếng dán dễ xước , ko khít với dt 11 prm")
```
```
[{'label': 'NEG', 'score': 0.998149037361145}]
```
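
To get scores for all three labels rather than only the top one, recent `transformers` versions accept a `top_k` argument on the pipeline call; a minimal sketch:

```python
# top_k=None returns one entry per label instead of only the best label
sentiment_task("Miếng dán dễ xước , ko khít với dt 11 prm", top_k=None)
```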

### Full classification

```python
import numpy as np
import torch
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer

# Load model (runs on GPU if available, otherwise CPU)
model_path = '5CD-AI/vietnamese-sentiment-visobert'
tokenizer = AutoTokenizer.from_pretrained(model_path)
config = AutoConfig.from_pretrained(model_path)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForSequenceClassification.from_pretrained(model_path).to(device)

# Other example sentences to try:
# sentence = 'Đây là mô hình rất hay, đáp ứng tốt nhu cầu của nhiều doanh nghiệp Việt.'
# sentence = 'Cà phê dở ko ngon, ai chưa mua thì đừng mua'
# sentence = 'Cũng tạm. Ko gì đb'
# sentence = 'Nịnh bợ vừa thui báo ơi.nhà từ trong trứng ra mà sao sáng đc.'
# sentence = '''Dm mới kéo cái tuột luôn cái kính cường lực🙂
# R phải cầm cái kính tự dán🙂 để lâu quá nó dính hai cục bụi lên nữa chứ má bực thiệt chứ
# Hình như tại hai cái cục nam châm nó xúc ra 😑'''
# sentence = 'Mấy cái khóa kiểu này ông lên youtube tự học còn ngon hơn.'
sentence = 'Cũng giống mấy khoá Youtube học cũng được'
print('Sentence:', sentence)

# Encode the sentence and run a forward pass
input_ids = torch.tensor([tokenizer.encode(sentence)]).to(device)
with torch.no_grad():
    out = model(input_ids)
    scores = out.logits.softmax(dim=-1).cpu().numpy()[0]

# Print labels and scores, ranked from highest to lowest
ranking = np.argsort(scores)[::-1]
print("### Sentiment score ####")
for i in range(scores.shape[0]):
    label = config.id2label[ranking[i]]
    score = scores[ranking[i]]
    print(f"{i+1}) {label}: {np.round(float(score), 4)}")
```

Output:

```
Sentence: Cũng giống mấy khoá Youtube học cũng được
### Sentiment score ####
1) NEU: 0.8928
2) NEG: 0.0586
3) POS: 0.0486
```
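
For many sentences at once, the tokenizer can pad and truncate a whole batch in one call. A minimal batched sketch reusing `tokenizer`, `model`, `config`, and `device` from the block above (the sentence list is illustrative):

```python
sentences = [
    'Cà phê dở ko ngon, ai chưa mua thì đừng mua',
    'Đây là mô hình rất hay, đáp ứng tốt nhu cầu của nhiều doanh nghiệp Việt.',
]
# Pad/truncate the batch to the model's 256-token limit and move it to the device
batch = tokenizer(sentences, padding=True, truncation=True,
                  max_length=256, return_tensors="pt").to(device)
with torch.no_grad():
    probs = model(**batch).logits.softmax(dim=-1)
for sent, p in zip(sentences, probs):
    pred = p.argmax().item()
    print(f"{config.id2label[pred]} ({p[pred].item():.4f}) <- {sent}")
```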

## Fine-tuning Configuration

We fine-tuned `5CD-AI/visobert-14gb-corpus` on the downstream task with the `transformers` library, using the following configuration (an illustrative `TrainingArguments` sketch follows the list):
- seed: 42
- gradient_accumulation_steps: 1
- weight_decay: 0.01
- optimizer: AdamW with betas=(0.9, 0.999) and epsilon=1e-08
- training_epochs: 5
- model_max_length: 256
- learning_rate: 2e-5
- metric_for_best_model: wf1
- evaluation and save strategy: epoch
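
These settings correspond roughly to the `TrainingArguments` below. This is a sketch, not the authors' training script: the output path is hypothetical, and `wf1` is assumed to be a weighted F1 score returned by a custom `compute_metrics` function.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="visobert-sentiment",  # hypothetical output path
    seed=42,
    num_train_epochs=5,
    learning_rate=2e-5,
    weight_decay=0.01,
    gradient_accumulation_steps=1,
    # AdamW with betas=(0.9, 0.999) and epsilon=1e-08 is the Trainer default;
    # model_max_length=256 is set on the tokenizer, not here
    eval_strategy="epoch",  # named `evaluation_strategy` in older transformers releases
    save_strategy="epoch",
    metric_for_best_model="wf1",  # assumes compute_metrics returns a "wf1" entry
    load_best_model_at_end=True,  # assumption: required for metric_for_best_model to apply
)
```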