hplisiecki
committed on
Update README.md
README.md
@@ -1,3 +1,73 @@
---
license: mit
---
# Polemo Intensity Model

This model is fine-tuned for emotion intensity detection in Polish political texts. It is based on the RoBERTa transformer model and has been specifically trained to recognize a range of emotions in text data scraped from various social media platforms.

## Model Details

### Database Preparation

Our research uses a database of Polish political texts collected from the social media profiles (YouTube, Twitter, and Facebook) of 25 journalists, 25 politicians, and 19 non-governmental organizations (NGOs). For each profile, we scraped all available posts from each platform, going back to the beginning of 2019. In addition, we included texts written by non-professional commentators on social affairs. The dataset consists of 1,246,337 text snippets:
- Twitter: 789,490 tweets
- YouTube: 42,252 comments
- Facebook: 414,595 posts

The texts were processed to fit transformer models' length constraints. Facebook texts were split into sentences, and all texts longer than 280 characters were removed. Non-Polish texts were filtered out using the `langdetect` library, and all online links and usernames were replaced with placeholders. For training, we selected texts with higher emotional content, which yielded a final dataset of 10,000 texts annotated by 20 expert annotators.
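
The filtering steps described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the placeholder strings `_link_` and `_user_`, the regular expressions, and the order of the steps are assumptions, and the sentence splitting of Facebook texts is not covered. The language check is passed in as a function so that the sketch also runs without `langdetect` installed.

```python
import re
from typing import Callable, Iterable, List

MAX_CHARS = 280  # length cutoff stated in the text above

# Assumed patterns; the source does not say how links and usernames were matched.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@\w+")


def detect_polish(text: str) -> bool:
    """Return True if langdetect labels the text as Polish."""
    # Imported lazily so the rest of the sketch works without the package.
    from langdetect import detect  # third-party: pip install langdetect
    from langdetect.lang_detect_exception import LangDetectException

    try:
        return detect(text) == "pl"
    except LangDetectException:
        # Raised for inputs with no detectable features (e.g. only digits).
        return False


def preprocess(
    texts: Iterable[str],
    is_polish: Callable[[str], bool] = detect_polish,
    link_token: str = "_link_",  # hypothetical placeholder
    user_token: str = "_user_",  # hypothetical placeholder
) -> List[str]:
    """Drop over-long and non-Polish texts, then mask links and usernames."""
    cleaned = []
    for text in texts:
        text = text.strip()
        if not text or len(text) > MAX_CHARS:
            continue
        if not is_polish(text):
            continue
        text = URL_RE.sub(link_token, text)
        text = USER_RE.sub(user_token, text)
        cleaned.append(text)
    return cleaned
```

Calling `preprocess(texts)` uses `langdetect`; passing a custom `is_polish` function swaps in any other language filter.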

### Annotation Process

The final dataset was annotated for six emotions and two affective dimensions (valence and arousal):
- Happiness
- Sadness
- Anger
- Disgust
- Fear
- Pride
- Valence
- Arousal

Annotators used a 5-point scale for each emotion and dimension. Each text was rated by five different annotators to improve consistency and reduce the influence of individual subjectivity.
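
One common way to turn five ratings per text into a single training target is to average them per dimension. The sketch below assumes this averaging step; the source does not state how the ratings were combined, and the function and variable names are hypothetical.

```python
from statistics import mean
from typing import Dict, List

DIMENSIONS = [
    "happiness", "sadness", "anger", "disgust",
    "fear", "pride", "valence", "arousal",
]


def aggregate_ratings(ratings: List[Dict[str, int]]) -> Dict[str, float]:
    """Average the ratings of all annotators for one text.

    `ratings` holds one dict per annotator, mapping each dimension
    name to that annotator's rating on the 5-point scale.
    """
    if not ratings:
        raise ValueError("at least one annotator rating is required")
    # Assumption: a plain mean per dimension, with no annotator weighting.
    return {dim: mean(r[dim] for r in ratings) for dim in DIMENSIONS}
```

Averaging keeps the target on the same scale as the individual ratings, which makes the model output directly comparable to the human scores.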

### Model Training

We considered two base models: the TrelBERT transformer model and the Polish RoBERTa model. The final model is based on the RoBERTa transformer model and was fine-tuned using a Bayesian grid search for hyperparameter optimization. The training involved:
- Dropout: 0.6
- Learning rate: 5e-5
- Weight decay: 0.3
- Warmup steps: 600
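
To make the listed values concrete, the sketch below stores them in a small config object and shows how a warmup of 600 steps could scale the learning rate. The linear warmup shape and the constant rate afterwards are assumptions; the source gives only the number of warmup steps, not the schedule. `TrainingConfig` and `warmup_learning_rate` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameter values listed above."""
    dropout: float = 0.6
    learning_rate: float = 5e-5
    weight_decay: float = 0.3
    warmup_steps: int = 600


def warmup_learning_rate(step: int, config: TrainingConfig = TrainingConfig()) -> float:
    """Learning rate at a given optimizer step.

    Assumption: the rate rises linearly from 0 to the peak value over
    `warmup_steps` and stays at the peak afterwards.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    scale = min(1.0, step / config.warmup_steps)
    return config.learning_rate * scale
```

Under these assumptions, the rate is halfway to its peak at step 300 and reaches 5e-5 at step 600.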

### Results

The model's predictions correlated strongly with human ratings, particularly for happiness and valence, where correlations reached 0.87. The results for the other emotions were also substantial, indicating that the model captures a wide range of emotional states.

### K-Fold Validation

A 10-fold cross-validation showed high reliability across different emotional dimensions:
- Happiness: 0.83
- Sadness: 0.68
- Anger: 0.81
- Disgust: 0.75
- Fear: 0.67
- Pride: 0.76
- Valence: 0.84
- Arousal: 0.71
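
A 10-fold evaluation of this kind can be sketched in plain Python. The example assumes that each score is a Pearson correlation between predictions and human ratings, averaged over folds; the source does not name the metric or the averaging. `fit_predict` is a hypothetical callback standing in for training and inference.

```python
import math
import random
from typing import Callable, List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def k_fold_indices(n_items: int, k: int = 10, seed: int = 0) -> List[List[int]]:
    """Shuffle the indices 0..n_items-1 and deal them into k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_correlation(
    texts: Sequence[str],
    ratings: Sequence[float],
    fit_predict: Callable[[List[str], List[float], List[str]], List[float]],
    k: int = 10,
    seed: int = 0,
) -> float:
    """Mean held-out Pearson correlation over k folds for one dimension."""
    scores = []
    for held_out in k_fold_indices(len(texts), k, seed):
        held = set(held_out)
        train = [i for i in range(len(texts)) if i not in held]
        # Train on k-1 folds, predict the remaining fold.
        predictions = fit_predict(
            [texts[i] for i in train],
            [ratings[i] for i in train],
            [texts[i] for i in held_out],
        )
        scores.append(pearson(predictions, [ratings[i] for i in held_out]))
    return sum(scores) / len(scores)
```

Running `cross_validated_correlation` once per emotion or dimension would produce one score per row of the list above.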

## Usage

You can use the model and tokenizer as follows:

```python
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("hplisiecki/polemo-intensity")

# Load the model
model = AutoModel.from_pretrained("hplisiecki/polemo-intensity")

# Test the model with a sample input
inputs = tokenizer("This is a test input.", return_tensors="pt")
outputs = model(**inputs)
print(outputs)
```