Update README.md
README.md
CHANGED
@@ -23,8 +23,8 @@ tags:
 
 ## Model Description
 
-This model consists of a fine-tuned version of BgGPT-7B-Instruct-v0.2 for a propaganda detection task. It is effectively a multilabel classifier, determining wether a given propaganda text contains or not 5 predefined propaganda types.
-This model was created by [`Identrics`](https://identrics.ai/), in the scope of the
+This model is a fine-tuned version of BgGPT-7B-Instruct-v0.2 for a propaganda detection task. It is effectively a multilabel classifier, determining whether a given English-language propaganda text contains any of 5 predefined propaganda types.
+This model was created by [`Identrics`](https://identrics.ai/) as part of the WASPer project.
 
 
 ## Propaganda taxonomy
@@ -52,7 +52,7 @@ These techniques seek to influence the audience and control the conversation by
 
 ## Uses
 
-To be used as a multilabel classifier to identify if the sample text contains one or more of the five propaganda techniques mentioned above.
+To be used as a multilabel classifier that identifies whether an English text sample contains one or more of the five propaganda techniques mentioned above.
 
 ### Example
 
@@ -69,7 +69,7 @@ Then the model can be downloaded and used for inference:
 from transformers import AutoModelForSequenceClassification, AutoTokenizer
 
 model = AutoModelForSequenceClassification.from_pretrained("identrics/BG_propaganda_classifier", num_labels=5)
-tokenizer = AutoTokenizer.from_pretrained("identrics/
+tokenizer = AutoTokenizer.from_pretrained("identrics/EN_propaganda_classifier")
 
 tokens = tokenizer("Our country is the most powerful country in the world!", return_tensors="pt")
 output = model(**tokens)
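The example stops at raw logits (`print(output.logits)` in the next hunk's context). Because this is a multilabel head, each of the five scores is independent, so the natural decoding step is a per-label sigmoid with a cutoff rather than a softmax. A minimal sketch of that post-processing, continuing from the snippet above (the 0.5 threshold and the numeric label ids are assumptions; the card does not spell out the label mapping):

```python
import torch

# Per-label sigmoid: in a multilabel setup each label is scored independently.
probs = torch.sigmoid(output.logits)[0]

# 0.5 is an assumed cutoff, not documented in the card; tune it on validation data.
detected = [i for i, p in enumerate(probs.tolist()) if p > 0.5]

print({f"label_{i}": round(p, 3) for i, p in enumerate(probs.tolist())})
print("techniques detected (label ids):", detected)
```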
@@ -91,12 +91,11 @@ print(output.logits)
 
 ## Training Details
 
-The training datasets for the model consist of a balanced set totaling 734 Bulgarian examples that include both propaganda and non-propaganda content. These examples are collected from a variety of traditional media and social media sources, ensuring a diverse range of content. Aditionally, the training dataset is enriched with AI-generated samples. The total distribution of the training data is shown in the table below:
-
-
-
-
-The model was then tested on a smaller evaluation dataset, achieving an f1 score of
-
-
+
+During the training stage, the multi-label classifier is trained on the different propaganda types using a dataset that includes both real and artificially generated samples. For English, this amounts to 214 organic and 206 synthetic examples.
+The data was annotated by domain experts according to our predetermined taxonomy of five primary categories. Some examples fall under a single category, while others belong to several, reflecting the complex structure of propaganda, where multiple techniques can appear within a single text.
+
+
+
+The model was then tested on a smaller evaluation dataset, achieving an f1 score of
+
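The card stops short of the training code itself. For orientation, the standard transformers recipe for a five-way multilabel head is to load the base checkpoint with `problem_type="multi_label_classification"`, which switches the loss to BCEWithLogitsLoss over multi-hot float labels. A minimal sketch under that assumption (the hub id for the base model, the pad-token handling, and the example labels are illustrative, not taken from this card):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed hub id for the base model named in the card.
base = "INSAIT-Institute/BgGPT-7B-Instruct-v0.2"

model = AutoModelForSequenceClassification.from_pretrained(
    base,
    num_labels=5,
    problem_type="multi_label_classification",  # selects BCEWithLogitsLoss
)
tokenizer = AutoTokenizer.from_pretrained(base)
if tokenizer.pad_token is None:  # decoder-only checkpoints often lack a pad token
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = tokenizer.pad_token_id

# Multi-hot float labels: a single text can carry several techniques at once.
batch = tokenizer(["Our country is the most powerful country in the world!"],
                  return_tensors="pt", padding=True, truncation=True)
labels = torch.tensor([[0.0, 1.0, 0.0, 0.0, 1.0]])  # illustrative labels only

loss = model(**batch, labels=labels).loss  # BCE-with-logits under the hood
loss.backward()  # in practice this would sit inside a Trainer / PEFT loop
```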