---
license: cc-by-sa-4.0
language:
- de
- en
- es
- da
- pl
- sv
- nl
metrics:
- accuracy
pipeline_tag: text-classification
tags:
- partypress
- political science
- parties
- press releases
---

# PARTYPRESS multilingual

Fine-tuned model in seven languages on texts from nine countries, based on [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased). Used in Erfort et al. (2023), building on the PARTYPRESS database.

## Model description

The PARTYPRESS multilingual model builds on [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) but has a supervised component. This means it was fine-tuned on texts labeled by human coders. The labels indicate 23 political issue categories derived from the Comparative Agendas Project (CAP).

## Model variations

We plan to release monolingual models for each of the languages covered by this multilingual model.

## Intended uses & limitations

The main use of the model is text classification of press releases from political parties. It may also be useful for other political texts.

The classification can then be used to measure which issues parties discuss in their communication.

### How to use

This model can be used directly with a pipeline for text classification:

```python
>>> from transformers import pipeline
>>> partypress = pipeline("text-classification", model="cornelius/partypress-multilingual", tokenizer="cornelius/partypress-multilingual")
>>> partypress("We urgently need to fight climate change and reduce carbon emissions. This is what our party stands for.")
```
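By default the pipeline returns only the single best label. A small post-processing sketch for working with the scores of several categories at once (the `scores` list below mimics the output of calling the pipeline with `top_k=None` in recent transformers versions; the label names are hypothetical):

```python
# Hypothetical pipeline output: one {label, score} dict per issue category.
scores = [
    {"label": "environment", "score": 0.61},
    {"label": "energy", "score": 0.24},
    {"label": "macroeconomics", "score": 0.05},
    {"label": "transportation", "score": 0.03},
]

def top_issues(scores, k=3, threshold=0.02):
    """Return the k most probable issue labels above a minimum score."""
    ranked = sorted(scores, key=lambda s: s["score"], reverse=True)
    return [s["label"] for s in ranked[:k] if s["score"] >= threshold]

print(top_issues(scores))  # ['environment', 'energy', 'macroeconomics']
```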

### Limitations and bias

The model was trained with data from parties in nine countries. For use in other countries, the model may need further fine-tuning; without it, performance may be lower.

The model may produce biased predictions. We discuss some biases by country, party, and over time in the release paper for the PARTYPRESS database.

## Training data

The PARTYPRESS multilingual model was fine-tuned on 27,243 press releases in seven languages from 68 European parties in nine countries. The press releases were labeled by two expert human coders per country.

For the training data of the underlying model, please refer to [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased).

## Training procedure

### Preprocessing

For the preprocessing, please refer to [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased).

### Pretraining

For the pretraining, please refer to [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased).

### Fine-tuning

## Evaluation results

Fine-tuned on our downstream task, the model achieves the following results in a five-fold cross-validation; they are comparable to the performance of our expert human coders:

| Accuracy | Precision | Recall | F1 score |
|:--------:|:---------:|:------:|:--------:|
| 69.52    | 67.99     | 67.60  | 66.77    |
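For reference, macro-averaged scores like those in the table can be computed from per-document predictions as follows (a minimal sketch with made-up labels, not the actual evaluation code):

```python
def macro_prf(true, pred):
    """Macro-averaged precision, recall, and F1 over all classes."""
    classes = sorted(set(true) | set(pred))
    ps, rs, fs = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(true, pred))
        fp = sum(t != c and p == c for t, p in zip(true, pred))
        fn = sum(t == c and p != c for t, p in zip(true, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        ps.append(prec); rs.append(rec); fs.append(f1)
    n = len(classes)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n

# Made-up gold labels and predictions for four documents.
true = ["environment", "energy", "environment", "macroeconomics"]
pred = ["environment", "environment", "environment", "macroeconomics"]
print(macro_prf(true, pred))
```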

Note that the classification task is difficult because topics such as environment and energy are often hard to keep apart.

When we aggregate the shares of text for each issue, we find that the root-mean-square error is very low (0.29).
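That aggregation check works roughly like this: per-document labels are turned into issue shares, and predicted shares are compared to human-coded shares by root-mean-square error (a sketch with made-up labels; the real comparison is described in the release paper):

```python
import math

def issue_shares(labels, categories):
    """Share of documents assigned to each issue category."""
    return [labels.count(c) / len(labels) for c in categories]

def rmse(a, b):
    """Root-mean-square error between two share vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

categories = ["environment", "energy", "macroeconomics"]  # hypothetical subset
human = issue_shares(["environment", "energy", "environment"], categories)
model = issue_shares(["environment", "energy", "energy"], categories)
print(round(rmse(human, model), 3))  # 0.272
```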

### BibTeX entry and citation info

```bibtex
@article{erfort_partypress_2023,
  author  = {Cornelius Erfort and
             Lukas F. Stoetzer and
             Heike Klüver},
  title   = {The PARTYPRESS Database: A New Comparative Database of Parties’ Press Releases},
  journal = {Research and Politics},
  volume  = {forthcoming},
  year    = {2023},
}
```