cornelius committed on
Commit
8b3bb1d
1 Parent(s): f962a0e

Upload TFBertForSequenceClassification

Files changed (3)
  1. README.md +24 -139
  2. config.json +1 -1
  3. tf_model.h5 +1 -1
README.md CHANGED
@@ -1,163 +1,48 @@
  ---
  license: cc-by-sa-4.0
- language:
- - de
- - en
- - es
- - da
- - pl
- - sv
- - nl
- metrics:
- - accuracy
- pipeline_tag: text-classification
  tags:
- - partypress
- - political science
- - parties
- - press releases
- widget:
- - text: "Labour launches air pollution campaign Labour’s Shadow Minister for the Natural Environment, will today (Wednesday) launch Labour’s campaign against air pollution. Maria Eagle will say that 29,000 people die prematurely in the UK each year because of poor air pollution in our towns and cities - including more than 3,000 in London. Scientists have warned that air pollution in Britain’s most polluted cities is stunting the development of children’s lungs. Maria Eagle MP and Sadiq Khan MP will announce that the next Labour Government will deliver a national framework for Low Emission Zones to enable local authorities to encourage cleaner, greener, less-polluting vehicles to begin to tackle this problem. Unlike this Tory-led Government the Labour Party will devolve the power, not just the responsibility, to Local Authorities willing take action against air pollution."
- - text: "Dawn Butler MP, Labour’s Shadow\nWomen and Equalities Secretary, commenting on Equal Pay Day today, said: “From today onwards women effectively work the rest of the year for free, which means fifty days of unpaid labour until we hit 2018. “The fact that the gender pay gap has remained the same for the past three years is a shocking indictment of the Government’s failure to tackle unequal pay and the underlying structural issues that allow these disparities to exist. “Labour is the party of equality. It was Labour that introduced the Equal Pay Act in 1970 and the next Labour Government will take the necessary action to end the scourge of unequal pay once and for all."
- - text: '"Los soldados españoles que están en Afganistán cuentan con las máximas medidas de seguridad para su protección" Jesús Cuadrado, portavoz socialista en la Comisión de Defensa del Congreso, ha enviado esta tarde, en nombre del PSOE, un mensaje de apoyo y solidaridad a los soldados heridos hoy en Afganistán y a sus familias, así como a sus compañeros en esta misión internacional. Cuadrado ha resaltado el “enorme sacrificio” que supone para los soldados una misión de estas características. Un sacrificio “que está al servicio de todo los españoles”, ha explicado, “porque contribuyen a la creación de un Estado”, en un lugar que ha sido usado hasta el momento por los terroristas para cometer atentados en su país y en resto del mundo. Gracias al trabajo de “nuestros soldados”, ha añadido, “ahora hay un ejército compuesto por militares afganos y hay una policía”. Así, “el trabajo de los militares españoles está al servicio de España y de los demás países”, que participan en esta misión por mandato de la OTAN, ha recordado. “Es uno de los trabajos más solidarios y comprometidos que se pueden hacer en el mundo”, ha resaltado el portavoz socialista. La seguridad de nuestras tropas, una prioridad absoluta “La seguridad al cien por cien es imposible”, ha admitido Cuadrado. Así lo ha señalado en numerosas ocasiones tanto el Ministerio de Defensa como el resto del Gobierno. “Los riesgos que asumimos allí son muy elevados y, precisamente por eso, durante estos años, el Gobierno ha hecho de la seguridad de nuestras tropas una prioridad absoluta”. Cuadrado ha recordado que “todos los blindados que utilizan los militares españoles en Afganistán han sido renovados”. Los soldados españoles, en concreto, cuentan con 67 RG31 y 131 blindados tipo Lince. También se ha construido una nueva base y se ha dotado, a todo el material que utilizan, de las más modernas medidas de seguridad, así como de unos servicios sanitarios de alto nivel. 
“Igualmente, es conocido que se ha ido mejorando el sistema de transporte de nuestras tropas con las mejores medidas de seguridad”, concluyó el portavoz en la Comisión de Defensa.'
- - text: 'Nederland kan uit de crisis komen als we banen en structureel herstel voorop stellen. Dat zei Sybrand Buma vandaag in zijn speech op het partijcongres van het CDA in Leeuwarden. Dat was volgens hem de aanpak van Ruud Lubbers in de crisis van de jaren 80, en het was de aanpak van Jan Peter Balkenende aan het begin van deze eeuw. Beide CDA’ers waren er van overtuigd dat werk en economisch herstel niet worden gecreëerd door de overheid. Maar de overheid kan wel de omstandigheden scheppen waaronder de economie weer tot leven komt. Die visie van Lubbers en Balkenende is ook de visie van Buma en ligt ten grondslag aan het alternatief dat hij begin september presenteerde. Buma benadrukte ook dat het CDA zowel het kabinet als D66, CU en SGP zal blijven aansporen om de chronisch zieken en gehandicapten meer tegemoet te komen. Het herfstakkoord is in strijd met alles waar het CDA vanaf de jaren ’80, tot en met het Strategisch Beraad, voor staat. Het CDA wil een bloeiende samenleving met een eerlijke economie en een dienende overheid als schild voor de zwakken. Buma stond er ook nog even bij stil dat Leeuwarden voor hem een bijzondere plek is, om twee redenen. Hij is er geboren, en het is de plaats van waaruit de Elfstedentocht begint. Buma heeft zelf de Elfstedentocht gereden, bijna 30 jaar geleden, en is nog steeds rijdend lid van de vereniging. Tijdens zijn speech maakte hij bekend dat hij zijn startbewijs inleverde om ruimte te maken voor een jongere, mocht de tocht binnenkort weer gereden worden.Klik op de link op onderaan deze pagina om de hele speech te lezen.'
  ---

- # PARTYPRESS multilingual

- Fine-tuned model in seven languages on texts from nine countries (Austria, Denmark, Germany, Ireland, Netherlands, Poland, Spain, Sweden, UK), based on [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased). Used in Erfort et al. (2023), building on the PARTYPRESS database. For the downstream task of classifying press releases from political parties into 23 unique policy areas, we achieve a performance comparable to expert human coders.


  ## Model description

- The PARTYPRESS multilingual model builds on [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) but has a supervised component. This means, it was fine-tuned using texts labeled by humans. The labels indicate 23 different political issue categories derived from the Comparative Agendas Project (CAP):
-
- | Code | Issue |
- |--|-------|
- | 1 | Macroeconomics |
- | 2 | Civil Rights |
- | 3 | Health |
- | 4 | Agriculture |
- | 5 | Labor |
- | 6 | Education |
- | 7 | Environment |
- | 8 | Energy |
- | 9 | Immigration |
- | 10 | Transportation |
- | 12 | Law and Crime |
- | 13 | Social Welfare |
- | 14 | Housing |
- | 15 | Domestic Commerce |
- | 16 | Defense |
- | 17 | Technology |
- | 18 | Foreign Trade |
- | 19.1 | International Affairs |
- | 19.2 | European Union |
- | 20 | Government Operations |
- | 23 | Culture |
- | 98 | Non-thematic |
- | 99 | Other |
-
- ## Model variations
-
- We plan to release monolingual models for each of the languages covered by this multilingual model. The model can be easily extended to other languages, country contexts, or time periods by fine-tuning it with minimal additional labeled texts.

  ## Intended uses & limitations

- The main use of the model is for text classification of press releases from political parties. It may also be useful for other political texts.
-
- The classification can then be used to measure which issues parties are discussing in their communication.
-
- ### How to use
-
- This model can be used directly with a pipeline for text classification:
-
- ```python
- >>> from transformers import pipeline
- >>> tokenizer_kwargs = {'padding':True,'truncation':True,'max_length':512}
- >>> partypress = pipeline("text-classification", model = "cornelius/partypress-multilingual", tokenizer = "cornelius/partypress-multilingual", **tokenizer_kwargs)
- >>> partypress(["We urgently need to fight climate change and reduce carbon emissions. This is what our party stands for.",
- "We urge all parties to end the violence and come to the table. This conflict between the two countries must end.",
- "Así, “el trabajo de los militares españoles está al servicio de España y de los demás países”, que participan en esta misión por mandato de la OTAN, ha recordado.",
- "Dass es immer noch einen Gender-Pay-Gap gibt, geht auf das Konto dieser Regierung."])
-
- [{'label': '7 - Environment', 'score': 0.9664431810379028},
- {'label': '19.1 - International Affairs', 'score': 0.9851641654968262},
- {'label': '16 - Defense', 'score': 0.986809492111206},
- {'label': '2 - Civil Rights', 'score': 0.9799079895019531}]
-

- ```

- ### Limitations and bias
-
- The model was trained with data from parties in nine countries. For use in other countries, the model may be further fine-tuned. Without further fine-tuning, the performance of the model may be lower.
-
- The model may have biased predictions. We discuss some biases by country, party, and over time in the release paper for the PARTYPRESS database. For example, the performance is highest for press releases from Ireland (75%) and lowest for Poland (55%).
-
- ## Training data
-
- The PARTYPRESS multilingual model was fine-tuned with 27,243 press releases in seven languages on texts from 68 European parties in nine countries. The press releases were labeled by two expert human coders per country.
-
- For the training data of the underlying model, please refer to [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased)

  ## Training procedure

- ### Preprocessing
-
- For the preprocessing, please refer to [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased)
-
- ### Pretraining
-
- For the pretraining, please refer to [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased)
-
- ### Fine-tuning
-
- We fine-tuned the model using 27,243 labeled press releases from political parties in seven languages.
-
- #### Training Hyperparameters
-
- The batch size for training was 12, for testing 2, with four epochs. All other hyperparameters were the standard from the transformers library.
-
- ## Evaluation results
-
- Fine-tuned on our downstream task, this model achieves the following results in a five-fold cross validation that are comparable to the performance of our expert human coders:
-
- | Accuracy | Precision | Recall | F1 score |
- |:--------:|:---------:|:-------:|:--------:|
- | 69.52 | 67.99 | 67.60 | 66.77 |
-
- Note that the classification task is difficult because topics such as environment and energy are often difficult to keep apart.
-
- When we aggregate the shares of text for each issue, we find that the root-mean-square error is very low (0.29).
-
- ### BibTeX entry and citation info
-
- ```bibtex
- @article{erfort_partypress_2023,
- author = {Cornelius Erfort and
- Lukas F. Stoetzer and
- Heike Klüver},
- title = {The PARTYPRESS Database: A New Comparative Database of Parties’ Press Releases},
- journal = {Research and Politics},
- volume = {forthcoming},
- year = {2023},
- }
- ```
-
- ### Further resources
-
- Github: [cornelius-erfort/partypress](https://github.com/cornelius-erfort/partypress)

- Research and Politics Dataverse: [Replication Data for: The PARTYPRESS Database: A New Comparative Database of Parties’ Press Releases](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi%3A10.7910%2FDVN%2FOINX7Q)

- ## Acknowledgements

- Research for this contribution is part of the Cluster of Excellence "Contestations of the Liberal Script" (EXC 2055, Project-ID: 390715649), funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany´s Excellence Strategy. Cornelius Erfort is moreover grateful for generous funding provided by the DFG through the Research Training Group DYNAMICS (GRK 2458/1).

- ## Contact

- Cornelius Erfort
- Humboldt-Universität zu Berlin
- [corneliuserfort.de](corneliuserfort.de)

  ---
  license: cc-by-sa-4.0
  tags:
+ - generated_from_keras_callback
+ model-index:
+ - name: partypress-multilingual
+ results: []
  ---

+ <!-- This model card has been generated automatically according to the information Keras had access to. You should
+ probably proofread and complete it, then remove this comment. -->

+ # partypress-multilingual
+
+ This model is a fine-tuned version of [cornelius/partypress-multilingual](https://huggingface.co/cornelius/partypress-multilingual) on an unknown dataset.
+ It achieves the following results on the evaluation set:


  ## Model description

+ More information needed

  ## Intended uses & limitations

+ More information needed

+ ## Training and evaluation data

+ More information needed

  ## Training procedure

+ ### Training hyperparameters

+ The following hyperparameters were used during training:
+ - optimizer: None
+ - training_precision: float32

+ ### Training results



+ ### Framework versions

+ - Transformers 4.28.0
+ - TensorFlow 2.12.0
+ - Datasets 2.12.0
+ - Tokenizers 0.13.3
config.json CHANGED
@@ -1,5 +1,5 @@
  {
- "_name_or_path": "bert-base-multilingual-cased",
+ "_name_or_path": "cornelius/partypress-multilingual",
  "architectures": [
  "BertForSequenceClassification"
  ],
tf_model.h5 CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:7b54a95666573fd9b29e4d7c1e6ec4bcc92a43a078357a5d1ca6f0fa4b1f6d4f
+ oid sha256:b66a89087e0fbaaf44fb973c4fee016e3054fa7d3fcf7c513d2324e17c16e44f
  size 711772524