TajaKuzman committed (verified)
Commit 64dd067 · Parent(s): 2e7b1ad

Update README.md

Files changed (1): README.md (+18 −1)
README.md CHANGED
```diff
@@ -107,7 +107,7 @@ tags:
 
 # Multilingual IPTC Media Topic Classifier
 
-Text classification model based on [`xlm-roberta-large`](https://huggingface.co/FacebookAI/xlm-roberta-large)
+News topic classification model based on [`xlm-roberta-large`](https://huggingface.co/FacebookAI/xlm-roberta-large)
 and fine-tuned on a news corpus in 4 languages (Croatian, Slovenian, Catalan and Greek), annotated with the [top-level IPTC
 Media Topic NewsCodes labels](https://www.iptc.org/std/NewsCodes/treeview/mediatopic/mediatopic-en-GB.html).
 
@@ -115,6 +115,9 @@ The model can be used for classification into topic labels from the
 [IPTC NewsCodes schema](https://iptc.org/std/NewsCodes/guidelines/#_what_are_the_iptc_newscodes) and can be
 applied to news text in any language supported by `xlm-roberta-large`.
 
+Based on our multilingual manually annotated test set (in Croatian, Slovenian, Catalan and Greek),
+the model achieves an accuracy of 0.836 and a macro-F1 score of 0.778 on instances predicted with a confidence level above 0.90.
+
 ## Intended use and limitations
 
 For reliable results, the classifier should be applied to documents of sufficient length (the rule of thumb is at least 75 words).
@@ -228,6 +231,20 @@ The model was shown to achieve accuracy of 0.78 and macro-F1 scores of 0.72. The
 | Slovenian | 0.80212 | 0.736939 | 283 |
 | Greek | 0.792388 | 0.725062 | 289 |
 
+For downstream tasks, **we advise you to use only labels that were predicted with a confidence score
+higher than 0.90, which further improves the performance**.
+
+When we remove instances predicted with lower confidence from the test set (229 instances, i.e. 20%), the scores are the following:
+
+| Language | Accuracy | Macro-F1 | No. of instances |
+|:-------|-----------:|-----------:|-----------:|
+| All (combined) | 0.835738 | 0.778166 | 901 |
+| | | | |
+| Croatian | 0.82906 | 0.767518 | 234 |
+| Catalan | 0.836735 | 0.75111 | 196 |
+| Slovenian | 0.835443 | 0.783873 | 237 |
+| Greek | 0.84188 | 0.785525 | 234 |
+
 ### Fine-tuning hyperparameters
 
 Fine-tuning was performed with `simpletransformers`.
```
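
For readers applying the confidence-threshold advice added in this commit, here is a minimal sketch of what thresholding at 0.90 could look like with the `transformers` text-classification pipeline. The repo ID below is a placeholder, not the model's actual identifier, and the example texts are invented:

```python
from transformers import pipeline

# Placeholder repo ID -- substitute the actual Hugging Face model identifier.
classifier = pipeline("text-classification", model="org/multilingual-iptc-classifier")

# Invented example documents; per the README's rule of thumb, real inputs
# should be at least ~75 words long for reliable results.
texts = [
    "The parliament approved next year's national healthcare budget ...",
    "Draw results announced.",
]

for text, pred in zip(texts, classifier(texts)):
    # Keep only labels predicted with a confidence score above 0.90,
    # as the README now recommends for downstream tasks.
    if pred["score"] > 0.90:
        print(f"KEEP    {pred['label']} ({pred['score']:.2f})")
    else:
        print(f"DISCARD {pred['label']} ({pred['score']:.2f})")
```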
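
The hunk ends at the note that fine-tuning used `simpletransformers`; for orientation, this is a sketch of the typical `simpletransformers` setup for such a classifier. The hyperparameter values and the label count are illustrative assumptions, not the values documented in the README's "Fine-tuning hyperparameters" section:

```python
import pandas as pd
from simpletransformers.classification import ClassificationModel, ClassificationArgs

# Illustrative toy data: simpletransformers expects "text" and "labels" columns.
train_df = pd.DataFrame({
    "text": ["Example news article about an election ..."],
    "labels": [0],
})

# Assumed hyperparameters for illustration; the actual values are listed
# under "Fine-tuning hyperparameters" in the full README.
model_args = ClassificationArgs(
    num_train_epochs=5,
    learning_rate=1e-5,
    max_seq_length=512,
    overwrite_output_dir=True,
)

# num_labels=17 is an assumption based on the number of top-level
# IPTC Media Topic categories.
model = ClassificationModel("xlmroberta", "xlm-roberta-large",
                            num_labels=17, args=model_args)
model.train_model(train_df)
```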