TajaKuzman committed (verified)
Commit 8c99438 · 1 Parent(s): 26db48a

Update README.md

Files changed (1): README.md (+47 −25)
README.md CHANGED
@@ -134,8 +134,9 @@ The model can be used for classification into topic labels from the
 applied to any news text in a language supported by `xlm-roberta-large`.
 
 Based on a manually-annotated test set (in Croatian, Slovenian, Catalan and Greek),
-the model achieves micro-F1 score of 0.734, macro-F1 score of 0.746 and accuracy of 0.734,
-and outperforms the GPT-4o model (version `gpt-4o-2024-05-13`) used in a zero-shot setting.
+the model achieves a micro-F1 score of 0.733, a macro-F1 score of 0.745 and an accuracy of 0.733,
+and outperforms the GPT-4o model (version `gpt-4o-2024-05-13`) used in a zero-shot setting.
+If we use only labels that are predicted with a confidence score equal to or higher than 0.90, the model achieves a micro-F1 and macro-F1 of 0.80.
 
 ## Intended use and limitations
 
@@ -241,33 +242,54 @@ Label distribution in the training dataset:
 
 ## Performance
 
-The model was evaluated on a manually-annotated test set in four languages (Croatian, Slovenian, Catalan and Greek), consisting of 1,130 instances.
+The model was evaluated on a manually-annotated test set in four languages (Croatian, Slovenian, Catalan and Greek),
+consisting of 1,129 instances.
 The test set contains similar amounts of text from the four languages and is more or less balanced across labels.
 
-The model was shown to achieve accuracy of 0.78 and macro-F1 scores of 0.72. The results for the entire test set and per language:
-
-| Language | Accuracy | Macro-F1 | No. of instances |
-|:-------|-----------:|-----------:|-----------:|
-| All (combined) | 0.784071 | 0.723079 | 1130 |
-| | | | |
-| Croatian | 0.786942 | 0.732721 | 291 |
-| Catalan | 0.752809 | 0.676812 | 267 |
-| Slovenian | 0.80212 | 0.736939 | 283 |
-| Greek | 0.792388 | 0.725062 | 289 |
+The model achieves a micro-F1 score of 0.733 and a macro-F1 score of 0.745. The results for the entire test set and per language:
+
+| Language | Micro-F1 | Macro-F1 | Accuracy | No. of instances |
+|:---|-----------:|-----------:|-----------:|-----------:|
+| All (combined) | 0.733392 | 0.744633 | 0.733392 | 1129 |
+| Croatian | 0.728522 | 0.733725 | 0.728522 | 291 |
+| Catalan | 0.715356 | 0.722304 | 0.715356 | 267 |
+| Slovenian | 0.758865 | 0.764784 | 0.758865 | 282 |
+| Greek | 0.730104 | 0.742099 | 0.730104 | 289 |
+
+Performance per label:
+
+| Label | Precision | Recall | F1-score | Support |
+|:------------------------------------------|------------:|---------:|-----------:|----------:|
+| arts, culture, entertainment and media | 0.602 | 0.875 | 0.713 | 64 |
+| conflict, war and peace | 0.611 | 0.917 | 0.733 | 36 |
+| crime, law and justice | 0.862 | 0.812 | 0.836 | 69 |
+| disaster, accident and emergency incident | 0.691 | 0.887 | 0.777 | 53 |
+| economy, business and finance | 0.779 | 0.508 | 0.615 | 118 |
+| education | 0.847 | 0.735 | 0.787 | 68 |
+| environment | 0.589 | 0.754 | 0.662 | 57 |
+| health | 0.797 | 0.797 | 0.797 | 59 |
+| human interest | 0.552 | 0.673 | 0.607 | 55 |
+| labour | 0.855 | 0.831 | 0.843 | 71 |
+| lifestyle and leisure | 0.769 | 0.465 | 0.580 | 86 |
+| politics | 0.568 | 0.735 | 0.641 | 68 |
+| religion | 0.842 | 0.941 | 0.889 | 51 |
+| science and technology | 0.638 | 0.800 | 0.710 | 55 |
+| society | 0.918 | 0.500 | 0.647 | 112 |
+| sport | 0.824 | 0.968 | 0.891 | 63 |
+| weather | 0.932 | 0.932 | 0.932 | 44 |
 
 For downstream tasks, **we advise you to use only labels that were predicted with a confidence score
-higher than 0.90 which further improves the performance**.
-
-When we remove instances predicted with lower confidence (229 instances - 20%), the scores are the following:
-
-| Language | Accuracy | Macro-F1 | No. of instances |
-|:-------|-----------:|-----------:|-----------:|
-| All (combined) | 0.835738 | 0.778166 | 901 |
-| | | | |
-| Croatian | 0.82906 | 0.767518 | 234 |
-| Catalan | 0.836735 | 0.75111 | 196 |
-| Slovenian | 0.835443 | 0.783873 | 237 |
-| Greek | 0.84188 | 0.785525 | 234 |
+higher than or equal to 0.90, which further improves the performance**.
+
+When we remove the instances predicted with lower confidence (229 instances, 20%), the model yields a micro-F1 of 0.798 and a macro-F1 of 0.80.
+
+| Language | Micro-F1 | Macro-F1 | Accuracy |
+|:---|-----------:|-----------:|-----------:|
+| All (combined) | 0.797777 | 0.802403 | 0.797777 |
+| Croatian | 0.773504 | 0.772084 | 0.773504 |
+| Catalan | 0.811224 | 0.806885 | 0.811224 |
+| Slovenian | 0.805085 | 0.804491 | 0.805085 |
+| Greek | 0.803419 | 0.809598 | 0.803419 |
 
 ## Fine-tuning hyperparameters
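
In the updated tables, the Micro-F1 and Accuracy columns are always identical: for single-label multiclass classification, micro-averaged F1 reduces exactly to accuracy, while macro-F1 weights every class equally regardless of its support. A minimal pure-Python sketch of this relationship, using toy labels rather than the actual test set:

```python
from collections import Counter

def micro_macro_f1(y_true, y_pred):
    """Compute micro- and macro-averaged F1 for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    # Micro: pool counts over all classes. With exactly one label per
    # instance, total FP == total FN, so micro-F1 collapses to accuracy.
    tp_all, fp_all, fn_all = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro = 2 * tp_all / (2 * tp_all + fp_all + fn_all)
    # Macro: unweighted mean of per-class F1 scores.
    f1s = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        f1s.append(2 * tp[c] / denom if denom else 0.0)
    macro = sum(f1s) / len(f1s)
    return micro, macro

# Toy data (illustrative only):
y_true = ["sport", "health", "sport", "weather", "health"]
y_pred = ["sport", "sport", "sport", "weather", "health"]
micro, macro = micro_macro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
assert micro == accuracy  # micro-F1 == accuracy for single-label tasks
```

This is why the report quotes micro-F1 and accuracy as the same number, while macro-F1 diverges whenever minority classes (e.g. `conflict, war and peace`, support 36) perform differently from majority ones.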
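
The confidence-threshold advice above can be sketched as a small post-processing step. This is a hypothetical helper, assuming predictions arrive as `(label, score)` pairs (a Hugging Face `text-classification` pipeline returns equivalent `label`/`score` dicts); 0.90 is the inclusive threshold recommended in the model card:

```python
# Hypothetical post-processing: keep only predictions whose confidence
# score is >= 0.90, as the model card recommends for downstream use.
CONFIDENCE_THRESHOLD = 0.90

def filter_confident(predictions, threshold=CONFIDENCE_THRESHOLD):
    """predictions: list of (label, score) pairs; returns the confident subset."""
    return [(label, score) for label, score in predictions if score >= threshold]

# Toy predictions (illustrative values, not actual model output):
predictions = [
    ("sport", 0.98),
    ("politics", 0.62),   # discarded: below the 0.90 threshold
    ("weather", 0.93),
    ("society", 0.90),    # kept: the threshold is inclusive
]
confident = filter_confident(predictions)
```

In a real setup the pairs would come from the fine-tuned model, and the low-confidence instances would be discarded or routed to manual review; on this test set the threshold removed 229 instances (about 20%) and raised micro-F1 from 0.733 to 0.798.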