TajaKuzman committed (verified)
Commit 64dd067 · Parent(s): 2e7b1ad

Update README.md

Files changed (1): README.md (+18 −1)
README.md CHANGED
```diff
@@ -107,7 +107,7 @@ tags:
 
 # Multilingual IPTC Media Topic Classifier
 
-Text classification model based on [`xlm-roberta-large`](https://huggingface.co/FacebookAI/xlm-roberta-large)
+News topic classification model based on [`xlm-roberta-large`](https://huggingface.co/FacebookAI/xlm-roberta-large)
 and fine-tuned on a news corpus in 4 languages (Croatian, Slovenian, Catalan and Greek), annotated with the [top-level IPTC
 Media Topic NewsCodes labels](https://www.iptc.org/std/NewsCodes/treeview/mediatopic/mediatopic-en-GB.html).
 
@@ -115,6 +115,9 @@ The model can be used for classification into topic labels from the
 [IPTC NewsCodes schema](https://iptc.org/std/NewsCodes/guidelines/#_what_are_the_iptc_newscodes) and can be
 applied to news text in any language supported by `xlm-roberta-large`.
 
+Based on our multilingual manually annotated test set (in Croatian, Slovenian, Catalan and Greek),
+the model achieves an accuracy of 0.836 and a macro-F1 score of 0.778 on instances predicted with a confidence level above 0.90.
+
 ## Intended use and limitations
 
 For reliable results, the classifier should be applied to documents of sufficient length (the rule of thumb is at least 75 words).
@@ -228,6 +231,20 @@ The model was shown to achieve accuracy of 0.78 and macro-F1 scores of 0.72. The
 | Slovenian | 0.80212 | 0.736939 | 283 |
 | Greek | 0.792388 | 0.725062 | 289 |
 
+For downstream tasks, **we advise you to use only labels that were predicted with a confidence score
+higher than 0.90, which further improves the performance**.
+
+When we remove instances predicted with lower confidence from the test set (229 instances, i.e. 20%), the scores are the following:
+
+| Language | Accuracy | Macro-F1 | No. of instances |
+|:-------|-----------:|-----------:|-----------:|
+| All (combined) | 0.835738 | 0.778166 | 901 |
+| | | | |
+| Croatian | 0.82906 | 0.767518 | 234 |
+| Catalan | 0.836735 | 0.75111 | 196 |
+| Slovenian | 0.835443 | 0.783873 | 237 |
+| Greek | 0.84188 | 0.785525 | 234 |
+
 ### Fine-tuning hyperparameters
 
 Fine-tuning was performed with `simpletransformers`.
```
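
For readers applying the confidence-threshold advice added in this commit, here is a minimal sketch of what thresholding at 0.90 could look like with the `transformers` text-classification pipeline. The repo ID below is a placeholder, not the model's actual identifier, and the example texts are invented:

```python
from transformers import pipeline

# Placeholder repo ID -- substitute the actual Hugging Face model identifier.
classifier = pipeline("text-classification", model="org/multilingual-iptc-classifier")

# Invented example documents; per the README's rule of thumb, real inputs
# should be at least ~75 words long for reliable results.
texts = [
    "The parliament approved next year's national healthcare budget ...",
    "Draw results announced.",
]

for text, pred in zip(texts, classifier(texts)):
    # Keep only labels predicted with a confidence score above 0.90,
    # as the README now recommends for downstream tasks.
    if pred["score"] > 0.90:
        print(f"KEEP    {pred['label']} ({pred['score']:.2f})")
    else:
        print(f"DISCARD {pred['label']} ({pred['score']:.2f})")
```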
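
The hunk ends at the note that fine-tuning used `simpletransformers`; for orientation, this is a sketch of the typical `simpletransformers` setup for such a classifier. The hyperparameter values and the label count are illustrative assumptions, not the values documented in the README's "Fine-tuning hyperparameters" section:

```python
import pandas as pd
from simpletransformers.classification import ClassificationModel, ClassificationArgs

# Illustrative toy data: simpletransformers expects "text" and "labels" columns.
train_df = pd.DataFrame({
    "text": ["Example news article about an election ..."],
    "labels": [0],
})

# Assumed hyperparameters for illustration; the actual values are listed
# under "Fine-tuning hyperparameters" in the full README.
model_args = ClassificationArgs(
    num_train_epochs=5,
    learning_rate=1e-5,
    max_seq_length=512,
    overwrite_output_dir=True,
)

# num_labels=17 is an assumption based on the number of top-level
# IPTC Media Topic categories.
model = ClassificationModel("xlmroberta", "xlm-roberta-large",
                            num_labels=17, args=model_args)
model.train_model(train_df)
```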