TajaKuzman committed
Update README.md
README.md CHANGED
@@ -107,7 +107,7 @@ tags:

# Multilingual IPTC Media Topic Classifier

-
+News topic classification model based on [`xlm-roberta-large`](https://huggingface.co/FacebookAI/xlm-roberta-large)
and fine-tuned on a news corpus in 4 languages (Croatian, Slovenian, Catalan and Greek), annotated with the [top-level IPTC
Media Topic NewsCodes labels](https://www.iptc.org/std/NewsCodes/treeview/mediatopic/mediatopic-en-GB.html).

@@ -115,6 +115,9 @@ The model can be used for classification into topic labels from the
[IPTC NewsCodes schema](https://iptc.org/std/NewsCodes/guidelines/#_what_are_the_iptc_newscodes) and can be
applied to any news text in a language supported by `xlm-roberta-large`.

+Based on our multilingual manually annotated test set (in Croatian, Slovenian, Catalan and Greek),
+the model achieves an accuracy of 0.836 and a macro-F1 score of 0.778 on instances predicted with a confidence level above 0.90.
+
## Intended use and limitations

For reliable results, the classifier should be applied to documents of sufficient length (the rule of thumb is at least 75 words).

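The hunk above says the classifier can be applied to any news text in a language supported by `xlm-roberta-large`, and that short documents (under roughly 75 words) give less reliable results. A minimal inference sketch with the `transformers` text-classification pipeline might look as follows; the Hub model ID, the example article, and the printed label are illustrative assumptions rather than content of this commit.

```python
# Minimal inference sketch (illustrative, not taken from the model card).
# The Hub model ID below is an assumption; replace it with the actual repository name.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="classla/multilingual-IPTC-news-topic-classifier",  # assumed model ID
)

article = "The government presented its revised budget to parliament on Tuesday ..."

# Rule of thumb from the card: documents should have at least ~75 words.
if len(article.split()) < 75:
    print("Warning: the document may be too short for reliable classification.")

# Truncate long articles to the model's maximum input length.
prediction = classifier(article, truncation=True)[0]
print(prediction)  # e.g. {"label": "politics", "score": 0.97}
```
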
@@ -228,6 +231,20 @@ The model was shown to achieve accuracy of 0.78 and macro-F1 scores of 0.72. The
| Slovenian | 0.80212 | 0.736939 | 283 |
| Greek | 0.792388 | 0.725062 | 289 |

+For downstream tasks, **we advise you to use only labels that were predicted with a confidence score
+higher than 0.90, which further improves the performance**.
+
+When we remove instances predicted with lower confidence from the test set (229 instances, i.e. 20%), the scores are the following:
+
+| Language | Accuracy | Macro-F1 | No. of instances |
+|:-------|-----------:|-----------:|-----------:|
+| All (combined) | 0.835738 | 0.778166 | 901 |
+| | | | |
+| Croatian | 0.82906 | 0.767518 | 234 |
+| Catalan | 0.836735 | 0.75111 | 196 |
+| Slovenian | 0.835443 | 0.783873 | 237 |
+| Greek | 0.84188 | 0.785525 | 234 |
+
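The added advice and the filtered-test-set table above amount to keeping only predictions whose confidence exceeds 0.90 and re-scoring the retained subset. The sketch below shows, under stated assumptions, how such an evaluation could be reproduced with `scikit-learn`; the gold/predicted labels and confidence values are made up, and the label strings are only illustrative.

```python
# Sketch of the "confidence > 0.90" evaluation described above: keep only test
# instances whose top prediction exceeds the threshold, then score the subset.
# The predictions list is a made-up example; real data would come from the model.
from sklearn.metrics import accuracy_score, f1_score

CONFIDENCE_THRESHOLD = 0.90

# (gold_label, predicted_label, confidence) triples for the test set
predictions = [
    ("politics", "politics", 0.98),
    ("sport", "lifestyle and leisure", 0.62),  # low confidence, will be dropped
    ("economy, business and finance", "economy, business and finance", 0.95),
]

retained = [(gold, pred) for gold, pred, conf in predictions if conf > CONFIDENCE_THRESHOLD]
gold_labels = [gold for gold, _ in retained]
pred_labels = [pred for _, pred in retained]

print(f"retained {len(retained)} of {len(predictions)} instances")
print("accuracy:", accuracy_score(gold_labels, pred_labels))
print("macro-F1:", f1_score(gold_labels, pred_labels, average="macro"))
```
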
### Fine-tuning hyperparameters
Fine-tuning was performed with `simpletransformers`.
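The hyperparameter values themselves are not visible in this excerpt, so the sketch below only illustrates the general `simpletransformers` fine-tuning setup named here. The file path, label encoding, and every hyperparameter value are placeholders, not the authors' settings.

```python
# Illustrative fine-tuning sketch with simpletransformers (the library named above).
# All hyperparameter values, paths and the label encoding are placeholders.
import pandas as pd
from simpletransformers.classification import ClassificationModel, ClassificationArgs

# Expected format: a DataFrame with "text" and "labels" columns, where "labels"
# holds integer-encoded top-level IPTC Media Topic classes.
train_df = pd.read_csv("train.csv")  # placeholder path

model_args = ClassificationArgs(
    num_train_epochs=5,      # placeholder value
    learning_rate=1e-5,      # placeholder value
    max_seq_length=512,
    train_batch_size=8,
    overwrite_output_dir=True,
)

model = ClassificationModel(
    "xlmroberta",
    "FacebookAI/xlm-roberta-large",
    num_labels=train_df["labels"].nunique(),
    args=model_args,
    use_cuda=True,
)

model.train_model(train_df)
```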