classla/multilingual-IPTC-news-topic-classifier

Text Classification

topic categorization

Taja Kuzman committed on Aug 8, 2024

Commit fd4cef1 · verified · 1 Parent(s): b37fe62

Update README.md

Files changed (1)

README.md +2 -3

@@ -108,8 +108,7 @@ tags:
 # Multilingual IPTC Media Topic Classifier
 Text classification model based on [`xlm-roberta-large`](https://huggingface.co/FacebookAI/xlm-roberta-large)
-and fine-tuned on a news corpus in 4 languages (Croatian, Slovenian, Catalan and Greek), automatically annotated by the OpenAI's GPT-4o
-model with the [top-level IPTC
+and fine-tuned on a news corpus in 4 languages (Croatian, Slovenian, Catalan and Greek), annotated with the [top-level IPTC
 Media Topic NewsCodes labels](https://www.iptc.org/std/NewsCodes/treeview/mediatopic/mediatopic-en-GB.html).
 The model can be used for classification into topic labels from the
@@ -198,7 +197,7 @@ and enriched with information which specific subtopics belong to the top-level t
 The model was fine-tuned on a training dataset consisting of 15,000 news in four languages (Croatian, Slovenian, Catalan and Greek).
 The news texts were extracted from the [MaCoCu web corpora](https://macocu.eu/) based on the "News" genre label, predicted with the [X-GENRE classifier](https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier).
-The training dataset was automatically annotated with the IPTC Media Topic labels by the GPT-4o model (with prediction accuracy of 0.78 and macro-F1 scores of 0.72).
+The training dataset was automatically annotated with the IPTC Media Topic labels by the [GPT-4o](https://platform.openai.com/docs/models/gpt-4o) model (with prediction accuracy of 0.78 and macro-F1 scores of 0.72).
 Label distribution in the training dataset:
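
The second hunk quotes two annotation-quality figures for GPT-4o: accuracy (0.78) and macro-F1 (0.72). The README does not show how they were computed, so the following is only a minimal sketch of the standard definitions, assuming gold and predicted labels are available as two parallel lists. The function names and the toy labels are illustrative assumptions, not taken from the model card or its evaluation code.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of items whose predicted label equals the gold label."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over every label seen in gold or pred."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1   # correct prediction for label g
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[g] += 1   # g was the true label but was missed
    f1_scores = []
    for label in set(gold) | set(pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1_scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy data (hypothetical, for illustration only).
gold = ["sport", "sport", "politics", "health"]
pred = ["sport", "politics", "politics", "health"]

print(accuracy(gold, pred))            # 0.75
print(round(macro_f1(gold, pred), 3))  # 0.778
```

Macro-F1 weights every label equally regardless of its frequency, which is one plausible reason it can come out lower than accuracy when rare labels are predicted less reliably.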