classla/multilingual-IPTC-news-topic-classifier


Taja Kuzman committed on Dec 6, 2024

Commit fc64d8f · verified · 1 Parent(s): a6bd902

Update README.md

Files changed (1)

README.md +2 -2


@@ -126,7 +126,7 @@ base_model:
 # Multilingual IPTC Media Topic Classifier
 News topic classification model based on [`xlm-roberta-large`](https://huggingface.co/FacebookAI/xlm-roberta-large)
-and fine-tuned on a news corpus in 4 languages (Croatian, Slovenian, Catalan and Greek), annotated with the [top-level IPTC
+and fine-tuned on a [news corpus in 4 languages](http://hdl.handle.net/11356/1991) (Croatian, Slovenian, Catalan and Greek), annotated with the [top-level IPTC
 Media Topic NewsCodes labels](https://www.iptc.org/std/NewsCodes/treeview/mediatopic/mediatopic-en-GB.html).
 The model can be used for classification into topic labels from the
@@ -215,7 +215,7 @@ and enriched with information which specific subtopics belong to the top-level t
 ## Training data
-The model was fine-tuned on a training dataset consisting of 15,000 news in four languages (Croatian, Slovenian, Catalan and Greek).
+The model was fine-tuned on the training split of the [EMMediaTopic 1.0 dataset](http://hdl.handle.net/11356/1991) consisting of 15,000 news in four languages (Croatian, Slovenian, Catalan and Greek).
 The news texts were extracted from the [MaCoCu-Genre web corpora](http://hdl.handle.net/11356/1969) based on the "News" genre label, predicted with the [X-GENRE classifier](https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier).
 The training dataset was automatically annotated with the IPTC Media Topic labels by
 the [GPT-4o](https://platform.openai.com/docs/models/gpt-4o) model (yielding 0.72 micro-F1 and 0.73 macro-F1 on the test dataset).
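
The README text in this diff quotes both a micro-F1 and a macro-F1 score. As a reading aid, here is a minimal sketch of how the two differ for single-label classification. It is not the authors' evaluation code: the function name `f1_scores` and the toy labels are hypothetical, and it assumes each text has exactly one gold label and one predicted label.

```python
from collections import Counter


def f1_scores(gold, pred):
    """Return (micro_f1, macro_f1) for single-label classification.

    Hypothetical helper for illustration; assumes gold and pred are
    equal-length sequences of label strings.
    """
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed

    def f1(t, fpos, fneg):
        denom = 2 * t + fpos + fneg
        return 2 * t / denom if denom else 0.0

    # Micro-F1 pools the counts over all labels before computing F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro-F1 computes F1 per label and takes the unweighted mean.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro


# Toy data, not taken from the dataset.
gold = ["sport", "sport", "politics", "health"]
pred = ["sport", "politics", "politics", "health"]
micro, macro = f1_scores(gold, pred)
print(f"micro-F1={micro:.3f} macro-F1={macro:.3f}")
```

Because macro-F1 weights every label equally, it can come out above or below micro-F1 depending on how well the rarer labels are predicted.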