TajaKuzman committed on
Commit fd4cef1 · verified · 1 Parent(s): b37fe62

Update README.md

Files changed (1)
  1. README.md +2 -3
README.md CHANGED
@@ -108,8 +108,7 @@ tags:
  # Multilingual IPTC Media Topic Classifier
 
  Text classification model based on [`xlm-roberta-large`](https://huggingface.co/FacebookAI/xlm-roberta-large)
- and fine-tuned on a news corpus in 4 languages (Croatian, Slovenian, Catalan and Greek), automatically annotated by the OpenAI's GPT-4o
- model with the [top-level IPTC
+ and fine-tuned on a news corpus in 4 languages (Croatian, Slovenian, Catalan and Greek), annotated with the [top-level IPTC
  Media Topic NewsCodes labels](https://www.iptc.org/std/NewsCodes/treeview/mediatopic/mediatopic-en-GB.html).
 
  The model can be used for classification into topic labels from the
@@ -198,7 +197,7 @@ and enriched with information which specific subtopics belong to the top-level t
 
  The model was fine-tuned on a training dataset consisting of 15,000 news in four languages (Croatian, Slovenian, Catalan and Greek).
  The news texts were extracted from the [MaCoCu web corpora](https://macocu.eu/) based on the "News" genre label, predicted with the [X-GENRE classifier](https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier).
- The training dataset was automatically annotated with the IPTC Media Topic labels by the GPT-4o model (with prediction accuracy of 0.78 and macro-F1 scores of 0.72).
+ The training dataset was automatically annotated with the IPTC Media Topic labels by the [GPT-4o](https://platform.openai.com/docs/models/gpt-4o) model (with prediction accuracy of 0.78 and macro-F1 scores of 0.72).
 
  Label distribution in the training dataset:
 
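
Note on the metrics in the changed line: the README reports accuracy 0.78 and macro-F1 0.72 for the GPT-4o annotations but does not show how they were computed. The following is a minimal sketch of the two metrics, not the authors' evaluation code. The function names `accuracy` and `macro_f1` and the toy `gold`/`pred` lists are hypothetical and exist only for illustration; they are not taken from the training dataset.

```python
from collections import Counter


def accuracy(gold, pred):
    """Share of items whose predicted label equals the gold label."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over all labels seen in gold or pred."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p was wrong
            fn[g] += 1  # gold label g was missed
    scores = []
    for label in sorted(set(gold) | set(pred)):
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (hypothetical), not drawn from the MaCoCu-based training set.
gold = ["sport", "sport", "politics", "health", "politics", "health"]
pred = ["sport", "politics", "politics", "health", "politics", "sport"]

print(f"accuracy = {accuracy(gold, pred):.2f}")
print(f"macro-F1 = {macro_f1(gold, pred):.2f}")
```

Because macro-F1 weights every label equally, it can sit below accuracy when the annotator does worse on infrequent labels, which is consistent with the gap between the two reported figures.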