TajaKuzman
committed on
Commit
fc64d8f
1
Parent(s):
a6bd902
Update README.md
README.md
CHANGED
@@ -126,7 +126,7 @@ base_model:
 # Multilingual IPTC Media Topic Classifier
 
 News topic classification model based on [`xlm-roberta-large`](https://huggingface.co/FacebookAI/xlm-roberta-large)
-and fine-tuned on a news corpus in 4 languages (Croatian, Slovenian, Catalan and Greek), annotated with the [top-level IPTC
+and fine-tuned on a [news corpus in 4 languages](http://hdl.handle.net/11356/1991) (Croatian, Slovenian, Catalan and Greek), annotated with the [top-level IPTC
 Media Topic NewsCodes labels](https://www.iptc.org/std/NewsCodes/treeview/mediatopic/mediatopic-en-GB.html).
 
 The model can be used for classification into topic labels from the
@@ -215,7 +215,7 @@ and enriched with information which specific subtopics belong to the top-level t
 
 ## Training data
 
-The model was fine-tuned on
+The model was fine-tuned on the training split of the [EMMediaTopic 1.0 dataset](http://hdl.handle.net/11356/1991) consisting of 15,000 news in four languages (Croatian, Slovenian, Catalan and Greek).
 The news texts were extracted from the [MaCoCu-Genre web corpora](http://hdl.handle.net/11356/1969) based on the "News" genre label, predicted with the [X-GENRE classifier](https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier).
 The training dataset was automatically annotated with the IPTC Media Topic labels by
 the [GPT-4o](https://platform.openai.com/docs/models/gpt-4o) model (yielding 0.72 micro-F1 and 0.73 macro-F1 on the test dataset).