Update README.md
README.md
CHANGED
@@ -277,13 +277,17 @@ trained on all three datasets, outperforms classifiers that were trained on just
 
 Additionally, we evaluated the X-GENRE classifier on a multilingual X-GINCO dataset that comprises samples
 of texts from the MaCoCu web corpora (http://hdl.handle.net/11356/1969).
-The X-GINCO dataset comprises 790 instances in 10 languages -
+The X-GINCO dataset comprises 790 manually-annotated instances in 10 languages -
 Albanian, Croatian, Catalan, Greek, Icelandic, Macedonian, Maltese, Slovenian, Turkish, and Ukrainian.
 To evaluate the performance on genre labels, the dataset is balanced by labels,
 and the vague label "Other" is not included.
 Additionally, instances that were predicted with a confidence score below 0.80 were not included in the test dataset.
+
+
 The evaluation shows high cross-lingual performance of the model,
 even when applied to languages that are not related to the training languages (English and Slovenian) and when applied on non-Latin scripts.
+
+
 The outlier is Maltese, on which classifier does not perform well -
 we presume that this is due to the fact that Maltese is not included in the pretraining data of the XLM-RoBERTa model.
 