Text Classification
Transformers
PyTorch
Safetensors
xlm-roberta
genre
text-genre
TajaKuzman committed
Commit c7f9bfc · verified · 1 Parent(s): 68c4a0f

Update README.md

Files changed (1)
  1. README.md +5 -1
README.md CHANGED
@@ -277,13 +277,17 @@ trained on all three datasets, outperforms classifiers that were trained on just
 
 Additionally, we evaluated the X-GENRE classifier on a multilingual X-GINCO dataset that comprises samples
 of texts from the MaCoCu web corpora (http://hdl.handle.net/11356/1969).
-The X-GINCO dataset comprises 790 instances in 10 languages -
+The X-GINCO dataset comprises 790 manually-annotated instances in 10 languages -
 Albanian, Croatian, Catalan, Greek, Icelandic, Macedonian, Maltese, Slovenian, Turkish, and Ukrainian.
 To evaluate the performance on genre labels, the dataset is balanced by labels,
 and the vague label "Other" is not included.
 Additionally, instances that were predicted with a confidence score below 0.80 were not included in the test dataset.
+
+
 The evaluation shows high cross-lingual performance of the model,
 even when applied to languages that are not related to the training languages (English and Slovenian) and when applied to non-Latin scripts.
+
+
 The outlier is Maltese, on which the classifier does not perform well -
 we presume that this is due to the fact that Maltese is not included in the pretraining data of the XLM-RoBERTa model.
 
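
The README text above mentions that instances predicted with a confidence score below 0.80 were excluded when building the X-GINCO test set. The snippet below is a minimal sketch of that filtering step using the Hugging Face `text-classification` pipeline; the model ID and the example texts are assumptions added for illustration and are not part of this commit.

```python
from transformers import pipeline

# Sketch of the confidence-threshold filtering described in the README diff above.
# The model ID is an assumption for illustration; substitute the actual X-GENRE model.
classifier = pipeline(
    "text-classification",
    model="classla/xlm-roberta-base-multilingual-text-genre-classifier",  # assumed ID
)

# Hypothetical texts sampled from a web corpus.
texts = [
    "The new regulation enters into force on 1 January and applies to all member states.",
    "Mix the flour and sugar, then bake at 180 degrees for 25 minutes.",
]

# Each prediction is a dict with the top predicted label and its confidence score.
predictions = classifier(texts)

# Keep only instances predicted with a confidence score of at least 0.80,
# mirroring the filtering applied when constructing the X-GINCO test set.
kept = [
    {"text": t, "label": p["label"], "score": p["score"]}
    for t, p in zip(texts, predictions)
    if p["score"] >= 0.80
]
print(kept)
```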