Roberta-eus Euscrawl base cased

This is a RoBERTa model for Basque, presented in the paper "Does corpus quality really matter for low-resource languages?". Several Basque models using the RoBERTa architecture are available, pre-trained on different corpora:

  • roberta-eus-euscrawl-base-cased: Basque RoBERTa trained on EusCrawl, a corpus built by tailored crawling of Basque websites. EusCrawl contains 12,528k documents and 423M tokens.
  • roberta-eus-euscrawl-large-cased: Basque RoBERTa large trained on EusCrawl.
  • roberta-eus-mC4-base-cased: Basque RoBERTa trained on the Basque portion of the mC4 dataset.
  • roberta-eus-CC100-base-cased: Basque RoBERTa trained on the Basque portion of the CC100 dataset.
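
The variants listed above can be selected and loaded roughly as follows. This is a minimal sketch, not an official usage recipe: it assumes the models are published on the Hugging Face Hub under the `ixa-ehu` namespace (this card does not state the namespace) and that the `transformers` library is installed. The helper names `model_id`, `load_fill_mask`, and `demo` are hypothetical and exist only for illustration.

```python
# Assumption: the Hub namespace below is not given in this card.
HUB_NAMESPACE = "ixa-ehu"

# The four variants named in this card, keyed by (corpus, size).
VARIANTS = {
    ("euscrawl", "base"): "roberta-eus-euscrawl-base-cased",
    ("euscrawl", "large"): "roberta-eus-euscrawl-large-cased",
    ("mc4", "base"): "roberta-eus-mC4-base-cased",
    ("cc100", "base"): "roberta-eus-CC100-base-cased",
}


def model_id(corpus="euscrawl", size="base"):
    """Return the (assumed) Hub identifier for a corpus/size combination."""
    key = (corpus.lower(), size.lower())
    if key not in VARIANTS:
        raise ValueError(f"No model listed for corpus={corpus!r}, size={size!r}")
    return f"{HUB_NAMESPACE}/{VARIANTS[key]}"


def load_fill_mask(corpus="euscrawl", size="base"):
    """Build a masked-language-model pipeline (downloads weights on first use)."""
    # Imported lazily so that model_id() works without transformers installed.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_id(corpus, size))


def demo():
    """Print the top predictions for a masked Basque sentence."""
    fill_mask = load_fill_mask()
    mask = fill_mask.tokenizer.mask_token  # avoids hard-coding the mask string
    for pred in fill_mask(f"Euskara {mask} bat da."):
        print(pred["token_str"], round(pred["score"], 3))
```

Calling `demo()` downloads the base EusCrawl model and prints candidate fillers with their scores. For the downstream tasks reported in this card, the model would need to be fine-tuned with a task-specific head rather than used through the fill-mask pipeline.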

The models have been evaluated on five downstream tasks for Basque: Topic classification, Sentiment analysis, Stance detection, Named Entity Recognition (NER), and Question Answering (refer to the paper for more details). A summary of the results is shown below:

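| Model                            | Topic class. | Sentiment | Stance det. | NER  | QA   | Average |
|----------------------------------|--------------|-----------|-------------|------|------|---------|
| roberta-eus-euscrawl-base-cased  | 76.2         | 77.7      | 57.4        | 86.8 | 34.6 | 66.5    |
| roberta-eus-euscrawl-large-cased | 77.6         | 78.8      | 62.9        | 87.2 | 38.3 | 69.0    |
| roberta-eus-mC4-base-cased       | 75.3         | 80.4      | 59.1        | 86.0 | 35.2 | 67.2    |
| roberta-eus-CC100-base-cased     | 76.2         | 78.8      | 63.4        | 85.2 | 35.8 | 67.9    |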
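
The Average column is consistent with an unweighted mean of the five task scores, rounded to one decimal. That is an inference from the numbers, not something this card states. A short check, using only the figures from the table:

```python
# Scores copied from the results table:
# (topic classification, sentiment, stance detection, NER, QA).
SCORES = {
    "roberta-eus-euscrawl-base-cased": (76.2, 77.7, 57.4, 86.8, 34.6),
    "roberta-eus-euscrawl-large-cased": (77.6, 78.8, 62.9, 87.2, 38.3),
    "roberta-eus-mC4-base-cased": (75.3, 80.4, 59.1, 86.0, 35.2),
    "roberta-eus-CC100-base-cased": (76.2, 78.8, 63.4, 85.2, 35.8),
}

# Average column as reported in the table.
REPORTED = {
    "roberta-eus-euscrawl-base-cased": 66.5,
    "roberta-eus-euscrawl-large-cased": 69.0,
    "roberta-eus-mC4-base-cased": 67.2,
    "roberta-eus-CC100-base-cased": 67.9,
}


def mean_score(scores):
    """Unweighted mean over tasks, rounded to one decimal place."""
    return round(sum(scores) / len(scores), 1)


averages = {name: mean_score(scores) for name, scores in SCORES.items()}

for name, avg in averages.items():
    status = "matches" if avg == REPORTED[name] else "differs from"
    print(f"{name}: {avg} ({status} reported {REPORTED[name]})")
```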

If you use any of these models, please cite the following paper:

@misc{artetxe2022euscrawl,
 title={Does corpus quality really matter for low-resource languages?},
 author={Mikel Artetxe and Itziar Aldabe and Rodrigo Agerri and
         Olatz Perez-de-Viñaspre and Aitor Soroa},
 year={2022},
 eprint={2203.08111},
 archivePrefix={arXiv},
 primaryClass={cs.CL}
}