Roberta-eus Euscrawl large cased
This is a RoBERTa model for Basque model presented in Does corpus quality really matter for low-resource languages?. There are several models for Basque using the RoBERTa architecture, using different corpora:
- roberta-eus-euscrawl-base-cased: Basque RoBERTa model trained on Euscrawl, a corpus created using tailored crawling from Basque sites. EusCrawl contains 12,528k documents and 423M tokens.
- roberta-eus-euscrawl-large-cased: RoBERTa large trained on EusCrawl.
- roberta-eus-mC4-base-cased: Basque RoBERTa model trained on the Basque portion of mc4 dataset.
- roberta-eus-CC100-base-cased: Basque RoBERTa model trained on Basque portion of cc100 dataset.
The models have been tested on five different downstream tasks for Basque: Topic classification, Sentiment analysis, Stance detection, Named Entity Recognition (NER), and Question Answering (refer to the paper for more details). See summary of results below:
Model | Topic class. | Sentiment | Stance det. | NER | QA | Average |
---|---|---|---|---|---|---|
roberta-eus-euscrawl-base-cased | 76.2 | 77.7 | 57.4 | 86.8 | 34.6 | 66.5 |
roberta-eus-euscrawl-large-cased | 77.6 | 78.8 | 62.9 | 87.2 | 38.3 | 69.0 |
roberta-eus-mC4-base-cased | 75.3 | 80.4 | 59.1 | 86.0 | 35.2 | 67.2 |
roberta-eus-CC100-base-cased | 76.2 | 78.8 | 63.4 | 85.2 | 35.8 | 67.9 |
If you use any of these models, please cite the following paper:
@misc{artetxe2022euscrawl,
title={Does corpus quality really matter for low-resource languages?},
author={Mikel Artetxe, Itziar Aldabe, Rodrigo Agerri,
Olatz Perez-de-Viñaspre, Aitor Soroa},
year={2022},
eprint={2203.08111},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
- Downloads last month
- 110
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social
visibility and check back later, or deploy to Inference Endpoints (dedicated)
instead.