---
language:
- ca
license: apache-2.0
tags:
- catalan
- masked-lm
- distilroberta
widget:
- text: El Català és una llengua molt <mask>.
- text: Salvador Dalí va viure a <mask>.
- text: La Costa Brava té les millors <mask> d'Espanya.
- text: El cacaolat és un batut de <mask>.
- text: <mask> és la capital de la Garrotxa.
- text: Vaig al <mask> a buscar bolets.
- text: Antoni Gaudí vas ser un <mask> molt important per la ciutat.
- text: Catalunya és una referència en <mask> a nivell europeu.
---

# DistilRoBERTa-base-ca

## Model description

This model is a distilled version of [projecte-aina/roberta-base-ca-v2](https://huggingface.co/projecte-aina/roberta-base-ca-v2). It follows the same training procedure as [DistilBERT](https://arxiv.org/abs/1910.01108), using the Knowledge Distillation implementation from the paper's [official repository](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation). The resulting architecture has 6 layers, 768-dimensional embeddings, and 12 attention heads, for a total of 82M parameters, considerably fewer than the 125M of a standard RoBERTa-base model. This makes the model lighter and faster than the original, at the cost of slightly lower performance.

## Training

### Training procedure

This model was trained with Knowledge Distillation, a technique for shrinking networks to a reasonable size while minimizing the loss in performance. It consists in distilling a large language model (the teacher) into a more lightweight, energy-efficient, and production-friendly model (the student): in this "teacher-student" setup, a relatively small student model is trained to mimic the behavior of the larger teacher model. As a result, the student has lower inference time and can run on commodity hardware.
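As in DistilBERT, the student is optimized on the teacher's soft predictions in addition to the usual masked-language-modeling objective (the official distillation code also adds a cosine-embedding loss between teacher and student hidden states). The snippet below is only a minimal sketch of such a combined loss in PyTorch; the temperature and loss weights are illustrative placeholders, not the exact values used to train this model.

```python
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha_ce=0.5, alpha_mlm=0.5):
    """Soft-target distillation loss combined with the hard MLM loss.

    student_logits, teacher_logits: (batch, seq_len, vocab_size)
    labels: (batch, seq_len), with -100 at non-masked positions.
    temperature, alpha_ce and alpha_mlm are illustrative hyperparameters.
    """
    # Soft targets: KL divergence between temperature-scaled distributions,
    # scaled by T^2 to keep gradient magnitudes comparable. For simplicity it
    # is computed over all positions here; the official implementation
    # restricts it to the masked tokens.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    loss_ce = F.kl_div(soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2

    # Hard targets: standard masked-language-modeling cross-entropy
    # (positions labeled -100 are ignored).
    vocab_size = student_logits.size(-1)
    loss_mlm = F.cross_entropy(student_logits.reshape(-1, vocab_size),
                               labels.reshape(-1), ignore_index=-100)

    return alpha_ce * loss_ce + alpha_mlm * loss_mlm
```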
### Training data

The training corpus consists of several corpora gathered from web crawling and public sources, as shown in the table below:

| Corpus                   | Size (GB) |
|--------------------------|----------:|
| Catalan Crawling         |     13.00 |
| RacoCatalá               |      8.10 |
| Catalan Oscar            |      4.00 |
| CaWaC                    |      3.60 |
| Cat. General Crawling    |      2.50 |
| Wikipedia                |      1.10 |
| DOGC                     |      0.78 |
| Padicat                  |      0.63 |
| ACN                      |      0.42 |
| Nació Digital            |      0.42 |
| Cat. Government Crawling |      0.24 |
| Vilaweb                  |      0.06 |
| Catalan Open Subtitles   |      0.02 |
| Tweets                   |      0.02 |

## Evaluation

### Evaluation benchmark

This model has been fine-tuned on the downstream tasks of the [Catalan Language Understanding Evaluation benchmark (CLUB)](https://club.aina.bsc.es/), which includes the following datasets:

| Dataset   | Task | Total   | Train   | Dev    | Test   |
|:----------|:-----|--------:|--------:|-------:|-------:|
| AnCora    | NER  | 13,581  | 10,628  | 1,427  | 1,526  |
| AnCora    | POS  | 16,678  | 13,123  | 1,709  | 1,846  |
| STS-ca    | STS  | 3,073   | 2,073   | 500    | 500    |
| TeCla     | TC   | 137,775 | 110,203 | 13,786 | 13,786 |
| TE-ca     | RTE  | 21,163  | 16,930  | 2,116  | 2,117  |
| CatalanQA | QA   | 21,427  | 17,135  | 2,157  | 2,135  |
| XQuAD-ca  | QA   | -       | -       | -      | 1,189  |

### Evaluation results

This is how the distilled model compares to its teacher when fine-tuned on the aforementioned downstream tasks:

| Model \ Task          | NER (F1) | POS (F1) | STS-ca (Comb.) | TeCla (Acc.) | TE-ca (Acc.) | CatalanQA (F1/EM) | XQuAD-ca ¹ (F1/EM) |
|:----------------------|:---------|:---------|:---------------|:-------------|:-------------|:------------------|:-------------------|
| RoBERTa-base-ca-v2    | 89.29    | 98.96    | 79.07          | 74.26        | 83.14        | 89.50/76.63       | 73.64/55.42        |
| DistilRoBERTa-base-ca | 87.88    | 98.83    | 77.26          | 73.20        | 76.00        | 84.07/70.77       | 62.93/45.08        |

¹: Trained on CatalanQA, tested on XQuAD-ca (no train set).
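Despite the moderate drop with respect to the teacher, the distilled model is used in exactly the same way, for example for masked-token prediction with the `fill-mask` pipeline, as in the widget examples above. The sketch below assumes the checkpoint is published under the repository id `projecte-aina/distilroberta-base-ca-v2`; replace it with the actual id of this model on the Hugging Face Hub.

```python
from transformers import pipeline

# Assumed Hub id; replace with the actual repository id of this model.
model_id = "projecte-aina/distilroberta-base-ca-v2"

# Fill-mask pipeline: predicts the most likely tokens for the <mask> position.
fill_mask = pipeline("fill-mask", model=model_id)

predictions = fill_mask("El Català és una llengua molt <mask>.")
for pred in predictions:
    print(f"{pred['token_str']}\t{pred['score']:.4f}")
```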