---
language:
- ca
license: apache-2.0
tags:
- catalan
- masked-lm
- distilroberta
widget:
- text: El Català és una llengua molt <mask>.
- text: Salvador Dalí va viure a <mask>.
- text: La Costa Brava té les millors <mask> d'Espanya.
- text: El cacaolat és un batut de <mask>.
- text: <mask> és la capital de la Garrotxa.
- text: Vaig al <mask> a buscar bolets.
- text: Antoni Gaudí vas ser un <mask> molt important per la ciutat.
- text: Catalunya és una referència en <mask> a nivell europeu.
---

# DistilRoBERTa-base-ca

## Model description

This model is a distilled version of [projecte-aina/roberta-base-ca-v2](https://huggingface.co/projecte-aina/roberta-base-ca-v2). It follows the same training procedure as [DistilBERT](https://arxiv.org/abs/1910.01108), using the Knowledge Distillation implementation from the paper's [official repository](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation). The resulting architecture has 6 layers, 768-dimensional embeddings, and 12 attention heads, for a total of 82M parameters, considerably fewer than the 125M of a standard RoBERTa-base model. This makes the model lighter and faster than the original, at the cost of slightly lower performance.

## Training

### Training procedure

This model was trained with Knowledge Distillation, a technique for shrinking networks to a reasonable size while minimizing the loss in performance. It consists in distilling a large language model (the teacher) into a more lightweight, energy-efficient, and production-friendly model (the student): in this "teacher-student" setup, a relatively small student model is trained to mimic the behavior of the larger teacher model. As a result, the student has lower inference time and can run on commodity hardware.
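As in DistilBERT, the student is optimized on the teacher's soft predictions in addition to the usual masked-language-modeling objective (the official distillation code also adds a cosine-embedding loss between teacher and student hidden states). The snippet below is only a minimal sketch of such a combined loss in PyTorch; the temperature and loss weights are illustrative placeholders, not the exact values used to train this model.

```python
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha_ce=0.5, alpha_mlm=0.5):
    """Soft-target distillation loss combined with the hard MLM loss.

    student_logits, teacher_logits: (batch, seq_len, vocab_size)
    labels: (batch, seq_len), with -100 at non-masked positions.
    temperature, alpha_ce and alpha_mlm are illustrative hyperparameters.
    """
    # Soft targets: KL divergence between temperature-scaled distributions,
    # scaled by T^2 to keep gradient magnitudes comparable. For simplicity it
    # is computed over all positions here; the official implementation
    # restricts it to the masked tokens.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    loss_ce = F.kl_div(soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2

    # Hard targets: standard masked-language-modeling cross-entropy
    # (positions labeled -100 are ignored).
    vocab_size = student_logits.size(-1)
    loss_mlm = F.cross_entropy(student_logits.reshape(-1, vocab_size),
                               labels.reshape(-1), ignore_index=-100)

    return alpha_ce * loss_ce + alpha_mlm * loss_mlm
```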
### Training data

The training corpus consists of several corpora gathered from web crawling and public sources, as shown in the table below:

| Corpus                   | Size (GB) |
|--------------------------|----------:|
| Catalan Crawling         |     13.00 |
| RacoCatalá               |      8.10 |
| Catalan Oscar            |      4.00 |
| CaWaC                    |      3.60 |
| Cat. General Crawling    |      2.50 |
| Wikipedia                |      1.10 |
| DOGC                     |      0.78 |
| Padicat                  |      0.63 |
| ACN                      |      0.42 |
| Nació Digital            |      0.42 |
| Cat. Government Crawling |      0.24 |
| Vilaweb                  |      0.06 |
| Catalan Open Subtitles   |      0.02 |
| Tweets                   |      0.02 |

## Evaluation

### Evaluation benchmark

This model has been fine-tuned on the downstream tasks of the [Catalan Language Understanding Evaluation benchmark (CLUB)](https://club.aina.bsc.es/), which includes the following datasets:

| Dataset   | Task | Total   | Train   | Dev    | Test   |
|:----------|:-----|--------:|--------:|-------:|-------:|
| AnCora    | NER  | 13,581  | 10,628  | 1,427  | 1,526  |
| AnCora    | POS  | 16,678  | 13,123  | 1,709  | 1,846  |
| STS-ca    | STS  | 3,073   | 2,073   | 500    | 500    |
| TeCla     | TC   | 137,775 | 110,203 | 13,786 | 13,786 |
| TE-ca     | RTE  | 21,163  | 16,930  | 2,116  | 2,117  |
| CatalanQA | QA   | 21,427  | 17,135  | 2,157  | 2,135  |
| XQuAD-ca  | QA   | -       | -       | -      | 1,189  |

### Evaluation results

This is how the distilled model compares to its teacher when fine-tuned on the aforementioned downstream tasks:

| Model \ Task          | NER (F1) | POS (F1) | STS-ca (Comb.) | TeCla (Acc.) | TE-ca (Acc.) | CatalanQA (F1/EM) | XQuAD-ca ¹ (F1/EM) |
|:----------------------|:---------|:---------|:---------------|:-------------|:-------------|:------------------|:-------------------|
| RoBERTa-base-ca-v2    | 89.29    | 98.96    | 79.07          | 74.26        | 83.14        | 89.50/76.63       | 73.64/55.42        |
| DistilRoBERTa-base-ca | 87.88    | 98.83    | 77.26          | 73.20        | 76.00        | 84.07/70.77       | 62.93/45.08        |

¹: Trained on CatalanQA, tested on XQuAD-ca (no train set).
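Despite the moderate drop with respect to the teacher, the distilled model is used in exactly the same way, for example for masked-token prediction with the `fill-mask` pipeline, as in the widget examples above. The sketch below assumes the checkpoint is published under the repository id `projecte-aina/distilroberta-base-ca-v2`; replace it with the actual id of this model on the Hugging Face Hub.

```python
from transformers import pipeline

# Assumed Hub id; replace with the actual repository id of this model.
model_id = "projecte-aina/distilroberta-base-ca-v2"

# Fill-mask pipeline: predicts the most likely tokens for the <mask> position.
fill_mask = pipeline("fill-mask", model=model_id)

predictions = fill_mask("El Català és una llengua molt <mask>.")
for pred in predictions:
    print(f"{pred['token_str']}\t{pred['score']:.4f}")
```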