|
--- |
|
language: |
|
- es |
|
|
license: apache-2.0 |
|
datasets: |
|
- oscar |
|
--- |
|
|
|
# SELECTRA: A Spanish ELECTRA |
|
|
|
SELECTRA is a Spanish pre-trained language model based on [ELECTRA](https://github.com/google-research/electra). |
|
We release a `small` and a `medium` version with the following configurations:
|
|
|
| Model | Layers | Embedding/Hidden Size | Params | Vocab Size | Max Sequence Length | Cased | |
|
| --- | --- | --- | --- | --- | --- | --- | |
|
| [SELECTRA small](https://huggingface.co/Recognai/selectra_small) | 12 | 256 | 22M | 50k | 512 | True | |
|
| **SELECTRA medium** | **12** | **384** | **41M** | **50k** | **512** | **True** | |
|
|
|
**SELECTRA small (medium) is about 5 (3) times smaller than BETO but achieves comparable results** (see Metrics section below). |
|
|
|
## Usage |
|
|
|
From the original [ELECTRA model card](https://huggingface.co/google/electra-small-discriminator): "ELECTRA models are trained to distinguish "real" input tokens vs "fake" input tokens generated by another neural network, similar to the discriminator of a GAN." |
|
The discriminator should therefore output a high logit for the fake input token, as the following example demonstrates:
|
|
|
```python |
|
from transformers import ElectraForPreTraining, ElectraTokenizerFast |
|
|
|
discriminator = ElectraForPreTraining.from_pretrained("Recognai/selectra_small") |
|
tokenizer = ElectraTokenizerFast.from_pretrained("Recognai/selectra_small") |
|
|
|
# "rosa" is the planted fake token; the discriminator assigns it the only positive logit below
sentence_with_fake_token = "Estamos desayunando pan rosa con tomate y aceite de oliva."
|
|
|
inputs = tokenizer.encode(sentence_with_fake_token, return_tensors="pt") |
|
logits = discriminator(inputs).logits.tolist()[0] |
|
|
|
print("\t".join(tokenizer.tokenize(sentence_with_fake_token))) |
|
print("\t".join(map(lambda x: str(x)[:4], logits[1:-1]))) |
|
"""Output: |
|
Estamos desayun ##ando pan rosa con tomate y aceite de oliva . |
|
-3.1 -3.6 -6.9 -3.0 0.19 -4.5 -3.3 -5.1 -5.7 -7.7 -4.4 -4.2 |
|
""" |
|
``` |
|
|
|
Most likely, however, you will want to fine-tune this model on a downstream task.

We provide models fine-tuned on the [XNLI dataset](https://huggingface.co/datasets/xnli), which can be used together with the zero-shot classification pipeline (see the example below):
|
|
|
- [Zero-shot SELECTRA small](https://huggingface.co/Recognai/zeroshot_selectra_small) |
|
- [Zero-shot SELECTRA medium](https://huggingface.co/Recognai/zeroshot_selectra_medium) |
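
As a quick sketch of how these can be used, here is the standard `zero-shot-classification` pipeline; the example sentence, candidate labels, and Spanish hypothesis template below are illustrative choices, not values prescribed by the models:

```python
from transformers import pipeline

# Load one of the fine-tuned zero-shot models (the medium version is used here as an example)
classifier = pipeline("zero-shot-classification", model="Recognai/zeroshot_selectra_medium")

# The sentence, candidate labels, and Spanish hypothesis template are illustrative choices
classifier(
    "El equipo ganó el partido en el último minuto",
    candidate_labels=["deportes", "política", "cultura"],
    hypothesis_template="Este ejemplo es {}.",
)
```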
|
|
|
## Metrics |
|
|
|
We fine-tune our models on three different downstream tasks:
|
|
|
- [XNLI](https://huggingface.co/datasets/xnli) |
|
- [PAWS-X](https://huggingface.co/datasets/paws-x) |
|
- [CoNLL2002 - NER](https://huggingface.co/datasets/conll2002) |
|
|
|
For each task, we conduct 5 trials and report the mean and standard deviation of the metrics in the table below.
|
To compare our results to other Spanish language models, we provide the same metrics taken from the [evaluation table](https://github.com/PlanTL-SANIDAD/lm-spanish#evaluation-) of the [Spanish Language Model](https://github.com/PlanTL-SANIDAD/lm-spanish) repo. |
|
|
|
| Model | CoNLL2002 - NER (f1) | PAWS-X (acc) | XNLI (acc) | Params | |
|
| --- | --- | --- | --- | --- | |
|
| SELECTRA small | 0.865 ± 0.004 | 0.896 ± 0.002 | 0.784 ± 0.002 | 22M |

| SELECTRA medium | 0.873 ± 0.003 | 0.896 ± 0.002 | 0.804 ± 0.002 | 41M |
|
| | | | | | |
|
| [mBERT](https://huggingface.co/bert-base-multilingual-cased) | 0.8691 | 0.8955 | 0.7876 | 178M | |
|
| [BETO](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased) | 0.8759 | 0.9000 | 0.8130 | 110M | |
|
| [RoBERTa-b](https://huggingface.co/BSC-TeMU/roberta-base-bne) | 0.8851 | 0.9000 | 0.8016 | 125M | |
|
| [RoBERTa-l](https://huggingface.co/BSC-TeMU/roberta-large-bne) | 0.8772 | 0.9060 | 0.7958 | 355M | |
|
| [Bertin](https://huggingface.co/bertin-project/bertin-roberta-base-spanish/tree/v1-512) | 0.8835 | 0.8990 | 0.7890 | 125M | |
|
| [ELECTRICIDAD](https://huggingface.co/mrm8488/electricidad-base-discriminator) | 0.7954 | 0.9025 | 0.7878 | 109M | |
|
|
|
Some details of our fine-tuning runs (see the sketch after this list):
|
- epochs: 5 |
|
- batch-size: 32 |
|
- learning rate: 1e-4 |
|
- warmup proportion: 0.1 |
|
- linear learning rate decay |
|
- layerwise learning rate decay |
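
As a rough sketch of what such a run could look like with the `transformers` `Trainer` (the XNLI task, column names, and label count below are illustrative assumptions; the layerwise learning rate decay is omitted because it requires a custom optimizer):

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Illustrative task: the Spanish configuration of XNLI (3 labels)
dataset = load_dataset("xnli", "es")
tokenizer = AutoTokenizer.from_pretrained("Recognai/selectra_small")
model = AutoModelForSequenceClassification.from_pretrained("Recognai/selectra_small", num_labels=3)

def tokenize(batch):
    return tokenizer(batch["premise"], batch["hypothesis"], truncation=True)

dataset = dataset.map(tokenize, batched=True)

# Hyperparameters taken from the list above (layerwise LR decay not included)
args = TrainingArguments(
    output_dir="selectra_small_xnli",
    num_train_epochs=5,
    per_device_train_batch_size=32,
    learning_rate=1e-4,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,
)
trainer.train()
```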
|
|
|
For all the details, check out our [selectra repo](https://github.com/recognai/selectra). |
|
|
|
## Training |
|
|
|
We pre-trained our SELECTRA models on the Spanish portion of the [Oscar](https://huggingface.co/datasets/oscar) dataset, which is about 150GB in size. |
|
Each model version is trained for 300k steps, with a warm restart of the learning rate after the first 150k steps. |
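
For reference, the Spanish portion of OSCAR can be streamed with the `datasets` library; the configuration name below (the unshuffled, deduplicated Spanish split) is an assumption and may differ from the exact variant used for pre-training:

```python
from datasets import load_dataset

# Stream the Spanish OSCAR portion (~150GB) instead of downloading it all at once;
# "unshuffled_deduplicated_es" is an assumed config name
oscar_es = load_dataset("oscar", "unshuffled_deduplicated_es", split="train", streaming=True)

print(next(iter(oscar_es))["text"][:200])
```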
|
Some details of the training: |
|
- steps: 300k |
|
- batch-size: 128 |
|
- learning rate: 5e-4 |
|
- warmup steps: 10k |
|
- linear learning rate decay |
|
- TPU cores: 8 (v2-8) |
|
|
|
For all details, check out our [selectra repo](https://github.com/recognai/selectra). |
|
|
|
**Note:** Due to a misconfiguration in the pre-training scripts, the embeddings of vocabulary tokens containing an accent were not optimized. If you fine-tune this model on a downstream task, you might consider using a tokenizer that does not strip the accents:
|
```python |
|
from transformers import ElectraTokenizerFast

tokenizer = ElectraTokenizerFast.from_pretrained("Recognai/selectra_small", strip_accents=False)
|
``` |
|
|
|
## Motivation |
|
|
|
Despite the abundance of excellent Spanish language models (BETO, BSC-BNE, Bertin, ELECTRICIDAD, etc.), we felt there was still a lack of distilled or compact Spanish language models, as well as of systematic comparisons between such models and their bigger siblings.
|
|
|
## Acknowledgment |
|
|
|
This research was supported by the Google TPU Research Cloud (TRC) program. |
|
|
|
## Authors |
|
|
|
- David Fidalgo ([GitHub](https://github.com/dcfidalgo)) |
|
- Javier Lopez ([GitHub](https://github.com/javispp)) |
|
- Daniel Vila ([GitHub](https://github.com/dvsrepo)) |
|
- Francisco Aranda ([GitHub](https://github.com/frascuchon)) |