---
language: es
license: CC-BY 4.0
tags:
- spanish
- roberta
pipeline_tag: fill-mask
widget:
- text: "Fui a la librería a comprar un <mask>."
---
- Version 1 (beta): July 15th, 2021
- Version 1: July 19th, 2021
# Motivation
According to [Wikipedia](https://en.wikipedia.org/wiki/List_of_languages_by_total_number_of_speakers), Spanish is the second most-spoken language in the world by native speakers (more than 470 million, behind only Chinese) and the fourth when those who speak it as a second language are included. However, most NLP research is still mainly available in English. Relevant contributions like BERT, XLNet or GPT2 sometimes take years to become available in Spanish and, when they do, it is often via multilingual versions which are not as performant as the English alternative.
At the time of the event there were no RoBERTa models available in Spanish. Therefore, releasing one such model was the primary goal of our project. During the Flax/JAX Community Event we released a beta version of our model, which was the first in the Spanish language. Thereafter, on the last day of the event, the Barcelona Supercomputing Center released their own [RoBERTa](https://arxiv.org/pdf/2107.07253.pdf) model. The precise timing suggests our work precipitated this publication, and such an increase in competition is a desired outcome of our project. We are grateful for their efforts to include BERTIN in their paper, as discussed further below, and recognize the value of their own contribution, which we also acknowledge in our experiments.
Models in Spanish are hard to come by and, when they do appear, they are often trained on proprietary datasets and with massive resources. In practice, this means that many relevant algorithms and techniques remain exclusive to large technological corporations. This motivates the second goal of our project, which is to bring training of large models like RoBERTa one step closer to smaller groups. We want to explore techniques that make training these architectures easier and faster, thus contributing to the democratization of Deep Learning.
# BERTIN
BERTIN is a series of BERT-based models for Spanish. The current model hub points to the best of all RoBERTa-base models trained from scratch on the Spanish portion of mC4 using [Flax](https://github.com/google/flax). All code and scripts are included.
This is part of the [Flax/Jax Community Week](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104), organized by [HuggingFace](https://huggingface.co/), with TPU usage sponsored by Google Cloud.
The aim of this project was to pre-train a RoBERTa-base model from scratch during the Flax/JAX Community Event, in which Google Cloud provided free TPUv3-8 access for training with the Flax implementations in HuggingFace's libraries.
## Spanish mC4
mC4 is a multilingual variant of C4, the Colossal, Cleaned version of Common Crawl's web crawl corpus. While C4 was used to train the T5 text-to-text Transformer models, mC4 comprises natural text in 101 languages drawn from the public Common Crawl web-scrape and was used to train mT5, the multilingual version of T5.
The Spanish portion of mC4 (`mc4-es`) contains about 416 million samples and 235 billion words in approximately 1TB of uncompressed data.
```bash
$ zcat c4/multilingual/c4-es*.tfrecord*.json.gz | wc -l
416057992
```
```bash
$ zcat c4/multilingual/c4-es*.tfrecord-*.json.gz | jq -r '.text | split(" ") | length' | paste -s -d+ - | bc
235303687795
```
## Perplexity sampling
The large amount of text in mC4-es makes training a language model within the time constraints of the Flax/JAX Community Event by HuggingFace problematic. This motivated the exploration of sampling methods, with the goal of creating a subset of the dataset that allows well-performing training with roughly one eighth of the data (~50M samples) and in approximately half the training steps.
In order to efficiently build this subset of data, we decided to leverage a technique we call *perplexity sampling* and whose origin can be traced to the construction of CCNet (Wenzek et al., 2020) and their work extracting high quality monolingual datasets from web-crawl data. In their work, they suggest the possibility of applying fast language-models trained on high-quality data such as Wikipedia to filter out texts that deviate too much from correct expressions of a language (see Figure 1). They also released Kneser-Ney models for 100 languages (Spanish included) as implemented in the KenLM library (Heafield, 2011) and trained on their respective Wikipedias.
<figure>
![](./images/ccnet.png)
<caption>Figure 1. Perplexity distributions by percentage of the CCNet corpus.</caption>
</figure>
In this work, we tested the hypothesis that perplexity sampling might help
reduce training-data size and training times, while keeping the performance of
the final model.
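As a minimal sketch of the quantity being sampled on, the per-token perplexity of a document can be derived from the total log-probability a language model assigns to it. The names `perplexity` and `toy_score` are hypothetical and not part of our code base; the scorer is assumed to return a total log10 probability that includes the end-of-sentence token, which is how KenLM's Python `Model.score` behaves by default.

```python
from typing import Callable


def perplexity(text: str, log10_score: Callable[[str], float]) -> float:
    """Per-token perplexity of `text` under a language model.

    `log10_score` is assumed to return the total log10 probability of the
    text, including the end-of-sentence token.
    """
    # Whitespace-separated words plus one end-of-sentence token.
    n_tokens = len(text.split()) + 1
    return 10.0 ** (-log10_score(text) / n_tokens)


def toy_score(text: str) -> float:
    """Stand-in for a real model: every token gets probability 1/100."""
    return -2.0 * (len(text.split()) + 1)


ppl = perplexity("Fui a la librería", toy_score)
print(ppl)
```

With a real model, the toy scorer would be replaced by the model's own scoring function; lower values indicate text that the reference model finds more predictable.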
## Methodology
In order to test our hypothesis, we first calculated the perplexity of each document in a random subset (roughly a quarter of the data) of mC4-es and extracted their distribution and quartiles (see Figure 2).
<figure>
![](./images/perp-p95.png)
<caption>Figure 2. Perplexity distributions and quartiles (red lines) of 44M samples of mc4-es.</caption>
</figure>
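Extracting the quartile boundaries from a list of per-document perplexities can be sketched with the standard library alone. The perplexity values below are made up for illustration and the interpolation method is an arbitrary choice, not necessarily the one we used.

```python
import statistics

# Hypothetical perplexity values for a handful of documents.
perplexities = [120.0, 340.0, 410.0, 560.0, 700.0, 980.0, 1500.0, 4200.0]

# With n=4, quantiles() returns the three cut points between the quartiles.
q1, q2, q3 = statistics.quantiles(perplexities, n=4, method="inclusive")
boundaries = (q1, q2, q3)
print(boundaries)
```

The three boundaries split the documents into four groups of roughly equal size, which is what any quartile-based sampling function needs as input.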
With the extracted perplexity percentiles, we created two functions to oversample the central quartiles, with the idea of biasing against samples whose perplexity is either too low (short, repetitive texts) or too high (potentially poor quality) (see Figure 3).
The first function is a `Stepwise` that simply oversamples the central quartiles using quartile boundaries and a factor for the desired sampling frequency for each quartile, giving larger frequencies to the middle quartiles (oversampling Q2 and Q3, subsampling Q1 and Q4).
The second function weighted the perplexity distribution by a Gaussian-like
function, to smooth out the sharp boundaries of the `Stepwise` function and
give a better approximation to the desired underlying distribution (see Figure 4).
We adjusted the `factor` parameter of the `Stepwise` function, and the `factor` and `width` parameters of the `Gaussian` function, so that we could draw roughly 50M samples from the 416M in `mc4-es` (see Figure 4). For comparison, we also randomly sampled `mc4-es` up to 50M samples. In terms of size, we went down from 1TB of data to ~200GB.
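The two sampling functions can be sketched as keep-probabilities that depend on a document's perplexity. This is an illustration under stated assumptions, not our exact implementation: the function names, the way the tails are handled in the stepwise case, and every numeric value (boundaries, `factor`, `width`) are hypothetical.

```python
import math
import random


def stepwise_keep_probability(perplexity, boundaries, factor):
    """Keep-probability that is constant within each quartile.

    Assumption: the probability is `factor` divided by the width of the
    quartile, so the narrow central quartiles are favored. The widths used
    for the two open-ended tails are an arbitrary choice.
    """
    q1, q2, q3 = boundaries
    if perplexity <= q1:
        width = q1
    elif perplexity <= q2:
        width = q2 - q1
    elif perplexity <= q3:
        width = q3 - q2
    else:
        width = 10 * q3
    return min(1.0, factor / width)


def gaussian_keep_probability(perplexity, median, factor, width):
    """Keep-probability that decays smoothly away from the median."""
    relative_distance = (perplexity - median) / median
    return factor * math.exp(-(relative_distance ** 2) / width)


# Illustrative values only, not the ones used in our experiments.
boundaries = (500.0, 650.0, 900.0)
rng = random.Random(0)
perplexities = [rng.uniform(50.0, 3000.0) for _ in range(10_000)]

kept_stepwise = [
    p for p in perplexities
    if rng.random() < stepwise_keep_probability(p, boundaries, factor=100.0)
]
kept_gaussian = [
    p for p in perplexities
    if rng.random() < gaussian_keep_probability(p, median=650.0, factor=0.8, width=4.5)
]
print(len(kept_stepwise), len(kept_gaussian))
```

In both cases `factor` scales the overall fraction of documents kept, which is the knob to turn to reach a target subset size, while `width` controls how quickly the Gaussian weighting falls off away from the median.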
<figure>
![](./images/perp-resample.png)
<caption>Figure 3. Expected perplexity distributions of the sample `mc4-es` after applying the `Stepwise` function.</caption>
</figure>
<figure>
![](./images/perp-resample-gaussian.png)
<caption>Figure 4. Expected perplexity distributions of the sample `mc4-es` after applying `Gaussian` function.</caption>
</figure>
Figure 5 shows the actual perplexity distributions of the generated 50M subsets for each of the executed subsampling procedures. All subsets can be easily accessed for reproducibility purposes using the `bertin-project/mc4-es-sampled` dataset. Since the validation set was too small to extract 10% (5M) of the samples using perplexity sampling with the same `factor` and `width`, in our experiments we decided to sample from the training sets. In the `bertin-project/mc4-es-sampled` dataset, the `validation` set pulls the samples from the original `mc4`.
```python
from datasets import load_dataset
# Stream each sampled configuration and print one example from it.
for config in ("random", "stepwise", "gaussian"):
    mc4es = load_dataset(
        "bertin-project/mc4-es-sampled",
        config,
        split="train",
        streaming=True
    ).shuffle(buffer_size=1000)
    for sample in mc4es:
        print(config, sample)
        break
```
<figure>
![](./images/datasets-perp.png)
<caption>Figure 5. Experimental perplexity distributions of the sampled
`mc4-es` after applying `Gaussian` and `Stepwise` functions, and the `Random`
control sample.</caption>
</figure>
`Random` sampling displayed the same perplexity distribution as the underlying true distribution, as can be seen in Figure 6.
<figure>
![](./images/datasets-random-comparison.png)
<caption>Figure 6. Experimental perplexity distribution of the sampled `mc4-es` after applying `Random` sampling.</caption>
</figure>
We then used the same setup as Liu et al. (2019) but trained only for half the steps (250k) on a sequence length of 128. In particular, `Gaussian` trained for the 250k steps, while `Random` was stopped at 230k and `Stepwise` at 180k (this was a decision based on an analysis of training performance and the computational resources available at the time).
Then, we continued training the most promising model for a few more steps (~25k) on sequence length 512. We tried two strategies for this, since it is not easy to find clear details about this change in the literature. It turns out this decision had a big impact on the final performance.
For `Random` sampling we trained with sequence length 512 during the last 20k steps of the 250k training steps, keeping the optimizer state intact. The results are underwhelming, as seen in Figure 7:
<figure>
![](./images/random_512.jpg)
<caption>Figure 7. Training profile for Random sampling. Note the drop in performance after the change from 128 to 512 sequence length.</caption>
</figure>
For `Gaussian` sampling we started a new optimizer after 230k steps at sequence length 128, using a short warmup interval. Results are much better using this procedure. We do not have a graph, since training needed to be restarted several times. However, final accuracy was 0.6873, compared to 0.5907 for `Random` (512), a difference much larger than that between their respective 128 models (0.6520 for `Random`, 0.6608 for `Gaussian`).
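Restarting the optimizer means the learning-rate schedule also starts again from step 0 for the 512 phase. A minimal sketch of such a schedule follows; the function name, the linear decay shape, and all numeric values are assumptions for illustration, since the source only states that a short warmup was used.

```python
def linear_warmup_decay(step, peak_lr, warmup_steps, total_steps):
    """Learning rate with linear warmup to `peak_lr`, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Hypothetical settings for the extra sequence-length-512 phase.
PEAK_LR = 1e-4
WARMUP_STEPS = 500
TOTAL_STEPS = 25_000

schedule = [
    linear_warmup_decay(step, PEAK_LR, WARMUP_STEPS, TOTAL_STEPS)
    for step in (0, 250, 500, 12_750, 25_000)
]
print(schedule)
```

The point of the warmup is that a freshly initialized optimizer has no gradient statistics yet, so the first updates at the new sequence length are kept small.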
## Results
Our first test, tagged `beta` in this repository, refers to an initial experiment using `Stepwise` on 128 sequence length and trained for 210k steps. Two nearly identical versions of this model can be found, one at **bertin-roberta-base-spanish** and the other at **flax-community/bertin-roberta-large-spanish** (do note this is **not our best model**!). During the community event, the Barcelona Supercomputing Center (BSC), in association with the National Library of Spain, released RoBERTa base and large models trained on 200M documents (570GB) of high-quality data, cleaned using 100 nodes with 48 CPU cores of MareNostrum 4 for 96h. At the end of the process they were left with 2TB of clean data at the document level, which was further cleaned down to the final 570GB. This is an interesting contrast to our own resources (3xTPUv3-8 for 10 days to do cleaning, sampling, training, and evaluation) and makes for a valuable reference. The BSC team evaluated our early release of the model `beta`, and the results can be seen in Table 1.
Our final models were trained for a different number of steps and with different sequence lengths, and they achieve different (higher) masked-word prediction accuracies. Despite these limitations, it is interesting to see the results they obtained using the early version of our model. Note that some of the datasets used for evaluation by BSC are not freely available, so it is not possible to verify the figures.
<figure>
<caption>Table 1. Evaluation made by the Barcelona Supercomputing Center of their models and BERTIN (beta, seq len 128).</caption>
| Dataset | Metric | RoBERTa-b | RoBERTa-l | BETO | mBERT | BERTIN |
|-------------|----------|-----------|-----------|--------|--------|--------|
| UD-POS | F1 | **0.9907** | 0.9901 | 0.9900 | 0.9886 | **0.9904** |
| Conll-NER | F1 | 0.8851 | 0.8772 | 0.8759 | 0.8691 | 0.8627 |
| Capitel-POS | F1 | 0.9846 | 0.9851 | 0.9836 | 0.9839 | 0.9826 |
| Capitel-NER | F1 | 0.8959 | 0.8998 | 0.8771 | 0.8810 | 0.8741 |
| STS | Combined | 0.8423 | 0.8420 | 0.8216 | 0.8249 | 0.7822 |
| MLDoc | Accuracy | 0.9595 | 0.9600 | 0.9650 | 0.9560 | **0.9673** |
| PAWS-X | F1 | 0.9035 | 0.9000 | 0.8915 | 0.9020 | 0.8820 |
| XNLI | Accuracy | 0.8016 | WiP | 0.8130 | 0.7876 | WiP |
</figure>
All of our models attained good accuracy values, in the range of 0.65, as can be seen in Table 2:
<figure>
<caption>Table 2. Accuracy for the different language models.</caption>
| Model | Accuracy |
|----------------------------------------------------|----------|
| bertin-project/bertin-roberta-base-spanish | 0.6547 |
| bertin-project/bertin-base-random | 0.6520 |
| bertin-project/bertin-base-stepwise | 0.6487 |
| bertin-project/bertin-base-gaussian | 0.6608 |
| bertin-project/bertin-base-random-exp-512seqlen | 0.5907 |
| bertin-project/bertin-base-gaussian-exp-512seqlen | **0.6873** |
</figure>
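Masked-word prediction accuracy is the fraction of masked positions at which the model's top prediction matches the original token. A minimal sketch follows, assuming the common convention (used by Hugging Face data collators) that positions not selected for masking carry the label `-100`; the function name and the token ids are hypothetical.

```python
IGNORE_INDEX = -100  # label assumed for positions that were not masked


def masked_accuracy(predicted_ids, label_ids):
    """Accuracy over masked positions only; unmasked positions are skipped."""
    pairs = [(p, y) for p, y in zip(predicted_ids, label_ids) if y != IGNORE_INDEX]
    if not pairs:
        return 0.0
    return sum(p == y for p, y in pairs) / len(pairs)


# Four positions, three of them masked; two of the three are predicted correctly.
accuracy = masked_accuracy([5, 7, 9, 2], [IGNORE_INDEX, 7, 3, 2])
print(accuracy)
```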
We are currently in the process of applying our language models to downstream tasks.
<figure>
<caption>Table x.</caption>
| Dataset | Metric | BERT-m | BERT-wwm | BSC-BNE | Beta | Random | Stepwise | Gaussian | Random-512 | Gaussian-512 |
|----------|---------------|--------|----------|----------|--------|---------|------------|-----------|--------------|---------------|
| CoNLL 2002-POS | F1 | 0.9629 | 0.9642 | 0.9659 | 0.9638 | 0.9656 | 0.9656 | **0.9662** | 0.9660 | **0.9662** |
| CoNLL 2002-POS | Accuracy | 0.9687 | 0.9700 | 0.9707 | 0.9690 | 0.9704 | 0.9707 | 0.9709 | 0.9707 | **0.9714** |
</figure>
## SQUAD-es
Using sequence length 128 we have achieved exact match 50.96 and F1 68.74.
## POS
All models were trained with max length 512 and batch size 8, using the CoNLL 2002 dataset.
<figure>
<caption>Table 3. Results for POS.</caption>
| Model | F1 | Accuracy |
|----------------------------------------------------|----------|----------|
| bert-base-multilingual-cased | 0.9629 | 0.9687 |
| dccuchile/bert-base-spanish-wwm-cased | 0.9642 | 0.9700 |
| BSC-TeMU/roberta-base-bne | 0.9659 | 0.9707 |
| bertin-project/bertin-roberta-base-spanish | 0.9638 | 0.9690 |
| bertin-project/bertin-base-random | 0.9656 | 0.9704 |
| bertin-project/bertin-base-stepwise | 0.9656 | 0.9707 |
| bertin-project/bertin-base-gaussian | **0.9662** | 0.9709 |
| bertin-project/bertin-base-random-exp-512seqlen | 0.9660 | 0.9707 |
| bertin-project/bertin-base-gaussian-exp-512seqlen | **0.9662** | **0.9714** |
</figure>
## NER
All models were trained with max length 512 and batch size 8, using the CoNLL 2002 dataset.
<figure>
<caption>Table 4. Results for NER.</caption>
| Model | F1 | Accuracy |
|----------------------------------------------------|----------|----------|
| bert-base-multilingual-cased | 0.8539 | 0.9779 |
| dccuchile/bert-base-spanish-wwm-cased | 0.8579 | 0.9783 |
| BSC-TeMU/roberta-base-bne | 0.8700 | 0.9807 |
| bertin-project/bertin-roberta-base-spanish | 0.8725 | 0.9812 |
| bertin-project/bertin-base-random | 0.8704 | 0.9807 |
| bertin-project/bertin-base-stepwise | 0.8705 | 0.9809 |
| bertin-project/bertin-base-gaussian                | **0.8792** | 0.9816 |
| bertin-project/bertin-base-random-exp-512seqlen    | 0.8616   | 0.9803   |
| bertin-project/bertin-base-gaussian-exp-512seqlen  | 0.8764 | **0.9819** |
</figure>
## PAWS-X
All models were trained with max length 512 and batch size 8. These numbers are surprising both for the repeated instances of 0.5765 accuracy and for the large differences in performance. However, the experiments have been repeated several times and the results are consistent.
<figure>
<caption>Table 5. Results for PAWS-X.</caption>
| Model | Accuracy |
|----------------------------------------------------|----------|
| bert-base-multilingual-cased | 0.5765 |
| dccuchile/bert-base-spanish-wwm-cased | 0.8720 |
| BSC-TeMU/roberta-base-bne | 0.5765 |
| bertin-project/bertin-roberta-base-spanish | 0.5765 |
| bertin-project/bertin-base-random | 0.8800 |
| bertin-project/bertin-base-stepwise | 0.8825 |
| bertin-project/bertin-base-gaussian | 0.8875 |
| bertin-project/bertin-base-random-exp-512seqlen | 0.6735 |
| bertin-project/bertin-base-gaussian-exp-512seqlen | **0.8965** |
</figure>
## XNLI
<figure>
<caption>Table 6. Results for XNLI with sequence length 256 and batch size 32.</caption>
| Model | Accuracy |
|----------------------------------------------------|----------|
| bert-base-multilingual-cased | 0.7852 |
| dccuchile/bert-base-spanish-wwm-cased | **0.8186** |
| BSC-TeMU/roberta-base-bne | 0.8178 |
| bertin-project/bertin-base-random | 0.7745 |
| bertin-project/bertin-base-stepwise | 0.7820 |
| bertin-project/bertin-base-gaussian | 0.7942 |
| bertin-project/bertin-base-random-exp-512seqlen | 0.7723 |
| bertin-project/bertin-base-gaussian-exp-512seqlen | 0.7878 |
</figure>
<figure>
<caption>Table 7. Results for XNLI with sequence length 512 and batch size 16.</caption>
| Model | Accuracy |
|----------------------------------------------------|----------|
| bert-base-multilingual-cased | WIP |
| dccuchile/bert-base-spanish-wwm-cased | WIP |
| BSC-TeMU/roberta-base-bne | WIP |
| bertin-project/bertin-base-random | WIP |
| bertin-project/bertin-base-stepwise | WIP |
| bertin-project/bertin-base-gaussian | WIP |
| bertin-project/bertin-base-random-exp-512seqlen | 0.7799 |
| bertin-project/bertin-base-gaussian-exp-512seqlen | 0.7843 |
</figure>
# Conclusions
With roughly 10 days' worth of access to 3xTPUv3-8, we have achieved remarkable results, surpassing the previous state of the art in a few tasks and even improving document classification over models trained on massive supercomputers with very large, private, and highly curated datasets.
The experience has been incredible, and we feel that this kind of event provides an amazing opportunity for small teams on low or non-existent budgets to learn how the big players in the field pre-train their models. Acting as beta-testers of libraries (Flax/JAX) and infrastructure (TPU VMs) while learning and experimenting is a marginal cost to pay compared to the benefits such access has to offer.
We hope our work will lay the groundwork for more small teams to experiment with training language models on smaller subsets of huge datasets with reduced training times, since the performance of our models is on par with that of models trained on big machines for longer.
## Team members
- Javier de la Rosa ([versae](https://huggingface.co/versae))
- Eduardo González ([edugp](https://huggingface.co/edugp))
- Paulo Villegas ([paulo](https://huggingface.co/paulo))
- Pablo González de Prado ([Pablogps](https://huggingface.co/Pablogps))
- Manu Romero ([mrm8488](https://huggingface.co/))
- María Grandury ([mariagrandury](https://huggingface.co/))
## Useful links
- [Community Week timeline](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104#summary-timeline-calendar-6)
- [Community Week README](https://github.com/huggingface/transformers/blob/master/examples/research_projects/jax-projects/README.md)
- [Community Week thread](https://discuss.huggingface.co/t/bertin-pretrain-roberta-large-from-scratch-in-spanish/7125)
- [Community Week channel](https://discord.com/channels/858019234139602994/859113060068229190)
- [Masked Language Modelling example scripts](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling)
- [Model Repository](https://huggingface.co/flax-community/bertin-roberta-large-spanish/)
## References
- Wenzek, G., Lachaux, M.-A., Conneau, A., Chaudhary, V., Guzmán, F., Joulin, A., & Grave, E. (2020). CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC), pp. 4003-4012.
- Heafield, K. (2011). KenLM: faster and smaller language model queries. In Proceedings of the EMNLP2011 Sixth Workshop on Statistical Machine Translation.