versae committed on
Commit 295ff9b
1 Parent(s): 07c4abf

Explanations

Files changed (1)
  1. README.md +27 -2
README.md CHANGED
@@ -61,7 +61,7 @@ In order to test our hypothesis, we first calculated the perplexity of each doc
  <caption>Figure 2. Perplexity distributions and quartiles (red lines) of 100M samples of mc4-es.</caption>
  </figure>

- With the extracted perplexity percentiles, we created two functions to oversample the central quartiles, with the idea of excluding samples that were either too small (short, repetitive texts) or too long (potentially poor quality) (see Figure 3). The first function was a `stepwise` function that simply oversampled the central quartiles using the quartile boundaries and a factor for how heavily these should be oversampled. The second function was a Gaussian approximation of the `stepwise` function to smooth out the sharp boundaries and give a better approximation of the underlying distribution (see Figure 4). We adjusted the `factor` parameter of the `stepwise` function, and the `factor` and `width` parameters of the `gaussian` function, to be able to sample roughly 50M samples from the 416M in `mc4-es` (see Figure 4). For comparison, we also randomly sampled `mc4` up to 50M samples.
+ With the extracted perplexity percentiles, we created two functions to oversample the central quartiles, with the idea of excluding samples that were either too small (short, repetitive texts) or too long (potentially poor quality) (see Figure 3). The first function was a `stepwise` function that simply oversampled the central quartiles using the quartile boundaries and a factor for how heavily these should be oversampled. The second function was a Gaussian approximation of the `stepwise` function to smooth out the sharp boundaries and give a better approximation of the underlying distribution (see Figure 4). We adjusted the `factor` parameter of the `stepwise` function, and the `factor` and `width` parameters of the `gaussian` function, to be able to sample roughly 50M samples from the 416M in `mc4-es` (see Figure 4). For comparison, we also randomly sampled `mc4` up to 50M samples. In terms of size, we went down from 1TB of data to ~200GB.


  <figure>
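A minimal sketch of what these two oversampling functions could look like is shown below, assuming quartile boundaries passed in by the caller and placeholder defaults for `factor` and `width`; this is illustrative only, not the repository's actual implementation.

```python
import numpy as np

def stepwise_weight(perplexity, quartiles, factor=2.0):
    """Relative sampling weight for the `stepwise` strategy: documents whose
    perplexity falls inside the central quartiles are oversampled by `factor`
    with respect to documents in the tails."""
    q1, _, q3 = quartiles
    return factor if q1 <= perplexity <= q3 else 1.0

def gaussian_weight(perplexity, quartiles, factor=2.0, width=1.0):
    """Smooth approximation of the stepwise weighting: a Gaussian bump centered
    on the median whose spread is `width` times the interquartile range and
    whose peak is `factor` times the weight of the far tails."""
    q1, q2, q3 = quartiles
    sigma = width * (q3 - q1)
    return 1.0 + (factor - 1.0) * np.exp(-0.5 * ((perplexity - q2) / sigma) ** 2)

def subsample(docs, perplexities, quartiles, weight_fn=gaussian_weight, seed=0):
    """Keep each document with probability proportional to its weight."""
    rng = np.random.default_rng(seed)
    weights = np.array([weight_fn(p, quartiles) for p in perplexities])
    keep = rng.uniform(size=len(weights)) < weights / weights.max()
    return [doc for doc, kept in zip(docs, keep) if kept]
```

Under a sketch like this, `factor` controls how strongly the central quartiles are favoured and `width` how far the Gaussian bump extends beyond them; tuning them changes how many of the 416M documents survive, which is how a subset of roughly 50M samples would be obtained.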
@@ -113,11 +113,36 @@ The `random` sampling also displayed the same perplexity distribution of the und

  We then used the same setup as in Liu et al. (2019) but trained only for half the steps (250k) on a sequence length of 128. Then, we continued training the most promising model for 25k more steps on a sequence length of 512.

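For orientation, the two phases described above amount to a schedule like the following sketch; only the step counts and sequence lengths come from the text, while the phase names and everything else are assumptions rather than the repository's actual training script.

```python
# Hypothetical summary of the two-phase pre-training schedule described above.
# Only the step counts and sequence lengths are taken from the text.
PHASES = [
    {"phase": "main",         "seq_len": 128, "steps": 250_000},  # half the steps used by Liu et al. (2019)
    {"phase": "continuation", "seq_len": 512, "steps": 25_000},   # most promising checkpoint only
]

for p in PHASES:
    print(f"{p['phase']}: {p['steps']:,} steps at sequence length {p['seq_len']}")
```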
 
+ Our first test, tagged `beta` in this repository, refers to an initial experiment using `stepwise` but with a small factor to oversample everything.
+
  ## Results

- The first version of the model...
+ Our first test, tagged `beta` in this repository, refers to an initial experiment using `stepwise` on sequences of length 128, but with a small `factor` to oversample everything. During the community event, the Barcelona Supercomputing Center, in association with the National Library of Spain, released RoBERTa base and large models trained on 200M documents (570GB) of high-quality data cleaned using 100 nodes with 48 CPU cores of MareNostrum 4 for 96h. At the end of that process they were left with 2TB of clean data at the document level, which was further cleaned down to the final 570GB. In all our experiments and procedures, we had access to 3xTPUv3-8 for 10 days to do cleaning, sampling, training, and evaluation. The BSC team evaluated our early release of the model `beta`, and the results can be seen in Table 1. We are now waiting for the evaluation of the rest of our experiments to finish. The final models were trained for different numbers of steps and sequence lengths, and achieve different masked-word prediction accuracies. Some of the datasets used for evaluation are not freely available, so we are not in a position to verify the figures.
+
+ <figure>
+
+ | Dataset     | Metric   | RoBERTa-b | RoBERTa-l | BETO   | mBERT  | BERTIN |
+ |-------------|----------|-----------|-----------|--------|--------|--------|
+ | UD-POS      | F1       | 0.9907    | 0.9901    | 0.9900 | 0.9886 | 0.9904 |
+ | Conll-NER   | F1       | 0.8851    | 0.8772    | 0.8759 | 0.8691 | 0.8627 |
+ | Capitel-POS | F1       | 0.9846    | 0.9851    | 0.9836 | 0.9839 | 0.9826 |
+ | Capitel-NER | F1       | 0.8959    | 0.8998    | 0.8771 | 0.8810 | 0.8741 |
+ | STS         | Combined | 0.8423    | 0.8420    | 0.8216 | 0.8249 | 0.7822 |
+ | MLDoc       | Accuracy | 0.9595    | 0.9600    | 0.9650 | 0.9560 | 0.9673 |
+ | PAWS-X      | F1       | 0.9035    | 0.9000    | 0.8915 | 0.9020 | 0.8820 |
+ | XNLI        | Accuracy | 0.8016    | WiP       | 0.8130 | 0.7876 | WiP    |
+
+
+ <caption>Table 1. Evaluation made by the Barcelona Supercomputing Center of their models and BERTIN (beta).</caption>
+ </figure>
+
+ ## Conclusions
+
+ With roughly 10 days of access to TPUs, we have achieved remarkable results, surpassing the previous state of the art in a few tasks and even improving on document classification compared with models trained on massive supercomputers using humongous, private, and highly curated datasets.

+ The experience has been incredible, and we feel this kind of event provides an amazing opportunity for small teams on low or non-existent budgets to learn how the big players in the field pre-train their models. The trade-off between learning and experimenting, and being beta testers of libraries (Flax/JAX) and infrastructure (TPU VMs), is a marginal cost to pay compared to the benefits of access.

+ We hope our work sets the basis for more small teams playing and experimenting with language model training on small subsets of data and for shorter times, since the performance of our models is on par with that of models trained on big machines for long times.
  ## Team members
 
  - Javier de la Rosa ([versae](https://huggingface.co/versae))