<caption>Figure 2. Perplexity distributions and quartiles (red lines) of 100M samples of mc4-es.</caption>
</figure>

With the extracted perplexity percentiles, we created two functions to oversample the central quartiles, with the idea of excluding samples that were either too small (short, repetitive texts) or too long (potentially poor quality) (see Figure 3). The first function was a `stepwise` function that simply oversampled the central quartiles, using the quartile boundaries and a `factor` for how heavily these should be oversampled. The second function was a `gaussian` approximation of the `stepwise` function, to smooth out the sharp boundaries and give a better approximation of the underlying distribution (see Figure 4). We adjusted the `factor` parameter of the `stepwise` function, and the `factor` and `width` parameters of the `gaussian` function, so as to sample roughly 50M documents out of the 416M in `mc4-es` (see Figure 4). For comparison, we also randomly sampled `mc4-es` down to 50M documents. In terms of size, we went down from 1TB of data to ~200GB.
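
As a rough sketch of these two schemes (the function names, the exact parameterisation, and the resampling step below are illustrative and may differ from the implementation in this repository), the sampling weights could be computed as follows:

```python
import numpy as np

def stepwise_weight(perplexity, q1, q3, factor=2.0):
    """Oversample the central quartiles: documents whose perplexity falls
    between the first (q1) and third (q3) quartile boundaries get `factor`
    times the base sampling weight."""
    perplexity = np.asarray(perplexity)
    inside = (perplexity >= q1) & (perplexity <= q3)
    return np.where(inside, factor, 1.0)

def gaussian_weight(perplexity, center, width, factor=2.0):
    """Smooth approximation of the stepwise weights: a Gaussian bump of
    height `factor` centred on the median perplexity, with `width`
    controlling how quickly the boost decays towards the tails."""
    perplexity = np.asarray(perplexity)
    return 1.0 + (factor - 1.0) * np.exp(-0.5 * ((perplexity - center) / width) ** 2)

def subsample(perplexities, weight_fn, target_fraction, rng=None, **params):
    """Keep each document with probability proportional to its weight,
    rescaled so that on average ~`target_fraction` of the corpus survives
    (e.g. roughly 50M out of the 416M documents in mc4-es)."""
    rng = rng or np.random.default_rng()
    weights = weight_fn(np.asarray(perplexities), **params)
    keep_prob = np.clip(target_fraction * weights / weights.mean(), 0.0, 1.0)
    return rng.random(len(keep_prob)) < keep_prob
```

Setting `factor=1.0` in either weighting function reduces `subsample` to the plain `random` baseline.
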
<figure>

We then used the same setup as in Liu et al. (2019), but trained only for half the steps (250k) on a sequence length of 128. Then, we continued training the most promising model for 25k more steps on a sequence length of 512.
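
Schematically (the dataclass and checkpoint names below are illustrative, not the repository's actual training code), the schedule amounts to:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Phase:
    max_seq_length: int             # tokens per training example
    num_steps: int                  # optimizer steps in this phase
    init_checkpoint: Optional[str]  # None means training from scratch

# Phase 1 follows the RoBERTa recipe of Liu et al. (2019) at half the steps;
# phase 2 continues the most promising checkpoint at the full sequence length.
SCHEDULE = [
    Phase(max_seq_length=128, num_steps=250_000, init_checkpoint=None),
    Phase(max_seq_length=512, num_steps=25_000, init_checkpoint="best-seq128-checkpoint"),
]
```
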
## Results

Our first test, tagged `beta` in this repository, refers to an initial experiment using `stepwise` on a sequence length of 128, but with a small `factor` to oversample everything. During the community event, the Barcelona Supercomputing Center (BSC), in association with the National Library of Spain, released RoBERTa base and large models trained on 200M documents (570GB) of high-quality data, cleaned using 100 nodes with 48 CPU cores each on MareNostrum 4 for 96 hours. At the end of that process they were left with 2TB of clean data at the document level, which was further cleaned up to the final 570GB. In all our experiments and procedures, we had access to 3xTPUv3-8 for 10 days to do cleaning, sampling, training, and evaluation. The BSC team evaluated our early release of the model `beta`, and the results can be seen in Table 1. We are now waiting for the evaluation of the rest of our experiments to finish. The final models were trained for different numbers of steps and sequence lengths, and achieve different masked-word prediction accuracies. Some of the datasets used for evaluation are not freely available, so we are not in a position to verify the figures.

<figure>

| Dataset     | Metric   | RoBERTa-base | RoBERTa-large | BETO   | mBERT  | BERTIN |
|-------------|----------|--------------|---------------|--------|--------|--------|
| UD-POS      | F1       | 0.9907       | 0.9901        | 0.9900 | 0.9886 | 0.9904 |
| CoNLL-NER   | F1       | 0.8851       | 0.8772        | 0.8759 | 0.8691 | 0.8627 |
| Capitel-POS | F1       | 0.9846       | 0.9851        | 0.9836 | 0.9839 | 0.9826 |
| Capitel-NER | F1       | 0.8959       | 0.8998        | 0.8771 | 0.8810 | 0.8741 |
| STS         | Combined | 0.8423       | 0.8420        | 0.8216 | 0.8249 | 0.7822 |
| MLDoc       | Accuracy | 0.9595       | 0.9600        | 0.9650 | 0.9560 | 0.9673 |
| PAWS-X      | F1       | 0.9035       | 0.9000        | 0.8915 | 0.9020 | 0.8820 |
| XNLI        | Accuracy | 0.8016       | WiP           | 0.8130 | 0.7876 | WiP    |

<caption>Table 1. Evaluation made by the Barcelona Supercomputing Center of their models and BERTIN (beta).</caption>
</figure>

## Conclusions

With roughly 10 days of access to TPUs, we have achieved remarkable results, surpassing the previous state of the art in a few tasks and even improving document classification over models trained on massive supercomputers with huge, private, and highly curated datasets.

The experience has been incredible, and we feel this kind of event provides an amazing opportunity for small teams on low or non-existent budgets to learn how the big players in the field pre-train their models. The trade-off of learning and experimenting while beta-testing libraries (Flax/JAX) and infrastructure (TPU VMs) is a marginal cost to pay compared to the benefits of access.

We hope our work sets the basis for more small teams playing and experimenting with language model training on small subsets of data and for shorter times, since the performance of our models is on par with those trained on big machines for much longer.

## Team members
- Javier de la Rosa ([versae](https://huggingface.co/versae))