Paulo committed on
Commit
6c97ec0
1 Parent(s): 71df214

minor style changes

Files changed (1)
  1. README.md +15 -7
README.md CHANGED
@@ -19,7 +19,7 @@ BERTIN is a series of BERT-based models for Spanish. The current model hub point
19
  This is part of the
20
  [Flax/Jax Community Week](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104), organized by [HuggingFace](https://huggingface.co/) and TPU usage sponsored by Google Cloud.
21
 
22
- The aim of this project was to pre-train a RoBERTa-base model from scratch for during the Flax/JAX Community Event in which Google Cloud provided free TPUv3-8 to do the training using Huggingface's Flax implementations of their library.
23
 
24
  ## Spanish mC4
25
 
@@ -50,7 +50,9 @@ In order to efficiently build this subset of data, we decided to leverage a tech
50
  <caption>Figure 1. Perplexity distributions by percentage CCNet corpus.</caption>
51
  </figure>
52
 
53
- In this work, we tested the hypothesis that perplexity sampling might help reduce training-data size and training times.
 
 
54
 
55
  ## Methodology
56
 
@@ -60,13 +62,15 @@ In order to test our hypothesis, we first calculated the perplexity of each docu
60
 
61
  ![](./images/perp-p95.png)
62
 
63
- <caption>Figure 2. Perplexity distributions and quartiles (red lines) of 100M samples of mc4-es.</caption>
64
  </figure>
65
 
66
  With the extracted perplexity percentiles, we created two functions to oversample the central quartiles with the idea of biasing against samples that are either too small (short, repetitive texts) or too long (potentially poor quality) (see Figure 3).
67
 
68
  The first function is a `Stepwise` that simply oversamples the central quartiles using quartile boundaries and a factor for the desired sampling frequency for each quartile, obviously given larger frequencies for middle quartiles (oversampling Q2, Q3, subsampling Q1, Q4).
69
- The second function was a Gaussian approximation of the `Stepwise` function to smooth out the sharp boundaries and give a better approximation of the underlying distribution (see Figure 4).
 
 
70
 
71
  We adjusted the `factor` parameter of the `Stepwise` function, and the `factor` and `width` parameter of the `Gaussian` function to roughly be able to sample 50M samples from the 416M in `mc4-es` (see Figure 4). For comparison, we also sampled randomly `mC4-es` up to 50M samples as well. In terms of sizes, we went down from 1TB of data to ~200GB.
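
As a rough illustration of the two sampling functions described above, the following sketch computes a keep-probability per document from its perplexity. It is not the project's actual implementation: the function names (`stepwise_keep_prob`, `gaussian_keep_prob`), the exact form of the Gaussian weighting, the parameter values, and the synthetic perplexities are all assumptions made for this example.

```python
import numpy as np


def stepwise_keep_prob(perplexity, boundaries, factors):
    """Keep-probability that is constant within each quartile.

    boundaries: the three quartile boundaries (Q1, Q2, Q3) of the perplexity.
    factors: four probabilities, one per quartile (hypothetical values).
    """
    # searchsorted maps each perplexity to the index (0..3) of its quartile.
    idx = np.searchsorted(boundaries, perplexity, side="right")
    return np.asarray(factors)[idx]


def gaussian_keep_prob(perplexity, center, factor, width):
    """Smooth, Gaussian-like keep-probability peaking at `center`.

    Assumed form: factor * exp(-((p - center) / center)**2 / width).
    """
    z = (np.asarray(perplexity, dtype=float) - center) / center
    return factor * np.exp(-(z ** 2) / width)


# Synthetic perplexities, only for demonstration (not real mc4-es values).
rng = np.random.default_rng(0)
perplexities = rng.lognormal(mean=6.0, sigma=0.5, size=10_000)
q1, q2, q3 = np.quantile(perplexities, [0.25, 0.5, 0.75])

# Oversample the central quartiles (Q2, Q3) relative to Q1 and Q4.
p_step = stepwise_keep_prob(
    perplexities, (q1, q2, q3), factors=(0.05, 0.20, 0.20, 0.05)
)
p_gauss = gaussian_keep_prob(perplexities, center=q2, factor=0.8, width=0.5)

# A document is kept when a uniform draw falls below its keep-probability.
keep_step = rng.random(perplexities.size) < p_step
keep_gauss = rng.random(perplexities.size) < p_gauss
```

In a sketch like this, tuning `factor` (and `width` for the Gaussian variant) changes the expected fraction of documents kept, which is how one would steer the subset towards a target size.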
72
 
@@ -85,7 +89,8 @@ We adjusted the `factor` parameter of the `Stepwise` function, and the `factor`
85
  <caption>Figure 4. Expected perplexity distributions of the sample `mc4-es` after applying `Gaussian` function.</caption>
86
  </figure>
87
 
88
- Figure 5 shows the perplexity distributions of the 50M subsets for each of the approximations. All subsets can be easily accessed for reproducibility purposes using the `bertin-project/mc4-es-sampled` dataset. Since the validation set was too small to extract a 10% (5M) of the samples using perplexity-sampling with the same `factor` and `width`, in our experiments we decided to sample from the training sets. In the `bertin-project/mc4-es-sampled` dataset, the `validation` set pulls the samples from the original `mc4`.
 
89
 
90
  ```python
91
  from datasets import load_dataset
@@ -106,7 +111,9 @@ for split in ("random", "stepwise", "gaussian"):
106
 
107
  ![](./images/datasets-perp.png)
108
 
109
- <caption>Figure 5. Experimental perplexity distributions of the sampled `mc4-es` after applying `Gaussian` and `Stepwise` functions.</caption>
 
 
110
  </figure>
111
 
112
  `Random` sampling displayed the same perplexity distribution of the underlying true distribution, as can be seen in Figure 6.
@@ -270,7 +277,8 @@ With roughly 10 days worth of access to 3xTPUv3-8, we have achieved remarkable r
270
 
271
  The experience has been incredible and we feel this kind of events provide an amazing opportunity for small teams on low or non-existent budgets to learn how the big players in the field pre-train their models. The trade-off between learning and experimenting, and being beta-testers of libraries (Flax/JAX) and infrastructure (TPU VMs) is a marginal cost to pay compared to the benefits such access has to offer.
272
 
273
- We hope our work will set the basis for more small teams playing and experimenting with language models training on small subsets of data with reduced training times, since the performance of our models is on par with those trained on big machines for longer times.
 
274
 
275
  ## Team members
276
 
 
19
  This is part of the
20
  [Flax/Jax Community Week](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104), organized by [HuggingFace](https://huggingface.co/) and TPU usage sponsored by Google Cloud.
21
 
22
+ The aim of this project was to pre-train a RoBERTa-base model from scratch during the Flax/JAX Community Event, in which Google Cloud provided free TPUv3-8 to do the training using Huggingface's Flax implementations of their library.
23
 
24
  ## Spanish mC4
25
 
 
50
  <caption>Figure 1. Perplexity distributions by percentage CCNet corpus.</caption>
51
  </figure>
52
 
53
+ In this work, we tested the hypothesis that perplexity sampling might help
54
+ reduce training-data size and training times, while keeping the performance of
55
+ the final model.
56
 
57
  ## Methodology
58
 
 
62
 
63
  ![](./images/perp-p95.png)
64
 
65
+ <caption>Figure 2. Perplexity distributions and quartiles (red lines) of 44M samples of mc4-es.</caption>
66
  </figure>
67
 
68
  With the extracted perplexity percentiles, we created two functions to oversample the central quartiles with the idea of biasing against samples that are either too small (short, repetitive texts) or too long (potentially poor quality) (see Figure 3).
69
 
70
  The first function is a `Stepwise` that simply oversamples the central quartiles using quartile boundaries and a factor for the desired sampling frequency for each quartile, obviously given larger frequencies for middle quartiles (oversampling Q2, Q3, subsampling Q1, Q4).
71
+ The second function weighted the perplexity distribution by a Gaussian-like
72
+ function, to smooth out the sharp boundaries of the `Stepwise` function and
73
+ give a better approximation to the desired underlying distribution (see Figure 4).
74
 
75
  We adjusted the `factor` parameter of the `Stepwise` function, and the `factor` and `width` parameter of the `Gaussian` function to roughly be able to sample 50M samples from the 416M in `mc4-es` (see Figure 4). For comparison, we also sampled randomly `mC4-es` up to 50M samples as well. In terms of sizes, we went down from 1TB of data to ~200GB.
76
 
 
89
  <caption>Figure 4. Expected perplexity distributions of the sample `mc4-es` after applying `Gaussian` function.</caption>
90
  </figure>
91
 
92
+ Figure 5 shows the actual perplexity distributions of the generated 50M subsets for
93
+ each of the executed subsampling procedures. All subsets can be easily accessed for reproducibility purposes using the `bertin-project/mc4-es-sampled` dataset. Since the validation set was too small to extract a 10% (5M) of the samples using perplexity-sampling with the same `factor` and `width`, in our experiments we decided to sample from the training sets. In the `bertin-project/mc4-es-sampled` dataset, the `validation` set pulls the samples from the original `mc4`.
94
 
95
  ```python
96
  from datasets import load_dataset
 
111
 
112
  ![](./images/datasets-perp.png)
113
 
114
+ <caption>Figure 5. Experimental perplexity distributions of the sampled
115
+ `mc4-es` after applying `Gaussian` and `Stepwise` functions, and the `Random`
116
+ control sample.</caption>
117
  </figure>
118
 
119
  `Random` sampling displayed the same perplexity distribution of the underlying true distribution, as can be seen in Figure 6.
 
277
 
278
  The experience has been incredible and we feel this kind of events provide an amazing opportunity for small teams on low or non-existent budgets to learn how the big players in the field pre-train their models. The trade-off between learning and experimenting, and being beta-testers of libraries (Flax/JAX) and infrastructure (TPU VMs) is a marginal cost to pay compared to the benefits such access has to offer.
279
 
280
+ We hope our work will set the basis for more small teams playing and
281
+ experimenting with language models training on smaller subsets of huge datasets with reduced training times, since the performance of our models is on par with those trained on big machines for longer times.
282
 
283
  ## Team members
284