Update README.md
README.md
CHANGED
@@ -75,9 +75,7 @@ In order to test our hypothesis, we first calculated the perplexity of each docu
|
|
75 |
With the extracted perplexity percentiles, we created two functions to oversample the central quartiles, with the idea of biasing against samples whose perplexity is either too low (short, repetitive texts) or too high (potentially poor quality) (see Figure 3).
|
76 |
|
77 |
The first function, `Stepwise`, simply oversamples the central quartiles using the quartile boundaries and a factor that sets the desired sampling frequency for each quartile, giving larger frequencies to the middle quartiles (oversampling Q2 and Q3, subsampling Q1 and Q4).
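As a rough illustration of the `Stepwise` idea, the sketch below maps each perplexity value to a keep probability that depends only on the quartile it falls into. The function name, the example boundaries, and the per-quartile probabilities are hypothetical placeholders, not the values or code used in the project.

```python
import numpy as np


def stepwise_keep_probability(perplexity, boundaries, probabilities):
    """Return a keep probability per document based on its perplexity quartile.

    boundaries: the three quartile boundaries (Q1/Q2, Q2/Q3, Q3/Q4), ascending.
    probabilities: four keep probabilities, one per quartile (hypothetical values).
    """
    perplexity = np.asarray(perplexity, dtype=float)
    # searchsorted returns 0..3, i.e. the index of the quartile bucket.
    bucket = np.searchsorted(np.asarray(boundaries, dtype=float), perplexity, side="right")
    return np.asarray(probabilities, dtype=float)[bucket]


# Illustrative values only: favour Q2 and Q3 over Q1 and Q4.
boundaries = [10.0, 20.0, 30.0]
probabilities = [0.05, 0.30, 0.30, 0.05]
perplexities = np.array([5.0, 15.0, 25.0, 35.0])

keep_prob = stepwise_keep_probability(perplexities, boundaries, probabilities)

# One Bernoulli draw per document decides whether it enters the subsample.
rng = np.random.default_rng(0)
keep_mask = rng.uniform(size=perplexities.shape) < keep_prob
```

The sharp jump in probability at each boundary is what the Gaussian-like alternative is meant to smooth out.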
|
78 |
-
The second function weighted the perplexity distribution by a Gaussian-like
|
79 |
-
function, to smooth out the sharp boundaries of the `Stepwise` function and
|
80 |
-
give a better approximation to the desired underlying distribution (see Figure 4).
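A minimal sketch of a Gaussian-like weighting follows, assuming the curve is centred on the median perplexity and that `factor` scales its height while `width` controls its spread. The exact functional form and all numeric values here are assumptions for illustration, not the project's implementation.

```python
import numpy as np


def gaussian_keep_probability(perplexity, median, factor, width):
    """Keep probability that decays smoothly away from the median perplexity.

    The form factor * exp(-z**2 / width), with z the relative distance from the
    median, is an assumed parametrisation chosen for illustration.
    """
    perplexity = np.asarray(perplexity, dtype=float)
    # Relative distance from the median perplexity (assumed centre of the curve).
    z = (perplexity - median) / median
    return factor * np.exp(-(z ** 2) / width)


# Hypothetical values: the probability peaks at the median and falls off on both sides.
probs = gaussian_keep_probability([10.0, 20.0, 30.0], median=20.0, factor=0.8, width=4.5)
```

Under this parametrisation, documents at the median are kept with probability `factor`, and documents equally far below and above the median are treated symmetrically.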
|
81 |
|
82 |
We adjusted the `factor` parameter of the `Stepwise` function, and the `factor` and `width` parameters of the `Gaussian` function, to be able to sample roughly 50M samples from the 416M in `mc4-es` (see Figure 4). For comparison, we also randomly sampled `mc4-es` up to 50M samples. In terms of size, we went down from 1TB of data to ~200GB.
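One simple way to pick such a `factor` is to make the expected fraction of kept documents match the target (50M out of 416M here). The sketch below does this for an assumed Gaussian-like weighting on a synthetic pilot sample. The helper name, the functional form, the `width` value, and the synthetic data are all hypothetical, and this is not necessarily how the parameters were tuned in the project.

```python
import numpy as np


def calibrate_gaussian_factor(perplexities, width, target_fraction):
    """Pick `factor` so that the mean keep probability equals target_fraction.

    Assumes keep probability = factor * exp(-z**2 / width), where z is the
    relative distance from the median perplexity of the pilot sample.
    """
    perplexities = np.asarray(perplexities, dtype=float)
    median = np.median(perplexities)
    shape = np.exp(-(((perplexities - median) / median) ** 2) / width)
    factor = target_fraction / shape.mean()
    if factor > 1.0:
        # The peak probability would exceed 1, so the target is unreachable.
        raise ValueError("target fraction is not reachable with this width")
    return factor, median


# Synthetic stand-in for a pilot sample of document perplexities.
rng = np.random.default_rng(0)
pilot = rng.lognormal(mean=6.0, sigma=0.5, size=10_000)

target = 50 / 416  # roughly 12% of the documents
factor, median = calibrate_gaussian_factor(pilot, width=4.5, target_fraction=target)

# Check: the expected kept fraction under the calibrated factor.
expected_kept_fraction = np.mean(factor * np.exp(-(((pilot - median) / median) ** 2) / 4.5))
```

Because the expected number of kept documents is the sum of the individual keep probabilities, scaling `factor` changes the subsample size without changing the shape of the weighting.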
|
83 |
|
@@ -86,18 +84,18 @@ We adjusted the `factor` parameter of the `Stepwise` function, and the `factor`
|
|
86 |
|
87 |
![](./images/perp-resample.png)
|
88 |
|
89 |
-
<caption>Figure 3. Expected perplexity distributions of the sample
|
90 |
</figure>
|
91 |
|
92 |
<figure>
|
93 |
|
94 |
![](./images/perp-resample-gaussian.png)
|
95 |
|
96 |
-
<caption>Figure 4. Expected perplexity distributions of the sample
|
97 |
</figure>
|
98 |
|
99 |
Figure 5 shows the actual perplexity distributions of the generated 50M subsets for
|
100 |
-
each of the executed subsampling procedures. All subsets can be easily accessed for reproducibility purposes using the
|
101 |
|
102 |
```python
|
103 |
from datasets import load_dataset
|
@@ -118,9 +116,7 @@ for config in ("random", "stepwise", "gaussian"):
|
|
118 |
|
119 |
![](./images/datasets-perp.png)
|
120 |
|
121 |
-
<caption>Figure 5. Experimental perplexity distributions of the sampled
|
122 |
-
`mc4-es` after applying `Gaussian` and `Stepwise` functions, and the `Random`
|
123 |
-
control sample.</caption>
|
124 |
</figure>
|
125 |
|
126 |
`Random` sampling displayed the same perplexity distribution as the underlying true distribution, as can be seen in Figure 6.
|
@@ -129,7 +125,7 @@ control sample.</caption>
|
|
129 |
|
130 |
![](./images/datasets-random-comparison.png)
|
131 |
|
132 |
-
<caption>Figure 6. Experimental perplexity distribution of the sampled
|
133 |
</figure>
|
134 |
|
135 |
|
|
|
75 |
With the extracted perplexity percentiles, we created two functions to oversample the central quartiles, with the idea of biasing against samples whose perplexity is either too low (short, repetitive texts) or too high (potentially poor quality) (see Figure 3).
|
76 |
|
77 |
The first function, `Stepwise`, simply oversamples the central quartiles using the quartile boundaries and a factor that sets the desired sampling frequency for each quartile, giving larger frequencies to the middle quartiles (oversampling Q2 and Q3, subsampling Q1 and Q4).
|
78 |
+
The second function weights the perplexity distribution with a Gaussian-like function, which smooths out the sharp boundaries of the `Stepwise` function and gives a better approximation to the desired underlying distribution (see Figure 4).
|
|
|
|
|
79 |
|
80 |
We adjusted the `factor` parameter of the `Stepwise` function, and the `factor` and `width` parameters of the `Gaussian` function, to be able to sample roughly 50M samples from the 416M in `mc4-es` (see Figure 4). For comparison, we also randomly sampled `mc4-es` up to 50M samples. In terms of size, we went down from 1TB of data to ~200GB.
|
81 |
|
|
|
84 |
|
85 |
![](./images/perp-resample.png)
|
86 |
|
87 |
+
<caption>Figure 3. Expected perplexity distributions of the sampled mc4-es after applying the Stepwise function.</caption>
|
88 |
</figure>
|
89 |
|
90 |
<figure>
|
91 |
|
92 |
![](./images/perp-resample-gaussian.png)
|
93 |
|
94 |
+
<caption>Figure 4. Expected perplexity distributions of the sampled mc4-es after applying the Gaussian function.</caption>
|
95 |
</figure>
|
96 |
|
97 |
Figure 5 shows the actual perplexity distributions of the generated 50M subsets for
|
98 |
+
each of the executed subsampling procedures. All subsets can be accessed for reproducibility purposes through the `bertin-project/mc4-es-sampled` dataset. Since the validation set was too small to extract 10% (5M) of the samples using perplexity sampling with the same `factor` and `width`, in our experiments we decided to sample from the training sets. In the `bertin-project/mc4-es-sampled` dataset, the validation set pulls its samples from the original `mc4`.
|
99 |
|
100 |
```python
|
101 |
from datasets import load_dataset
|
|
|
116 |
|
117 |
![](./images/datasets-perp.png)
|
118 |
|
119 |
+
<caption>Figure 5. Experimental perplexity distributions of the sampled mc4-es after applying Gaussian and Stepwise functions, and the Random control sample.</caption>
|
|
|
|
|
120 |
</figure>
|
121 |
|
122 |
`Random` sampling displayed the same perplexity distribution as the underlying true distribution, as can be seen in Figure 6.
|
|
|
125 |
|
126 |
![](./images/datasets-random-comparison.png)
|
127 |
|
128 |
+
<caption>Figure 6. Experimental perplexity distribution of the sampled mc4-es after applying Random sampling.</caption>
|
129 |
</figure>
|
130 |
|
131 |
|