Update README.md
README.md
CHANGED
@@ -94,7 +94,7 @@ We adjusted the `factor` parameter of the `Stepwise` function, and the `factor`
 <caption>Figure 4. Expected perplexity distributions of the sample mc4-es after applying Gaussian function.</caption>
 </figure>

-Figure 5 shows the actual perplexity distributions of the generated 50M subsets for each of the executed subsampling procedures. All subsets can be easily accessed for reproducibility purposes using the `bertin-project/mc4-es-sampled` dataset. We adjusted our subsampling parameters so that we would sample ~50M
+Figure 5 shows the actual perplexity distributions of the generated 50M subsets for each of the executed subsampling procedures. All subsets can be easily accessed for reproducibility purposes using the `bertin-project/mc4-es-sampled` dataset. We adjusted our subsampling parameters so that we would sample ~50M examples from the original train split in mC4. However, when these parameters were applied to the validation split, they resulted in too few examples (~400k samples). Therefore, for validation purposes, we extracted 50k samples at each evaluation step from our own train dataset on the fly. In the `bertin-project/mc4-es-sampled` dataset, the train split contains the full 50M samples, while validation is retrieved as is from the original `mc4`.

 ```python
 from datasets import load_dataset