Pablogps committed
Commit dd37bca · 1 Parent(s): 4382b95

Update README.md

Files changed (1): README.md (+1 -1)
README.md CHANGED
@@ -94,7 +94,7 @@ We adjusted the `factor` parameter of the `Stepwise` function, and the `factor`
  <caption>Figure 4. Expected perplexity distributions of the sample mc4-es after applying Gaussian function.</caption>
  </figure>
 
- Figure 5 shows the actual perplexity distributions of the generated 50M subsets for each of the executed subsampling procedures. All subsets can be easily accessed for reproducibility purposes using the `bertin-project/mc4-es-sampled` dataset. Since the `validation` set was too small to extract a 10% (5M) of the samples using perplexity-sampling with the same `factor` and `width`, in our experiments we decided to sample from the training sets. In the `bertin-project/mc4-es-sampled dataset`, the validation set pulls the samples from the original `mc4`.
+ Figure 5 shows the actual perplexity distributions of the generated 50M subsets for each of the executed subsampling procedures. All subsets can be easily accessed for reproducibility purposes using the `bertin-project/mc4-es-sampled` dataset. We adjusted our subsampling parameters so that we would sample ~50M elements from the original train split. However, applying these parameters to the validation split yields too few examples, so we created a validation split from our own train data instead. Therefore, in the `bertin-project/mc4-es-sampled` dataset, train splits contain ~45M examples, while validation contains ~5M.
 
  ```python
  from datasets import load_dataset