Pablogps committed

Commit f37d879 · 1 Parent(s): dd37bca

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -94,7 +94,7 @@ We adjusted the `factor` parameter of the `Stepwise` function, and the `factor`
  <caption>Figure 4. Expected perplexity distributions of the sample mc4-es after applying Gaussian function.</caption>
  </figure>
 
- Figure 5 shows the actual perplexity distributions of the generated 50M subsets for each of the executed subsampling procedures. All subsets can be easily accessed for reproducibility purposes using the `bertin-project/mc4-es-sampled` dataset. We adjusted our subsampling parameters so that we would sample ~50M elements from the original train split. However, when these parameters are applied to the validation split they result in too few examples, so we created a split from our own train dataset instead. Therefore, in `bertin-project/mc4-es-sampled dataset` train splits contain ~45M examples, while there are ~5M for validation.
+ Figure 5 shows the actual perplexity distributions of the generated 50M subsets for each of the executed subsampling procedures. All subsets can be easily accessed for reproducibility purposes using the `bertin-project/mc4-es-sampled` dataset. We adjusted our subsampling parameters so that we would sample ~50M examples from the original train split in mC4. However, when these parameters were applied to the validation split they resulted in too few examples (~400k samples). Therefore, for validation purposes, we extracted 50k samples from our own train dataset on the fly at each evaluation step. In the `bertin-project/mc4-es-sampled` dataset, the train split contains the full 50M samples, while validation is retrieved as-is from the original `mc4`.
 
  ```python
  from datasets import load_dataset
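
For readers who want to reproduce the subsets referenced in the updated paragraph, the sketch below shows one way to load them with the `datasets` library. It is not part of this commit; the `"gaussian"` configuration name and the `"text"` field are assumptions, so check the `bertin-project/mc4-es-sampled` dataset card for the exact configuration names.

```python
# Minimal sketch (not part of this commit) of loading one of the sampled subsets.
# The "gaussian" config name is an assumption; see the dataset card of
# bertin-project/mc4-es-sampled for the exact configuration names.
from datasets import load_dataset

# Stream the train split so the ~50M examples are not downloaded all at once.
mc4es = load_dataset(
    "bertin-project/mc4-es-sampled",
    "gaussian",
    split="train",
    streaming=True,
)

# Inspect the first example; mC4 documents expose a "text" field.
first = next(iter(mc4es))
print(first["text"][:200])
```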