Pablogps committed
Commit dd37bca · 1 Parent(s): 4382b95

Update README.md

Files changed (1): README.md (+1 -1)
README.md CHANGED
@@ -94,7 +94,7 @@ We adjusted the `factor` parameter of the `Stepwise` function, and the `factor`
  <caption>Figure 4. Expected perplexity distributions of the sample mc4-es after applying Gaussian function.</caption>
  </figure>
 
- Figure 5 shows the actual perplexity distributions of the generated 50M subsets for each of the executed subsampling procedures. All subsets can be easily accessed for reproducibility purposes using the `bertin-project/mc4-es-sampled` dataset. Since the `validation` set was too small to extract a 10% (5M) of the samples using perplexity-sampling with the same `factor` and `width`, in our experiments we decided to sample from the training sets. In the `bertin-project/mc4-es-sampled dataset`, the validation set pulls the samples from the original `mc4`.
+ Figure 5 shows the actual perplexity distributions of the generated 50M subsets for each of the executed subsampling procedures. All subsets can be easily accessed for reproducibility purposes using the `bertin-project/mc4-es-sampled` dataset. We adjusted our subsampling parameters so that we would sample ~50M elements from the original train split. However, applying these parameters to the validation split yields too few examples, so we created a validation split from our own train data instead. Therefore, in the `bertin-project/mc4-es-sampled` dataset, train splits contain ~45M examples, while validation contains ~5M.
 
  ```python
  from datasets import load_dataset