Pablogps committed

Commit f37d879 · 1 Parent(s): dd37bca

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -94,7 +94,7 @@ We adjusted the `factor` parameter of the `Stepwise` function, and the `factor`
  <caption>Figure 4. Expected perplexity distributions of the sample mc4-es after applying Gaussian function.</caption>
  </figure>
 
- Figure 5 shows the actual perplexity distributions of the generated 50M subsets for each of the executed subsampling procedures. All subsets can be easily accessed for reproducibility purposes using the `bertin-project/mc4-es-sampled` dataset. We adjusted our subsampling parameters so that we would sample ~50M elements from the original train split. However, when these parameters are applied to the validation split they result in too few examples, so we created a split from our own train dataset instead. Therefore, in `bertin-project/mc4-es-sampled dataset` train splits contain ~45M examples, while there are ~5M for validation.
+ Figure 5 shows the actual perplexity distributions of the generated 50M subsets for each of the executed subsampling procedures. All subsets can be easily accessed for reproducibility purposes using the `bertin-project/mc4-es-sampled` dataset. We adjusted our subsampling parameters so that we would sample ~50M examples from the original train split in mC4. However, when these parameters were applied to the validation split they resulted in too few examples (~400k samples). Therefore, for validation purposes, we extracted 50k samples from our own train dataset on the fly at each evaluation step. In the `bertin-project/mc4-es-sampled` dataset, the train split contains the full 50M samples, while validation is retrieved as-is from the original `mc4`.
 
  ```python
  from datasets import load_dataset
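
For readers who want to reproduce the subsets referenced in the updated paragraph, the sketch below shows one way to load them with the `datasets` library. It is not part of this commit; the `"gaussian"` configuration name and the `"text"` field are assumptions, so check the `bertin-project/mc4-es-sampled` dataset card for the exact configuration names.

```python
# Minimal sketch (not part of this commit) of loading one of the sampled subsets.
# The "gaussian" config name is an assumption; see the dataset card of
# bertin-project/mc4-es-sampled for the exact configuration names.
from datasets import load_dataset

# Stream the train split so the ~50M examples are not downloaded all at once.
mc4es = load_dataset(
    "bertin-project/mc4-es-sampled",
    "gaussian",
    split="train",
    streaming=True,
)

# Inspect the first example; mC4 documents expose a "text" field.
first = next(iter(mc4es))
print(first["text"][:200])
```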