hynky (HF Staff) committed
Commit 2a86960 · Parent(s): 977719e

update index

Files changed (1): index.html (+12 -5)
index.html CHANGED
@@ -307,7 +307,6 @@
  <figure><img src="plots/wet_comparison.png"/></figure>
  <div id="plot-wet_comparison"></div>
  </div>
-
  <h3>Base filtering</h3>
  <p>Filtering is an important part of the curation process. It
  removes part of the data (be it words, lines, or full documents) that would harm performance and is thus
@@ -359,8 +358,10 @@
  92% and 98.8% respectively ($$1-(1-s^8)^{14}$$). See the plot below for a match probability
  comparison between our setup with 112 hashes and the one from RefinedWeb, with 9000 hashes, divided into 450
  buckets of 20 hashes (that requires a substantially larger amount of compute resources, as each individual hash must be computed, stored and then compared with hashes from other documents):</p>
- <figure><img src="plots/minhash_parameters_comparison.png"/>
- </figure>
+ <div class="main-plot-container">
+ <figure><img src="plots/minhash_params.png"/></figure>
+ <div id="plot-minhash_params"></div>
+ </div>
  <p>While the high number of hash functions in RefinedWeb
  allows for a steeper, more well-defined cutoff (documents with real similarity near the threshold are more likely to be correctly identified), we believe the compute and storage savings are a reasonable
  trade-off.</p>
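
As a quick check on the formula quoted in the hunk above, the match-probability curves for both configurations can be reproduced in a few lines of Python. This is a sketch, not the original plotting code; the similarity values 0.80 and 0.85 are simply the points at which the 14x8 formula yields the quoted 92% and 98.8%.

# Sketch: P(two documents with Jaccard similarity s collide in at least one
# MinHash bucket) = 1 - (1 - s**r)**b for b buckets of r hashes each.
import numpy as np
import matplotlib.pyplot as plt

def match_probability(s, buckets, hashes_per_bucket):
    return 1 - (1 - s**hashes_per_bucket) ** buckets

# sanity check against the numbers quoted above for the 14x8 (112 hashes) setup
print(match_probability(0.80, 14, 8))   # ~0.92
print(match_probability(0.85, 14, 8))   # ~0.988

s = np.linspace(0, 1, 500)
plt.plot(s, match_probability(s, 14, 8), label="14 buckets x 8 hashes (ours)")
plt.plot(s, match_probability(s, 450, 20), label="450 buckets x 20 hashes (RefinedWeb)")
plt.xlabel("true document similarity s")
plt.ylabel("probability of being flagged as duplicates")
plt.legend()
plt.show()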
@@ -474,7 +475,10 @@
  <p>We then simulated uniformly sampling documents from this
  entire dataset of 20 trillion tokens, to obtain subsets of 1B, 10B, 100B, 350B and 1T tokens. In the image
  below you can see how often each document would be repeated.</p>
- <figure><img src="plots/dedup_impact_simulation.png"/></figure>
+ <div class="main-plot-container">
+ <figure><img src="plots/duplicates_simul.png"/></figure>
+ <div id="plot-duplicates-simul"></div>
+ </div>
  <p>For 1B almost all documents would be unique
  (#duplicates=1), despite the fact that in the entire dataset each document is repeated 100 times (once per
  dump). We start seeing some changes at the 100B scale (0.5% of the total dataset), with a large number of
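
The sampling experiment in the hunk above can be approximated at reduced scale. This sketch assumes that the number of copies of a given document landing in a uniform subset covering a fraction f of the 20T tokens is roughly Binomial(100, f); that is a modeling shortcut, not the original simulation code.

# Reduced-scale sketch of the uniform-sampling simulation (assumptions noted above)
import numpy as np

rng = np.random.default_rng(0)
n_unique_docs = 1_000_000        # scaled down; only the sampled fractions matter
copies_per_doc = 100             # each document appears once per dump

total_tokens = 20e12
for subset_tokens in [1e9, 10e9, 100e9, 350e9, 1e12]:
    f = subset_tokens / total_tokens
    draws = rng.binomial(copies_per_doc, f, size=n_unique_docs)
    draws = draws[draws > 0]     # documents that made it into the subset at all
    print(f"{subset_tokens / 1e9:>5.0f}B tokens: "
          f"{(draws == 1).mean():6.1%} of sampled documents appear exactly once, "
          f"max #duplicates = {draws.max()}")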
@@ -587,7 +591,10 @@
  and then select the ones with the highest distance. We would then inspect the histograms, empirically choose a threshold,
  filter the data, and inspect the removed documents. This process yielded 17 candidate
  threshold-filter pairs. In the image below, you can see 3 of these histograms.</p>
- <figure><img src="plots/Untitled%201.png"/></figure>
+ <div class="main-plot-container">
+ <figure><img src="plots/custom_filters.png"/></figure>
+ <div id="plot-stats"></div>
+ </div>

  <p>As an example, we inspected the histograms of the "fraction of lines ending with punctuation" metric (see the image above) and observed the increased document density of Full MinHash at around a 0.12 ratio.
  We then filtered with this threshold and found that the removed data had a higher amount of short lists or consisted of only document layout text ("Home", "Sign up", etc.).
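
For the "fraction of lines ending with punctuation" example discussed above, a minimal sketch of the resulting threshold filter might look as follows; the punctuation set and line-splitting rules here are assumptions, not the pipeline's exact implementation.

# Sketch of one candidate threshold-filter pair: fraction of lines ending with
# punctuation, with a 0.12 cutoff (punctuation set and line splitting are assumed)
TERMINAL_PUNCTUATION = (".", "!", "?", '"', "'")   # assumed set
THRESHOLD = 0.12

def punctuation_line_ratio(text: str) -> float:
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    return sum(line.endswith(TERMINAL_PUNCTUATION) for line in lines) / len(lines)

def keep_document(text: str) -> bool:
    # documents below the threshold tend to be short lists or layout-only text
    return punctuation_line_ratio(text) >= THRESHOLD

print(keep_document("Home\nSign up\nLog in"))                      # False: layout text
print(keep_document("A normal sentence.\nAnother full sentence."))  # True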
 