update index
index.html CHANGED: +12 -5
@@ -307,7 +307,6 @@
<figure><img src="plots/wet_comparison.png"/></figure>
<div id="plot-wet_comparison"></div>
</div>
-
<h3>Base filtering</h3>
<p>Filtering is an important part of the curation process. It
removes part of the data (be it words, lines, or full documents) that would harm performance and is thus
@@ -359,8 +358,10 @@
92% and 98.8% respectively ($$1-(1-s^8)^{14}$$). See the plot below for a match probability
comparison between our setup with 112 hashes and the one from RefinedWeb, with 9000 hashes, divided into 450
buckets of 20 hashes (that requires a substantially larger amount of compute resources, as each individual hash must be computed, stored and then compared with hashes from other documents):</p>
-<
-
+<div class="main-plot-container">
+<figure><img src="plots/minhash_params.png"/></figure>
+<div id="plot-minhash_params"></div>
+</div>
<p>While the high number of hash functions in RefinedWeb
allows for a steeper, more well defined cut off (documents with real similarity near the threshold are more likely to be correctly identified), we believe the compute and storage savings are a reasonable
trade off.</p>
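Note (reviewer sketch, not part of this commit): the plot added in this hunk compares the two parameterizations of the match-probability formula quoted above, which for r buckets of b hashes each is $$1-(1-s^b)^r$$. A minimal Python sketch of that comparison (function and variable names are ours):

    # MinHash LSH: probability that two documents with true similarity s share
    # at least one full bucket of hashes, for r buckets of b hashes each.
    def match_probability(s: float, hashes_per_bucket: int, buckets: int) -> float:
        return 1.0 - (1.0 - s ** hashes_per_bucket) ** buckets

    for s in (0.7, 0.75, 0.8, 0.85, 0.9):
        ours = match_probability(s, hashes_per_bucket=8, buckets=14)          # 112 hashes (our setup)
        refinedweb = match_probability(s, hashes_per_bucket=20, buckets=450)  # 9000 hashes (RefinedWeb)
        print(f"s={s:.2f}  112 hashes: {ours:.3f}  9000 hashes: {refinedweb:.3f}")

At s = 0.8 and s = 0.85 the 112-hash setup gives roughly the 92% and 98.8% quoted above, while the 9000-hash curve rises much more sharply around those similarities.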
@@ -474,7 +475,10 @@
<p>We then simulated uniformly sampling documents from this
entire dataset of 20 trillion tokens, to obtain subsets of 1B, 10B, 100B, 350B and 1T tokens. In the image
below you can see how often each document would be repeated.</p>
-<
+<div class="main-plot-container">
+<figure><img src="plots/duplicates_simul.png"/></figure>
+<div id="plot-duplicates-simul"></div>
+</div>
<p>For 1B almost all documents would be unique
(#duplicates=1), despite the fact that in the entire dataset each document is repeated 100 times (once per
dump). We start seeing some changes at the 100B scale (0.5% of the total dataset), with a large number of
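Note (reviewer sketch, not part of this commit): a scaled-down illustration of the sampling simulation described in this hunk — every unique document duplicated once per dump (100 copies), a fraction of the corpus sampled uniformly, and the number of times each document ends up in the sample counted. The document counts below are made-up stand-ins for the real 20-trillion-token corpus:

    import random
    from collections import Counter

    def duplicate_histogram(num_unique_docs: int, copies_per_doc: int,
                            sample_fraction: float, seed: int = 0) -> Counter:
        rng = random.Random(seed)
        total_copies = num_unique_docs * copies_per_doc
        sample_size = int(total_copies * sample_fraction)
        # sample copies uniformly without replacement, map each copy back to its document
        picked = rng.sample(range(total_copies), sample_size)
        per_doc = Counter(copy_index // copies_per_doc for copy_index in picked)
        return Counter(per_doc.values())  # maps #duplicates -> number of documents

    # 0.5% of the corpus, i.e. the "100B tokens out of 20T" regime from the text
    print(duplicate_histogram(num_unique_docs=200_000, copies_per_doc=100, sample_fraction=0.005))

Sampling copy indices without replacement keeps the sketch exact rather than relying on a Poisson approximation.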
@@ -587,7 +591,10 @@
and then select the ones with the highest distance. We would then inspect the histograms, empirically choose a threshold
and filter the data and inspect the removed documents. This process yielded 17 candidate
threshold-filter pairs. In the image below, you can see 3 of these histograms.</p>
-<
+<div class="main-plot-container">
+<figure><img src="plots/custom_filters.png"/></figure>
+<div id="plot-stats"></div>
+</div>

<p>As an example, we inspected the histograms of the "Fraction of lines ending with punctuation" metric (see the image above) and observed the increased document density of Full Minhash at around a 0.12 ratio.
We then filtered with this threshold and found out that the removed data had a higher amount of short lists or consisted of only document layout text (Home Sign up etc...).
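Note (reviewer sketch, not part of this commit): the threshold-filter pattern discussed in this hunk, shown for the fraction-of-lines-ending-with-punctuation metric with the empirically chosen 0.12 cutoff mentioned above. The exact punctuation set and the keep/remove direction are assumptions for illustration, not the production filter:

    # Fraction of lines ending with punctuation, used as a document-level filter.
    # The punctuation characters below are an assumption for illustration.
    END_PUNCTUATION = (".", "!", "?", '"', "'")

    def fraction_lines_ending_with_punctuation(text: str) -> float:
        lines = [line.strip() for line in text.split("\n") if line.strip()]
        if not lines:
            return 0.0
        return sum(line.endswith(END_PUNCTUATION) for line in lines) / len(lines)

    def keep_document(text: str, threshold: float = 0.12) -> bool:
        # documents below the threshold (short lists, navigation/layout text such
        # as "Home", "Sign up", ...) are the ones removed in the inspection above
        return fraction_lines_ending_with_punctuation(text) >= threshold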