update index
index.html CHANGED: +12 -5
@@ -307,7 +307,6 @@
<figure><img src="plots/wet_comparison.png"/></figure>
<div id="plot-wet_comparison"></div>
</div>
-
<h3>Base filtering</h3>
<p>Filtering is an important part of the curation process. It
removes part of the data (be it words, lines, or full documents) that would harm performance and is thus
@@ -359,8 +358,10 @@
92% and 98.8% respectively ($$1-(1-s^8)^{14}$$). See the plot below for a match probability
comparison between our setup with 112 hashes and the one from RefinedWeb, with 9000 hashes, divided into 450
buckets of 20 hashes (that requires a substantially larger amount of compute resources, as each individual hash must be computed, stored and then compared with hashes from other documents):</p>
-<
-
+<div class="main-plot-container">
+<figure><img src="plots/minhash_params.png"/></figure>
+<div id="plot-minhash_params"></div>
+</div>
<p>While the high number of hash functions in RefinedWeb
allows for a steeper, more well defined cut off (documents with real similarity near the threshold are more likely to be correctly identified), we believe the compute and storage savings are a reasonable
trade off.</p>
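Note (reviewer sketch, not part of this commit): the plot added in this hunk compares the two parameterizations of the match-probability formula quoted above, which for r buckets of b hashes each is $$1-(1-s^b)^r$$. A minimal Python sketch of that comparison (function and variable names are ours):

    # MinHash LSH: probability that two documents with true similarity s share
    # at least one full bucket of hashes, for r buckets of b hashes each.
    def match_probability(s: float, hashes_per_bucket: int, buckets: int) -> float:
        return 1.0 - (1.0 - s ** hashes_per_bucket) ** buckets

    for s in (0.7, 0.75, 0.8, 0.85, 0.9):
        ours = match_probability(s, hashes_per_bucket=8, buckets=14)          # 112 hashes (our setup)
        refinedweb = match_probability(s, hashes_per_bucket=20, buckets=450)  # 9000 hashes (RefinedWeb)
        print(f"s={s:.2f}  112 hashes: {ours:.3f}  9000 hashes: {refinedweb:.3f}")

At s = 0.8 and s = 0.85 the 112-hash setup gives roughly the 92% and 98.8% quoted above, while the 9000-hash curve rises much more sharply around those similarities.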
@@ -474,7 +475,10 @@
<p>We then simulated uniformly sampling documents from this
entire dataset of 20 trillion tokens, to obtain subsets of 1B, 10B, 100B, 350B and 1T tokens. In the image
below you can see how often each document would be repeated.</p>
-<
+<div class="main-plot-container">
+<figure><img src="plots/duplicates_simul.png"/></figure>
+<div id="plot-duplicates-simul"></div>
+</div>
<p>For 1B almost all documents would be unique
(#duplicates=1), despite the fact that in the entire dataset each document is repeated 100 times (once per
dump). We start seeing some changes at the 100B scale (0.5% of the total dataset), with a large number of
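Note (reviewer sketch, not part of this commit): a scaled-down illustration of the sampling simulation described in this hunk — every unique document duplicated once per dump (100 copies), a fraction of the corpus sampled uniformly, and the number of times each document ends up in the sample counted. The document counts below are made-up stand-ins for the real 20-trillion-token corpus:

    import random
    from collections import Counter

    def duplicate_histogram(num_unique_docs: int, copies_per_doc: int,
                            sample_fraction: float, seed: int = 0) -> Counter:
        rng = random.Random(seed)
        total_copies = num_unique_docs * copies_per_doc
        sample_size = int(total_copies * sample_fraction)
        # sample copies uniformly without replacement, map each copy back to its document
        picked = rng.sample(range(total_copies), sample_size)
        per_doc = Counter(copy_index // copies_per_doc for copy_index in picked)
        return Counter(per_doc.values())  # maps #duplicates -> number of documents

    # 0.5% of the corpus, i.e. the "100B tokens out of 20T" regime from the text
    print(duplicate_histogram(num_unique_docs=200_000, copies_per_doc=100, sample_fraction=0.005))

Sampling copy indices without replacement keeps the sketch exact rather than relying on a Poisson approximation.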
@@ -587,7 +591,10 @@
and then select the ones with the highest distance. We would then inspect the histograms, empirically choose a threshold
and filter the data and inspect the removed documents. This process yielded 17 candidate
threshold-filter pairs. In the image below, you can see 3 of these histograms.</p>
-<
+<div class="main-plot-container">
+<figure><img src="plots/custom_filters.png"/></figure>
+<div id="plot-stats"></div>
+</div>

<p>As an example, we inspected the histograms of the "Fraction of lines ending with punctuation" metric (see the image above) and observed the increased document density of Full Minhash at around a 0.12 ratio.
We then filtered with this threshold and found out that the removed data had a higher amount of short lists or consisted of only document layout text (Home Sign up etc...).
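Note (reviewer sketch, not part of this commit): the threshold-filter pattern discussed in this hunk, shown for the fraction-of-lines-ending-with-punctuation metric with the empirically chosen 0.12 cutoff mentioned above. The exact punctuation set and the keep/remove direction are assumptions for illustration, not the production filter:

    # Fraction of lines ending with punctuation, used as a document-level filter.
    # The punctuation characters below are an assumption for illustration.
    END_PUNCTUATION = (".", "!", "?", '"', "'")

    def fraction_lines_ending_with_punctuation(text: str) -> float:
        lines = [line.strip() for line in text.split("\n") if line.strip()]
        if not lines:
            return 0.0
        return sum(line.endswith(END_PUNCTUATION) for line in lines) / len(lines)

    def keep_document(text: str, threshold: float = 0.12) -> bool:
        # documents below the threshold (short lists, navigation/layout text such
        # as "Home", "Sign up", ...) are the ones removed in the inspection above
        return fraction_lines_ending_with_punctuation(text) >= threshold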