Spaces: HuggingFaceFW/blogpost-fineweb-v1


hynky HF Staff committed on May 29, 2024

Commit

01b9161

1 Parent(s): 31eb363

stats section


index.html +12 -8


@@ -547,11 +547,11 @@
         the terminal punctuation one. We validated these results with a longer run, which you will find in a plot in
         the next section.</p>
     <h4>A statistical approach to develop heuristic filters</h4>
-    <p>To come up with new possible filtering rules, we collected
-        a very large list of statistics (statistical metrics) — over <strong>50</strong> — from different reference
-        datasets (C4, RefinedWeb, etc) and from a select list of our processed dumps, on both the independently
-        minhashed version and the result from the (worse quality) full dedup. This allowed us to compare the
-        different datasets at a macro level, by looking at the distribution of these metrics for each one.</p>
     <p>The collected statistics ranged from common document-level
         metrics (e.g. number of lines, avg. line/word length, etc) to inter-document repetition metrics (gopher
         inspired). Perhaps not too surprisingly given our findings for deduplication, we found significant
@@ -559,12 +559,16 @@
         metric (nb. of characters in duplicated lines / nb. characters), roughly doubled from the independent dedup
         (0.0053 for 2015-22 and 0.0058 for 2013-48), to the full dedup (0.011 for 2015-22 and 0.01 for 2013-48),
         indicating that the latter had higher inter-document repetition.</p>
-    <p>Working under the assumption that these differences were
-        caused by lower quality data on the full dedup version, we inspected histograms and manually defined
-        thresholds for the metrics where these differences were starker. This process yielded 17 candidate
         threshold-filter pairs. In the image below, you can see 3 of these histograms.</p>
     <figure><img src="plots/Untitled%201.png"/></figure>
     <p>To assess the effectiveness of these newly created
         filters, we conducted <strong>28B tokens </strong>ablation runs on the <strong>2019-18 crawl</strong>. Out
         of all those runs, we identified three filters (the ones based on the histograms above) that demonstrated

         the terminal punctuation one. We validated these results with a longer run, which you will find in a plot in
         the next section.</p>
     <h4>A statistical approach to develop heuristic filters</h4>
+    <p>Because we assumed that Full Minhash upsamples lower quality data in the oldest dumps, we wanted to know whether
+        we could find heuristic filters that would remove this data. To find such filters,
+        we collected a very large list of statistical metrics (over <strong>50</strong>) on both the independently
+        minhashed version and the result from the (worse quality) full dedup of the 2013-48 and 2015-22 crawls (two older crawls). We then compared the
+        two versions at a macro level, by looking at the distribution of each metric.</p>
     <p>The collected statistics ranged from common document-level
         metrics (e.g. number of lines, avg. line/word length, etc) to inter-document repetition metrics (gopher
         inspired). Perhaps not too surprisingly given our findings for deduplication, we found significant
         metric (nb. of characters in duplicated lines / nb. characters), roughly doubled from the independent dedup
         (0.0053 for 2015-22 and 0.0058 for 2013-48), to the full dedup (0.011 for 2015-22 and 0.01 for 2013-48),
         indicating that the latter had higher inter-document repetition.</p>
+    <p>To choose the metrics for filtering, we computed the Wasserstein distance between the two versions of the 2013-48 crawl for all our metrics
+        and then selected the ones with the highest distance. We then inspected the histograms, empirically chose a threshold,
+        filtered the data, and inspected the removed documents. This process yielded 17 candidate
         threshold-filter pairs. In the image below, you can see 3 of these histograms.</p>
     <figure><img src="plots/Untitled%201.png"/></figure>
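The metric-selection step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `wasserstein_1d` and `rank_metrics` and the dictionary layout (metric name mapped to a list of per-document values) are assumptions made for the example. It computes the 1-D Wasserstein distance as the area between the two empirical CDFs using only the standard library; `scipy.stats.wasserstein_distance` computes the same quantity.

```python
from bisect import bisect_right


def wasserstein_1d(xs, ys):
    """1-D Wasserstein-1 distance: the area between the two empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    total = 0.0
    # Both CDFs are constant between consecutive sample points,
    # so the integral is a sum of rectangle areas.
    for left, right in zip(points, points[1:]):
        cdf_x = bisect_right(xs, left) / len(xs)
        cdf_y = bisect_right(ys, left) / len(ys)
        total += abs(cdf_x - cdf_y) * (right - left)
    return total


def rank_metrics(stats_a, stats_b):
    """Return metric names ordered from largest to smallest distance.

    stats_a / stats_b (hypothetical layout): metric name -> list of
    per-document values for each version of the same crawl.
    """
    distances = {
        name: wasserstein_1d(stats_a[name], stats_b[name]) for name in stats_a
    }
    return sorted(distances, key=distances.get, reverse=True)
```

Calling `rank_metrics(independent_stats, full_stats)` on two such dictionaries returns the metrics whose distributions differ most first. The raw distance depends on each metric's scale, so comparing metrics with very different ranges may require normalizing the values first; the source does not say whether this was done.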
+    <p>As an example, we inspected the histograms of the "Fraction of lines ending with punctuation" metric (see the image above) and observed an increased document density for Full Minhash at a ratio of around 0.12.
+        We then filtered with this threshold and found that the removed data contained more short lists or consisted only of document layout text ("Home", "Sign up", etc.).
+    </p>
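A threshold filter of this kind could look like the sketch below. It is hedged on several assumptions that the text does not state: the set of characters counted as terminal punctuation, the handling of empty lines, and the direction of the cut (here, documents with a ratio at or below the threshold are removed). The names `punctuation_line_ratio` and `split_by_threshold` are hypothetical.

```python
# Assumption: which characters count as "punctuation" at the end of a line.
TERMINAL_PUNCTUATION = (".", "!", "?", '"', "'")


def punctuation_line_ratio(text):
    """Fraction of non-empty lines that end with a terminal punctuation mark."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    ended = sum(line.endswith(TERMINAL_PUNCTUATION) for line in lines)
    return ended / len(lines)


def split_by_threshold(docs, threshold=0.12):
    """Split documents into (kept, removed) so the removed ones can be inspected.

    Assumption: documents with a ratio <= threshold are removed.
    """
    kept, removed = [], []
    for doc in docs:
        if punctuation_line_ratio(doc) <= threshold:
            removed.append(doc)
        else:
            kept.append(doc)
    return kept, removed
```

Returning the removed documents alongside the kept ones mirrors the manual inspection step: a navigation-only page such as `"Home\nSign up\nContact"` has a ratio of 0 and lands in `removed`.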
     <p>To assess the effectiveness of these newly created
         filters, we conducted <strong>28B tokens </strong>ablation runs on the <strong>2019-18 crawl</strong>. Out
         of all those runs, we identified three filters (the ones based on the histograms above) that demonstrated