hynky HF staff committed on
Commit
01b9161
·
1 Parent(s): 31eb363

stats section

Files changed (1)
  1. index.html +12 -8
index.html CHANGED
@@ -547,11 +547,11 @@
547
  the terminal punctuation one. We validated these results with a longer run, which you will find in a plot in
548
  the next section.</p>
549
  <h4>A statistical approach to develop heuristic filters</h4>
550
- <p>To come up with new possible filtering rules, we collected
551
- a very large list of statistics (statistical metrics), over <strong>50</strong>, from different reference
552
- datasets (C4, RefinedWeb, etc) and from a select list of our processed dumps, on both the independently
553
- minhashed version and the result from the (worse quality) full dedup. This allowed us to compare the
554
- different datasets at a macro level, by looking at the distribution of these metrics for each one.</p>
555
  <p>The collected statistics ranged from common document-level
556
  metrics (e.g. number of lines, avg. line/word length, etc) to inter-document repetition metrics (gopher
557
  inspired). Perhaps not too surprisingly given our findings for deduplication, we found significant
@@ -559,12 +559,16 @@
559
  metric (nb. of characters in duplicated lines / nb. characters), roughly doubled from the independent dedup
560
  (0.0053 for 2015-22 and 0.0058 for 2013-48), to the full dedup (0.011 for 2015-22 and 0.01 for 2013-48),
561
  indicating that the latter had higher inter-document repetition.</p>
562
- <p>Working under the assumption that these differences were
563
- caused by lower quality data on the full dedup version, we inspected histograms and manually defined
564
- thresholds for the metrics where these differences were starker. This process yielded 17 candidate
565
  threshold-filter pairs. In the image below, you can see 3 of these histograms.</p>
566
  <figure><img src="plots/Untitled%201.png"/></figure>
567
 
 
 
 
 
568
  <p>To assess the effectiveness of these newly created
569
  filters, we conducted <strong>28B tokens </strong>ablation runs on the <strong>2019-18 crawl</strong>. Out
570
  of all those runs, we identified three filters (the ones based on the histograms above) that demonstrated
 
547
  the terminal punctuation one. We validated these results with a longer run, which you will find in a plot in
548
  the next section.</p>
549
  <h4>A statistical approach to develop heuristic filters</h4>
550
+ <p>Because we assumed that Full Minhash upsamples lower quality data in the oldest dumps, we wanted to know whether
551
+ we could find heuristic filters that would remove this data. To find such filters,
552
+ we collected a large set of statistical metrics, over <strong>50</strong>, on both the independently
553
+ minhashed version and the (worse quality) full dedup version of the older 2013-48 and 2015-22 crawls. We then compared the
554
+ two versions at a macro level, by looking at the distribution of each metric.</p>
555
  <p>The collected statistics ranged from common document-level
556
  metrics (e.g. number of lines, avg. line/word length, etc) to inter-document repetition metrics (gopher
557
  inspired). Perhaps not too surprisingly given our findings for deduplication, we found significant
 
559
  metric (nb. of characters in duplicated lines / nb. characters), roughly doubled from the independent dedup
560
  (0.0053 for 2015-22 and 0.0058 for 2013-48), to the full dedup (0.011 for 2015-22 and 0.01 for 2013-48),
561
  indicating that the latter had higher inter-document repetition.</p>
562
+ <p>To choose the metrics for filtering, we computed the Wasserstein distance between the two versions of the 2013-48 crawl for all our metrics
563
+ and then selected the ones with the highest distance. We then inspected the histograms, empirically chose a threshold,
564
+ filtered the data and inspected the removed documents. This process yielded 17 candidate
565
  threshold-filter pairs. In the image below, you can see 3 of these histograms.</p>
566
  <figure><img src="plots/Untitled%201.png"/></figure>
567
 
568
+ <p>As an example, we inspected the histograms of the "fraction of lines ending with punctuation" metric (see the image above) and observed an increased document density for Full Minhash at a ratio of around 0.12.
569
+ We then filtered with this threshold and found that the removed data contained more short lists or consisted only of document layout text ("Home", "Sign up", etc.).
570
+ </p>
571
+
572
  <p>To assess the effectiveness of these newly created
573
  filters, we conducted <strong>28B tokens </strong>ablation runs on the <strong>2019-18 crawl</strong>. Out
574
  of all those runs, we identified three filters (the ones based on the histograms above) that demonstrated
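The selection procedure described in the added paragraphs (rank metrics by Wasserstein distance between the two dedup versions, then threshold one metric) can be illustrated with a minimal sketch. This is not the code behind the commit. The function names, the set of terminal punctuation marks, and the choice to drop documents at or below the threshold are assumptions made for illustration. Only the 0.12 value and the metric name come from the text. The distance is computed on raw metric values in pure Python, so metrics on larger scales will naturally score higher.

```python
from bisect import bisect_right

# Assumption: which characters count as "terminal punctuation" is not given in the text.
TERMINAL_PUNCTUATION = (".", "!", "?", '"', "'")


def wasserstein_1d(u, v):
    """Empirical 1-D Wasserstein-1 distance: area between the two empirical CDFs."""
    if not u or not v:
        raise ValueError("both samples must be non-empty")
    u_sorted, v_sorted = sorted(u), sorted(v)
    support = sorted(set(u_sorted) | set(v_sorted))
    distance = 0.0
    for left, right in zip(support, support[1:]):
        cdf_u = bisect_right(u_sorted, left) / len(u_sorted)
        cdf_v = bisect_right(v_sorted, left) / len(v_sorted)
        distance += abs(cdf_u - cdf_v) * (right - left)
    return distance


def rank_metrics(independent, full):
    """Sort metric names by distance between the two dedup versions, largest first.

    Both arguments map a metric name to a list of per-document values.
    """
    distances = {
        name: wasserstein_1d(independent[name], full[name]) for name in independent
    }
    return sorted(distances.items(), key=lambda item: item[1], reverse=True)


def punctuation_line_fraction(text):
    """Fraction of non-empty lines that end with a terminal punctuation mark."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    return sum(line.endswith(TERMINAL_PUNCTUATION) for line in lines) / len(lines)


def keep_document(text, threshold=0.12):
    """Assumed filter direction: drop documents at or below the threshold."""
    return punctuation_line_fraction(text) > threshold


if __name__ == "__main__":
    # Toy per-document metric values; real inputs would come from the two crawl versions.
    independent = {"metric_a": [0.0, 0.0], "metric_b": [0.0, 1.0]}
    full = {"metric_a": [1.0, 1.0], "metric_b": [0.0, 1.0]}
    print(rank_metrics(independent, full))
    print(keep_document("Home\nSign up\nAbout"))
```

In this toy input, `metric_a` differs between the two versions while `metric_b` does not, so `metric_a` ranks first. A navigation-only document such as "Home / Sign up / About" has no lines ending in punctuation and is dropped under the assumed filter direction.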