stats section
index.html +12 -8
index.html
CHANGED
@@ -547,11 +547,11 @@
|
|
547 |
the terminal punctuation one. We validated these results with a longer run, which you will find in a plot in
|
548 |
the next section.</p>
|
549 |
<h4>A statistical approach to develop heuristic filters</h4>
|
550 |
-
<p>
|
551 |
-
|
552 |
-
|
553 |
-
minhashed version and the result from the (worse quality) full dedup
|
554 |
-
|
555 |
<p>The collected statistics ranged from common document-level
|
556 |
metrics (e.g. number of lines, avg. line/word length, etc) to inter-document repetition metrics (gopher
|
557 |
inspired). Perhaps not too surprisingly given our findings for deduplication, we found significant
|
@@ -559,12 +559,16 @@
|
|
559 |
metric (nb. of characters in duplicated lines / nb. characters), roughly doubled from the independent dedup
|
560 |
(0.0053 for 2015-22 and 0.0058 for 2013-48), to the full dedup (0.011 for 2015-22 and 0.01 for 2013-48),
|
561 |
indicating that the latter had higher inter-document repetition.</p>
|
562 |
-
<p>
|
563 |
-
|
564 |
-
|
565 |
threshold-filter pairs. In the image below, you can see 3 of these histograms.</p>
|
566 |
<figure><img src="plots/Untitled%201.png"/></figure>
|
567 |
|
568 |
<p>To assess the effectiveness of these newly created
|
569 |
filters, we conducted <strong>28B tokens </strong>ablation runs on the <strong>2019-18 crawl</strong>. Out
|
570 |
of all those runs, we identified three filters (the ones based on the histograms above) that demonstrated
|
|
|
547 |
the terminal punctuation one. We validated these results with a longer run, which you will find in a plot in
|
548 |
the next section.</p>
|
549 |
<h4>A statistical approach to develop heuristic filters</h4>
|
550 |
+
<p>Because we assumed that Full Minhash upsamples lower-quality data in the oldest dumps, we wanted to know whether
|
551 |
+
we could find heuristic filters that would remove it. To find such filters,
|
552 |
+
we collected a very large list of statistics (statistical metrics), over <strong>50</strong>, on both the independently
|
553 |
+
minhashed version and the (worse quality) full dedup result of the 2013-48 and 2015-22 crawls (two older crawls). We then compared the
|
554 |
+
statistics at a macro level, by looking at the distribution of each metric in each version.</p>
|
555 |
<p>The collected statistics ranged from common document-level
|
556 |
metrics (e.g. number of lines, avg. line/word length, etc) to inter-document repetition metrics (gopher
|
557 |
inspired). Perhaps not too surprisingly given our findings for deduplication, we found significant
|
|
|
559 |
metric (nb. of characters in duplicated lines / nb. characters), roughly doubled from the independent dedup
|
560 |
(0.0053 for 2015-22 and 0.0058 for 2013-48), to the full dedup (0.011 for 2015-22 and 0.01 for 2013-48),
|
561 |
indicating that the latter had higher inter-document repetition.</p>
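The duplicated-lines metric quoted above can be sketched for a single document as follows. This is a minimal illustration, not the code used for the numbers in the text: the function name `dup_line_char_fraction` is hypothetical, and two details are assumptions, namely that empty lines are ignored and that only the second and later occurrences of a line count as duplicated.

```python
def dup_line_char_fraction(text: str) -> float:
    """Characters in duplicated lines divided by total characters.

    Assumptions (not stated in the source): blank lines are skipped and
    the first occurrence of a line is not counted as a duplicate.
    """
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    total_chars = sum(len(line) for line in lines)
    if total_chars == 0:
        return 0.0

    seen = set()
    dup_chars = 0
    for line in lines:
        if line in seen:
            # Repeat occurrence: its characters count as duplicated.
            dup_chars += len(line)
        else:
            seen.add(line)
    return dup_chars / total_chars


if __name__ == "__main__":
    doc = "Home\nabout us\nHome\nunique"
    print(dup_line_char_fraction(doc))
```

Averaging this value over all documents of a dump would give one number per dump, comparable in spirit to the figures reported above.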
|
562 |
+
<p>To choose the metrics for filtering, we computed the Wasserstein distance between the two versions of the 2013-48 crawl for each of our metrics
|
563 |
+
and then selected the ones with the highest distance. We then inspected the histograms, empirically chose a threshold,
|
564 |
+
filtered the data, and inspected the removed documents. This process yielded 17 candidate
|
565 |
threshold-filter pairs. In the image below, you can see 3 of these histograms.</p>
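The metric-selection step can be sketched as below: compute a 1-D Wasserstein distance per metric between the two dedup versions, then sort the metrics by distance. This is a sketch under assumptions. The helper names `wasserstein_1d` and `rank_metrics` and the dictionary layout (metric name mapped to a list of per-document values) are hypothetical, and the text does not say whether the metrics were normalized before comparison, so raw values are used here.

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two empirical 1-D samples.

    Computed as the area between the two empirical CDFs.
    """
    xs, ys = sorted(xs), sorted(ys)
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")

    points = sorted(set(xs) | set(ys))
    i = j = 0
    dist = 0.0
    for left, right in zip(points, points[1:]):
        # Count how many samples are <= left in each sorted list.
        while i < len(xs) and xs[i] <= left:
            i += 1
        while j < len(ys) and ys[j] <= left:
            j += 1
        # |CDF_x - CDF_y| is constant on [left, right).
        dist += abs(i / len(xs) - j / len(ys)) * (right - left)
    return dist


def rank_metrics(stats_a, stats_b):
    """Return (metric, distance) pairs, largest distance first."""
    distances = {
        name: wasserstein_1d(stats_a[name], stats_b[name]) for name in stats_a
    }
    return sorted(distances.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    independent = {"n_lines": [10, 12, 30], "punct_ratio": [0.5, 0.6, 0.9]}
    full = {"n_lines": [3, 4, 30], "punct_ratio": [0.1, 0.6, 0.9]}
    for name, d in rank_metrics(independent, full):
        print(name, round(d, 4))
```

For real data, `scipy.stats.wasserstein_distance` computes the same quantity. Note that raw distances depend on each metric's scale, so ranking metrics with very different units may call for some normalization first.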
|
566 |
<figure><img src="plots/Untitled%201.png"/></figure>
|
567 |
|
568 |
+
<p>As an example, we inspected the histograms of the "fraction of lines ending with punctuation" metric (see the image above) and observed an increased document density for Full Minhash at a ratio of around 0.12.
|
569 |
+
We then filtered with this threshold and found that the removed data contained more short lists or consisted only of document layout text ("Home", "Sign up", etc.).
|
570 |
+
</p>
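The punctuation example above can be illustrated with a small filter. This is a minimal sketch, not the filter used in the ablations: the names `punct_line_fraction` and `keep_document` are hypothetical, the set of terminal punctuation characters is an assumption, and so is the direction of the cut (documents at or below the 0.12 ratio are dropped).

```python
# Assumption: which characters count as terminal punctuation.
TERMINAL_PUNCTUATION = (".", "!", "?", '"', "'")


def punct_line_fraction(text: str) -> float:
    """Fraction of non-empty lines that end with terminal punctuation."""
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    if not lines:
        return 0.0
    ending = sum(line.endswith(TERMINAL_PUNCTUATION) for line in lines)
    return ending / len(lines)


def keep_document(text: str, threshold: float = 0.12) -> bool:
    """Keep a document only if its ratio is above the threshold.

    Assumption: documents at or below the threshold are removed.
    """
    return punct_line_fraction(text) > threshold


if __name__ == "__main__":
    layout_only = "Home\nSign up\nAbout"
    prose = "This is a sentence.\nHere is another one!"
    print(keep_document(layout_only), keep_document(prose))
```

A page made only of navigation text scores 0 on this metric and is dropped, which matches the kind of removed documents described above.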
|
571 |
+
|
572 |
<p>To assess the effectiveness of these newly created
|
573 |
filters, we conducted <strong>28B tokens </strong>ablation runs on the <strong>2019-18 crawl</strong>. Out
|
574 |
of all those runs, we identified three filters (the ones based on the histograms above) that demonstrated
|