massivetext nit and a _blank
index.html CHANGED (+6 -3)
@@ -326,7 +326,7 @@
 </li>
 </ul>
 <ul>
-<li>Applied quality and repetition filters from
+<li>Applied quality and repetition filters from MassiveText<d-cite bibtex-key="rae2022scaling"></d-cite> (using the default thresholds)
 </li>
 </ul>
 <p>After applying this filtering to each of the text
@@ -577,7 +577,7 @@
 minhashed version and the result from the (worse quality) full dedup from 2013-48 and 2015-22 crawls (older crawls). We then compared the
 statistics at a macro level, by looking at the distribution of these metrics for each one.</p>
 <p>The collected statistics ranged from common document-level
-metrics (e.g. number of lines, avg. line/word length, etc) to inter-document repetition metrics (
+metrics (e.g. number of lines, avg. line/word length, etc) to inter-document repetition metrics (MassiveText
 inspired). Perhaps not too surprisingly given our findings for deduplication, we found significant
 disparities in most of the metrics for the two deduplication methods. For instance, the <code>line-char-duplicates</code>
 metric (nb. of characters in duplicated lines / nb. characters), roughly doubled from the independent dedup
@@ -604,7 +604,7 @@
 </ul>
 <ul>
 <li>Remove documents where the fraction of characters in duplicated lines ≥ 0.1
-(12.47% of tokens removed) — the original
+(12.47% of tokens removed) — the original MassiveText threshold for this ratio is ≥ 0.2
 </li>
 </ul>
 <ul>
@@ -767,5 +767,8 @@
 toc.innerHTML = ToC;
 toc.setAttribute('prerendered', 'true');
 }
+document.querySelectorAll('a[href^="https://huggingface.co"]').forEach(function(link) {
+link.setAttribute('target', '_blank');
+});
 </script>
 </body>
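For readers of the third hunk, the `line-char-duplicates` ratio and the two thresholds it mentions (0.1 here, 0.2 as the original MassiveText value) can be illustrated with a minimal sketch. This is not the pipeline's actual implementation. The function names `lineCharDuplicateFraction` and `keepDocument` are hypothetical, and the sketch assumes that a line counts as duplicated from its second occurrence onward, that lines are compared after trimming whitespace, and that the denominator is the total number of characters in non-empty lines.

```javascript
// Hypothetical helper: fraction of characters that sit in duplicated lines.
// Assumptions (not taken from the diff): a line is "duplicated" from its
// second occurrence onward, lines are trimmed before comparison, and empty
// lines are ignored in both numerator and denominator.
function lineCharDuplicateFraction(text) {
  const lines = text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);

  const totalChars = lines.reduce((sum, line) => sum + line.length, 0);
  if (totalChars === 0) return 0;

  const seen = new Set();
  let duplicatedChars = 0;
  for (const line of lines) {
    if (seen.has(line)) {
      duplicatedChars += line.length; // repeated occurrence of a seen line
    } else {
      seen.add(line);
    }
  }
  return duplicatedChars / totalChars;
}

// Hypothetical filter: a document is removed when the ratio is >= threshold.
function keepDocument(text, threshold = 0.1) {
  return lineCharDuplicateFraction(text) < threshold;
}

// A document with 3 duplicated characters out of 20 (ratio 0.15).
const sample = ["aaa", "aaa", "bbbbbbbbbbbbbb"].join("\n");
console.log(lineCharDuplicateFraction(sample));
console.log(keepDocument(sample, 0.1), keepDocument(sample, 0.2));
```

Under these assumptions, the sample document is removed at the 0.1 threshold and kept at 0.2, which is the kind of difference the note added to line 607 is pointing at.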