guipenedo HF Staff committed on
Commit
d8b9b7e
·
1 Parent(s): de35e29

massivetext nit and a _blank

Files changed (1)
  1. index.html +6 -3
index.html CHANGED
@@ -326,7 +326,7 @@
326
  </li>
327
  </ul>
328
  <ul>
329
- <li>Applied quality and repetition filters from the Gopher<d-cite bibtex-key="rae2022scaling"></d-cite> paper (using the default thresholds)
330
  </li>
331
  </ul>
332
  <p>After applying this filtering to each of the text
@@ -577,7 +577,7 @@
577
  minhashed version and the result from the (worse quality) full dedup from 2013-48 and 2015-22 crawls (older crawls). We then compared the
578
  statistics at a macro level, by looking at the distribution of these metrics for each one.</p>
579
  <p>The collected statistics ranged from common document-level
580
- metrics (e.g. number of lines, avg. line/word length, etc) to inter-document repetition metrics (gopher
581
  inspired). Perhaps not too surprisingly given our findings for deduplication, we found significant
582
  disparities in most of the metrics for the two deduplication methods. For instance, the <code>line-char-duplicates</code>
583
  metric (nb. of characters in duplicated lines / nb. characters), roughly doubled from the independent dedup
@@ -604,7 +604,7 @@
604
  </ul>
605
  <ul>
606
  <li>Remove documents where the fraction of characters in duplicated lines ≥ 0.1
607
- (12.47% of tokens removed) — the original Gopher threshold for this ratio is ≥ 0.2
608
  </li>
609
  </ul>
610
  <ul>
@@ -767,5 +767,8 @@
767
  toc.innerHTML = ToC;
768
  toc.setAttribute('prerendered', 'true');
769
  }
 
 
 
770
  </script>
771
  </body>
 
326
  </li>
327
  </ul>
328
  <ul>
329
+ <li>Applied quality and repetition filters from MassiveText<d-cite bibtex-key="rae2022scaling"></d-cite> (using the default thresholds)
330
  </li>
331
  </ul>
332
  <p>After applying this filtering to each of the text
 
577
  minhashed version and the result from the (worse quality) full dedup from 2013-48 and 2015-22 crawls (older crawls). We then compared the
578
  statistics at a macro level, by looking at the distribution of these metrics for each one.</p>
579
  <p>The collected statistics ranged from common document-level
580
+ metrics (e.g. number of lines, avg. line/word length, etc) to inter-document repetition metrics (MassiveText
581
  inspired). Perhaps not too surprisingly given our findings for deduplication, we found significant
582
  disparities in most of the metrics for the two deduplication methods. For instance, the <code>line-char-duplicates</code>
583
  metric (nb. of characters in duplicated lines / nb. characters), roughly doubled from the independent dedup
 
604
  </ul>
605
  <ul>
606
  <li>Remove documents where the fraction of characters in duplicated lines ≥ 0.1
607
+ (12.47% of tokens removed) — the original MassiveText threshold for this ratio is ≥ 0.2
608
  </li>
609
  </ul>
610
  <ul>
 
767
  toc.innerHTML = ToC;
768
  toc.setAttribute('prerendered', 'true');
769
  }
770
+ document.querySelectorAll('a[href^="https://huggingface.co"]').forEach(function(link) {
771
+ link.setAttribute('target', '_blank');
772
+ });
773
  </script>
774
  </body>
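
Note on the duplicated-lines filter touched by this diff: the text defines `line-char-duplicates` as the number of characters in duplicated lines divided by the total number of characters, and removes documents where that fraction is ≥ 0.1. A minimal sketch of that check might look like the following. The function names are hypothetical, and two details are assumptions not stated in the diff: lines are trimmed and empty lines skipped, and only the second and later occurrences of a line count as duplicated.

```javascript
// Hypothetical helpers for illustration only; not part of index.html
// and not taken from any library.

// Fraction of characters that sit in duplicated lines.
// Assumption: a line counts as "duplicated" from its second occurrence on.
// Assumption: lines are trimmed and empty lines are ignored.
function duplicatedLineCharFraction(text) {
  const lines = text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);

  const seen = new Set();
  let totalChars = 0;
  let duplicatedChars = 0;

  for (const line of lines) {
    totalChars += line.length;
    if (seen.has(line)) {
      duplicatedChars += line.length;
    } else {
      seen.add(line);
    }
  }

  // A document with no non-empty lines has nothing duplicated.
  return totalChars === 0 ? 0 : duplicatedChars / totalChars;
}

// Keep a document only if the fraction is strictly below the threshold,
// mirroring "remove documents where the fraction is >= 0.1".
function keepDocument(text, threshold = 0.1) {
  return duplicatedLineCharFraction(text) < threshold;
}
```

With these assumptions, a three-line document whose first and last lines are identical has one third of its characters in duplicated lines, so it would be dropped both at the 0.1 threshold and at the original MassiveText threshold of 0.2.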