HuggingFaceFW/blogpost-fineweb-v1

guipenedo committed on May 31, 2024

Commit d8b9b7e (1 parent: de35e29)

massivetext nit and a _blank

Files changed (1)

index.html +6 -3

@@ -326,7 +326,7 @@
         </li>
     </ul>
     <ul>
-        <li>Applied quality and repetition filters from the Gopher<d-cite bibtex-key="rae2022scaling"></d-cite> paper (using the default thresholds)
+        <li>Applied quality and repetition filters from MassiveText<d-cite bibtex-key="rae2022scaling"></d-cite> (using the default thresholds)
         </li>
     </ul>
     <p>After applying this filtering to each of the text
@@ -577,7 +577,7 @@
         minhashed version and the result from the (worse quality) full dedup from 2013-48 and 2015-22 crawls (older crawls). We then compared the
         statistics at a macro level, by looking at the distribution of these metrics for each one.</p>
     <p>The collected statistics ranged from common document-level
-        metrics (e.g. number of lines, avg. line/word length, etc) to inter-document repetition metrics (gopher
+        metrics (e.g. number of lines, avg. line/word length, etc) to inter-document repetition metrics (MassiveText
         inspired). Perhaps not too surprisingly given our findings for deduplication, we found significant
         disparities in most of the metrics for the two deduplication methods. For instance, the <code>line-char-duplicates</code>
         metric (nb. of characters in duplicated lines / nb. characters), roughly doubled from the independent dedup
@@ -604,7 +604,7 @@
     </ul>
     <ul>
         <li>Remove documents where the fraction of characters in duplicated lines ≥ 0.1
-            (12.47% of tokens removed) — the original Gopher threshold for this ratio is ≥ 0.2
+            (12.47% of tokens removed) — the original MassiveText threshold for this ratio is ≥ 0.2
         </li>
     </ul>
     <ul>
@@ -767,5 +767,8 @@
         toc.innerHTML = ToC;
         toc.setAttribute('prerendered', 'true');
     }
+    document.querySelectorAll('a[href^="https://huggingface.co"]').forEach(function(link) {
+        link.setAttribute('target', '_blank');
+    });
 </script>
 </body>
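
Note on the line-char-duplicates filter touched by this diff: the text defines the metric as the number of characters in duplicated lines divided by the number of characters, with documents removed at a ratio ≥ 0.1. The sketch below is one possible reading of that definition, not the code used for FineWeb. The function names are hypothetical, and two choices are assumptions: blank lines are ignored, and only repeat occurrences of a line (not its first appearance) count as duplicated.

```javascript
// Hypothetical helper: fraction of characters that sit in duplicated lines.
// Assumption: lines are trimmed, blank lines are skipped, and the first
// occurrence of a line is NOT counted as duplicated; only later repeats are.
function lineCharDuplicateFraction(text) {
    const lines = text
        .split("\n")
        .map((line) => line.trim())
        .filter((line) => line.length > 0);

    const seen = new Set();
    let totalChars = 0;
    let duplicatedChars = 0;

    for (const line of lines) {
        totalChars += line.length;
        if (seen.has(line)) {
            // This exact line appeared earlier in the document.
            duplicatedChars += line.length;
        } else {
            seen.add(line);
        }
    }

    // An empty document has no duplicated characters by convention here.
    return totalChars === 0 ? 0 : duplicatedChars / totalChars;
}

// Hypothetical filter: 0.1 is the threshold quoted in the diff,
// against an original MassiveText threshold of 0.2.
function shouldRemove(text, threshold = 0.1) {
    return lineCharDuplicateFraction(text) >= threshold;
}
```

Under these assumptions, a three-line document in which one line repeats an earlier one exactly has a ratio of one third and would be removed at either threshold.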