loubnabnl HF Staff committed on May 31, 2024

Commit 23b94b1

Parents: 467b7af, d209f57

Merge branch 'main' of https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1

Files changed (1)

index.html +5 -4

@@ -11,6 +11,7 @@
     <link rel="stylesheet" href="style.css">
     <meta name="viewport" content="width=device-width, initial-scale=1">
     <meta charset="utf8">
     <title>FineWeb: 15T tokens of high quality web data</title>
     <style>
@@ -326,7 +327,7 @@
         </li>
     </ul>
     <ul>
-        <li>Applied quality and repetition filters from the Gopher<d-cite bibtex-key="rae2022scaling"></d-cite> paper (using the default thresholds)
         </li>
     </ul>
     <p>After applying this filtering to each of the text
@@ -577,7 +578,7 @@
         minhashed version and the result from the (worse quality) full dedup from 2013-48 and 2015-22 crawls (older crawls). We then compared the
         statistics at a macro level, by looking at the distribution of these metrics for each one.</p>
     <p>The collected statistics ranged from common document-level
-        metrics (e.g. number of lines, avg. line/word length, etc) to inter-document repetition metrics (gopher
         inspired). Perhaps not too surprisingly given our findings for deduplication, we found significant
         disparities in most of the metrics for the two deduplication methods. For instance, the <code>line-char-duplicates</code>
         metric (nb. of characters in duplicated lines / nb. characters), roughly doubled from the independent dedup
@@ -604,7 +605,7 @@
     </ul>
     <ul>
         <li>Remove documents where the fraction of characters in duplicated lines ≥ 0.1
-            (12.47% of tokens removed) — the original Gopher threshold for this ratio is ≥ 0.2
         </li>
     </ul>
     <ul>
@@ -742,7 +743,7 @@
             const isException = el.getAttribute('no-toc');
             if (isInTitle || isException) continue;
             el.setAttribute('id', el.textContent.toLowerCase().replaceAll(" ", "_"))
-            const link = '<a href="' + '#' + el.getAttribute('id') + '">' + el.textContent + '</a>';
             const level = el.tagName === 'H2' ? 0 : (el.tagName === 'H3' ? 1 : 2);
             while (prevLevel < level) {

     <link rel="stylesheet" href="style.css">
     <meta name="viewport" content="width=device-width, initial-scale=1">
     <meta charset="utf8">
+    <base target="_blank">
     <title>FineWeb: 15T tokens of high quality web data</title>
     <style>
         </li>
     </ul>
     <ul>
+        <li>Applied quality and repetition filters from MassiveText<d-cite bibtex-key="rae2022scaling"></d-cite> (using the default thresholds)
         </li>
     </ul>
     <p>After applying this filtering to each of the text
         minhashed version and the result from the (worse quality) full dedup from 2013-48 and 2015-22 crawls (older crawls). We then compared the
         statistics at a macro level, by looking at the distribution of these metrics for each one.</p>
     <p>The collected statistics ranged from common document-level
+        metrics (e.g. number of lines, avg. line/word length, etc) to inter-document repetition metrics (MassiveText
         inspired). Perhaps not too surprisingly given our findings for deduplication, we found significant
         disparities in most of the metrics for the two deduplication methods. For instance, the <code>line-char-duplicates</code>
         metric (nb. of characters in duplicated lines / nb. characters), roughly doubled from the independent dedup
     </ul>
     <ul>
         <li>Remove documents where the fraction of characters in duplicated lines ≥ 0.1
+            (12.47% of tokens removed) — the original MassiveText threshold for this ratio is ≥ 0.2
         </li>
     </ul>
     <ul>
             const isException = el.getAttribute('no-toc');
             if (isInTitle || isException) continue;
             el.setAttribute('id', el.textContent.toLowerCase().replaceAll(" ", "_"))
+            const link = '<a target="_self" href="' + '#' + el.getAttribute('id') + '">' + el.textContent + '</a>';
             const level = el.tagName === 'H2' ? 0 : (el.tagName === 'H3' ? 1 : 2);
             while (prevLevel < level) {
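
The two link-related changes in this commit work together: `<base target="_blank">` makes every link on the page open in a new tab by default, and the table-of-contents links opt out of that default with `target="_self"` so in-page anchors stay in the current tab. Below is a minimal sketch of that link construction, runnable outside the browser. `headingId` and `tocLink` are hypothetical helper names introduced here for illustration; they do not appear in `index.html`, which builds the string inline from DOM elements.

```javascript
// Hypothetical helpers mirroring the changed lines; not part of index.html.

// Same transformation the page applies to heading text:
// lowercase, then every space replaced by an underscore.
function headingId(text) {
    return text.toLowerCase().replaceAll(" ", "_");
}

// Build a table-of-contents entry for a heading.
// target="_self" overrides a page-wide <base target="_blank">,
// so the anchor navigates within the current tab.
function tocLink(text) {
    const id = headingId(text);
    return '<a target="_self" href="#' + id + '">' + text + '</a>';
}

console.log(tocLink("Data Quality Filtering"));
// prints: <a target="_self" href="#data_quality_filtering">Data Quality Filtering</a>
```

Without the explicit `target="_self"`, a link such as `href="#data_quality_filtering"` would inherit the `_blank` default from `<base>` and open the same page in a new tab instead of scrolling to the section.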