Merge branch 'main' of hf.co:spaces/HuggingFaceFW/blogpost-fineweb-v1
- bibliography.bib +25 -0
- index.html +42 -19
- style.css +4 -0
bibliography.bib
CHANGED
@@ -209,4 +209,29 @@ url = {https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
 eprint={2401.04088},
 archivePrefix={arXiv},
 primaryClass={cs.LG}
 }
+@article{yuan2024self,
+title={Self-rewarding language models},
+author={Yuan, Weizhe and Pang, Richard Yuanzhe and Cho, Kyunghyun and Sukhbaatar, Sainbayar and Xu, Jing and Weston, Jason},
+journal={arXiv preprint arXiv:2401.10020},
+year={2024}
+}
+@article{verga2024replacing,
+title={Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models},
+author={Verga, Pat and Hofstatter, Sebastian and Althammer, Sophia and Su, Yixuan and Piktus, Aleksandra and Arkhangorodsky, Arkady and Xu, Minjie and White, Naomi and Lewis, Patrick},
+journal={arXiv preprint arXiv:2404.18796},
+year={2024}
+}
+@article{abdin2024phi,
+title={Phi-3 technical report: A highly capable language model locally on your phone},
+author={Abdin, Marah and Jacobs, Sam Ade and Awan, Ammar Ahmad and Aneja, Jyoti and Awadallah, Ahmed and Awadalla, Hany and Bach, Nguyen and Bahree, Amit and Bakhtiari, Arash and Behl, Harkirat and others},
+journal={arXiv preprint arXiv:2404.14219},
+year={2024}
+}
+@misc{meta2024responsible,
+title = {Our responsible approach to Meta AI and Meta Llama 3},
+author = {Meta},
+year = {2024},
+url = {https://ai.meta.com/blog/meta-llama-3-meta-ai-responsibility/},
+note = {Accessed: 2024-05-31}
+}
index.html
CHANGED
@@ -11,6 +11,7 @@
 <link rel="stylesheet" href="style.css">
 <meta name="viewport" content="width=device-width, initial-scale=1">
 <meta charset="utf8">
+<base target="_blank">
 <title>FineWeb: 15T tokens of high quality web data</title>
 <style>
 
@@ -325,7 +326,7 @@
 </li>
 </ul>
 <ul>
-<li>Applied quality and repetition filters from
+<li>Applied quality and repetition filters from MassiveText<d-cite bibtex-key="rae2022scaling"></d-cite> (using the default thresholds)
 </li>
 </ul>
 <p>After applying this filtering to each of the text
@@ -581,7 +582,7 @@
 minhashed version and the result from the (worse quality) full dedup from 2013-48 and 2015-22 crawls (older crawls). We then compared the
 statistics at a macro level, by looking at the distribution of these metrics for each one.</p>
 <p>The collected statistics ranged from common document-level
-metrics (e.g. number of lines, avg. line/word length, etc) to inter-document repetition metrics (
+metrics (e.g. number of lines, avg. line/word length, etc) to inter-document repetition metrics (MassiveText
 inspired). Perhaps not too surprisingly given our findings for deduplication, we found significant
 disparities in most of the metrics for the two deduplication methods. For instance, the <code>line-char-duplicates</code>
 metric (nb. of characters in duplicated lines / nb. characters), roughly doubled from the independent dedup
@@ -611,7 +612,7 @@
 </ul>
 <ul>
 <li>Remove documents where the fraction of characters in duplicated lines ≥ 0.1
-(12.47% of tokens removed) – the original
+(12.47% of tokens removed) – the original MassiveText threshold for this ratio is ≥ 0.2
 </li>
 </ul>
 <ul>
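For readers following the filtering hunks above: the <code>line-char-duplicates</code> metric is easy to pin down in code. A minimal sketch, assuming the straightforward reading of "nb. of characters in duplicated lines / nb. characters"; datatrove's actual implementation may count repeated occurrences differently, and the threshold wiring here is illustrative:

```python
from collections import Counter

def line_char_duplicates(text: str) -> float:
    """nb. of characters in duplicated lines / nb. of characters."""
    lines = text.splitlines()
    total_chars = sum(len(line) for line in lines)
    if total_chars == 0:
        return 0.0
    counts = Counter(lines)
    # Count every occurrence of any line that appears more than once.
    dup_chars = sum(len(line) * n for line, n in counts.items() if n > 1)
    return dup_chars / total_chars

# The filter above drops documents with a ratio >= 0.1
# (MassiveText's original threshold was 0.2).
def keep_document(text: str) -> bool:
    return line_char_duplicates(text) < 0.1
```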
@@ -684,16 +685,16 @@
 <div id="plot-dataset_ablations"></div>
 </div>
 <h2>📚 FineWeb-Edu</h2>
-<p>A new approach has recently emerged for filtering LLM training datasets: using synthetic data to develop classifiers for identifying educational content. This technique was used in the trainings of <
+<p>A new approach has recently emerged for filtering LLM training datasets: using synthetic data to develop classifiers for identifying educational content. This technique was used in the training of Llama 3<d-cite bibtex-key="llama3modelcard"></d-cite> and Phi3<d-cite bibtex-key="abdin2024phi"></d-cite>, but its large-scale impact on web data filtering hasn't been fully explored or published.</p>
-<p>The popular Phi3 models were trained on 3.3 and 4.8 trillion tokens, with the
+<p>The popular Phi3 models were trained on 3.3 and 4.8 trillion tokens, with the paper<d-cite bibtex-key="abdin2024phi"></d-cite> stating:</p>
 <blockquote>Our training data consists of heavily filtered publicly available web data (according to the 'educational level') from various open internet sources, as well as synthetic LLM-generated data.</blockquote>
-<p>Similarly,
+<p>Similarly, the Llama 3 blog post<d-cite bibtex-key="meta2024responsible"></d-cite> notes:</p>
 <blockquote>We found that previous generations of Llama are good at identifying high-quality data, so we used Llama 2 to help build the text-quality classifiers that are powering Llama 3.</blockquote>
-<p>However, these classifiers and filtered datasets are not publicly available. To enhance 🍷 FineWeb's quality, we developed an educational quality classifier using annotations generated by <a href="https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct">
+<p>However, these classifiers and filtered datasets are not publicly available. To enhance 🍷 FineWeb's quality, we developed an educational quality classifier using annotations generated by <a href="https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct">Llama-3-70B-Instruct</a> to create 📚 FineWeb-Edu.</p>
 <h3>Annotation</h3>
-<p>We used
-<p>We explored various prompts and found that the additive scale by
+<p>We used Llama-3-70B-Instruct to annotate 500k samples from the 🍷 FineWeb dataset, scoring each for their educational quality on a scale from 0 to 5.</p>
+<p>We explored various prompts and found that the additive scale by Yuan et al.<d-cite bibtex-key="yuan2024self"></d-cite> worked best. This scale allows the LLM to reason about each additional point awarded, unlike the single-rating Likert scale, which fits samples into predefined boxes. Then, to avoid the LLM favoring highly technical pages like arXiv abstracts and submissions, we focused on grade-school and middle-school level knowledge. By setting a threshold of 3 (on a scale of 0 to 5) during the filtering process, we were able to also retain some high-level educational pages.</p>
 <div style="text-align: center; margin: 20px 0;">
 <img src="https://cdn-uploads.huggingface.co/production/uploads/61c141342aac764ce1654e43/fjZQ4izIj1rx1xQnBTKKr.png" alt="Prompt for LLM annotation" style="width: 90%; max-width: 800px; height: auto;">
 <figcaption style="font-style: italic; margin-top: 10px;">Prompt used for Llama3 annotations of the educational score, also available <a href="https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier/blob/main/utils/prompt.txt">here</a>.</figcaption>
 </div>
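The annotation step added above is reproducible in outline. A hedged sketch, assuming the chat-style `text-generation` pipeline and a final "Educational score: X" line as additive-scale prompts typically request; the prompt string below is a loose paraphrase, not the actual prompt (which is linked in the figcaption):

```python
import re
from transformers import pipeline

# Paraphrase of the additive 0-5 prompt; the real prompt is linked above.
PROMPT = (
    "Below is an extract from a web page. Evaluate it on a 5-point additive "
    "educational-value scale: add one point per criterion met, from basic "
    "relevance up to outstanding grade-school teaching value. Conclude with "
    "the line 'Educational score: <total points>'.\n\nExtract: {text}"
)

generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # the annotator used above
    device_map="auto",
)

def annotate(text: str) -> int | None:
    messages = [{"role": "user", "content": PROMPT.format(text=text[:3000])}]
    # Chat-style pipelines return the full message list, with the reply last.
    reply = generator(messages, max_new_tokens=512)[0]["generated_text"][-1]["content"]
    match = re.search(r"Educational score:\s*([0-5])", reply)
    return int(match.group(1)) if match else None
```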
@@ -700,5 +701,5 @@
-<p>We also experimented with <a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1">Mixtral-8x-7B-Instruct</a> and <a href="https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1">Mixtral-8x22B-Instruct</a> and a jury of all three models
+<p>We also experimented with <a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1">Mixtral-8x7B-Instruct</a> and <a href="https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1">Mixtral-8x22B-Instruct</a> and a jury of all three models<d-cite bibtex-key="verga2024replacing"></d-cite>, but found that Llama3 alone gave the most reliable results.</p>
 <h3>Classifier Training</h3>
-<p>We added a classification head with a single regression output to <a href="https://huggingface.co/Snowflake/snowflake-arctic-embed-m">Snowflake-arctic-embed</a> and trained it on 450,000
+<p>We added a classification head with a single regression output to <a href="https://huggingface.co/Snowflake/snowflake-arctic-embed-m">Snowflake-arctic-embed</a> and trained it on 450,000 Llama 3 annotations for 20 epochs with a learning rate of 3e-4, freezing the embedding and encoder layers. We saved the checkpoint with the highest F1 score on our held-out validation set of 45k samples, treating Llama 3 annotations as ground truth. After training, we rounded the scores to integers from 0 to 5.</p>
 <p>We then converted the problem to a binary classification task by using a fixed threshold to determine if a sample is educational. With a threshold of 3, the model achieved an F1 score of 82% on the validation set, indicating strong performance in distinguishing high-quality educational content.</p>
-<p>The classifier is available at: <a href="https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier">
+<p>The classifier is available at <a href="https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier">HuggingFaceFW/fineweb-edu-classifier</a>. The training and inference code is available on <a href="https://github.com/huggingface/cosmopedia/tree/main/classification">GitHub</a>.</p>
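A minimal sketch of the classifier-training recipe described in the hunk above, assuming `AutoModelForSequenceClassification` with `num_labels=1` as the "classification head with a single regression output" and MSE regression against the Llama 3 scores; the real training script lives in the linked cosmopedia repo and may differ:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

base = "Snowflake/snowflake-arctic-embed-m"
tokenizer = AutoTokenizer.from_pretrained(base)
# num_labels=1 with a regression problem type = a single regression output.
model = AutoModelForSequenceClassification.from_pretrained(
    base, num_labels=1, problem_type="regression"
)

# Freeze the embedding and encoder layers; only the new head trains.
for name, param in model.named_parameters():
    if not name.startswith("classifier"):
        param.requires_grad = False

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=3e-4
)

def train_step(texts: list[str], llama_scores: list[float]) -> float:
    """One MSE step against the Llama 3 educational scores (0-5)."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    labels = torch.tensor(llama_scores).unsqueeze(-1)
    out = model(**batch, labels=labels)  # regression => MSE loss internally
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```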
@@ -705,12 +706,12 @@
 <h3>Filtering and results</h3>
-<p>We applied the classifier to the 15T tokens of 🍷 FineWeb, a process that required 6,000 H100 GPU hours. We investigated the impact of using different thresholds for the filtering and found that threshold 3 gave the best results. The plot below shows the performance of each threshold compared to FineWeb on six different benchmarks; it uses a 1.82B model trained on 8B tokens.</p>
+<p>We applied the classifier to the 15T tokens of 🍷 FineWeb, a process that required 6,000 H100 GPU hours. We investigated the impact of using different thresholds for the filtering and found that threshold 3 gave the best overall results. Although using a threshold higher than 3 improves performance on knowledge- and reasoning-intensive benchmarks, it significantly degrades performance on HellaSwag and PIQA. The plot below shows the performance of each threshold compared to FineWeb on six different benchmarks; it uses a 1.82B model trained on 8B tokens.</p>
 <div class="main-plot-container">
 <figure>
 <img src="plots/edu-8k.png">
 </figure>
 <div id="plot-edu-8k"></div>
 </div>
-<p>We then built 📚 FineWeb-Edu by filtering out samples with scores lower than 3. This removed 92% of the dataset, leaving us with 1.
+<p>We then built 📚 FineWeb-Edu by filtering out samples with scores lower than 3. This removed 92% of the dataset, leaving us with 1.3 trillion educational tokens. To evaluate the effectiveness of this filtering at a larger scale, we conducted an ablation using a 1.82B model trained on 350 billion tokens, similar to the FineWeb filtering ablation mentioned above:</p>
 <div class="main-plot-container">
 <figure>
 <img src="plots/edu-100k.png">
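And a short usage sketch for the released classifier referenced above, following the pattern on its model card (a single logit read as a 0-5 educational score, with threshold 3 for 📚 FineWeb-Edu); treat the clamping and rounding details as indicative rather than canonical:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

repo = "HuggingFaceFW/fineweb-edu-classifier"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo)

def edu_score(text: str) -> float:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.squeeze().item()  # single regression output, roughly 0-5

text = "Photosynthesis is the process by which plants turn sunlight into food..."
score = edu_score(text)
# FineWeb-Edu keeps samples whose rounded score is >= 3.
keep = int(round(max(0.0, min(score, 5.0)))) >= 3
print(f"score={score:.2f} keep={keep}")
```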
@@ -720,12 +721,11 @@
 <p>Here are the key highlights of the ablation results above:</p>
 <ul>
 <li>📚 FineWeb-Edu surpasses 🍷 FineWeb and all other open web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA.</li>
-<li>It achieves the same performance with significantly less data, requiring 10x fewer tokens compared to C4 and
+<li>It achieves the same performance with significantly less data, requiring 10x fewer tokens than C4 and Dolma to match MMLU results.</li>
 <li>This demonstrates the effectiveness of using classifiers trained on LLM annotations for large-scale data filtering.</li>
 </ul>
-<p>Given that a threshold of 2 also demonstrated strong performance while retaining more data, we are releasing an additional dataset filtered with this threshold, containing 5.4 trillion tokens
-<p>You can find the
-<p><strong>TODO: add dataset links and a collection</strong></p>
+<p>Given that a threshold of 2 also demonstrated strong performance while retaining more data, we are releasing an additional dataset filtered with this threshold, containing 5.4 trillion tokens, under <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-score-2">HuggingFaceFW/fineweb-edu-score-2</a>.</p>
+<p>You can find the two datasets, along with the classifier used for the filtering, in this <a href="https://huggingface.co/collections/HuggingFaceFW/fineweb-edu-6659c3f3d399d0e1d648adfd">collection</a>.</p>
 <h2>Next steps</h2>
 <p>We want to continue improving FineWeb and will also
 release a technical report with more details soon.</p>
@@ -750,7 +750,7 @@
 const isException = el.getAttribute('no-toc');
 if (isInTitle || isException) continue;
 el.setAttribute('id', el.textContent.toLowerCase().replaceAll(" ", "_"))
-const link = '<a href="' + '#' + el.getAttribute('id') + '">' + el.textContent + '</a>';
+const link = '<a target="_self" href="' + '#' + el.getAttribute('id') + '">' + el.textContent + '</a>';
 
 const level = el.tagName === 'H2' ? 0 : (el.tagName === 'H3' ? 1 : 2);
 while (prevLevel < level) {
@@ -774,6 +774,29 @@
 ToC += '</nav>';
 toc.innerHTML = ToC;
 toc.setAttribute('prerendered', 'true');
+const toc_links = document.querySelectorAll('d-contents > nav a');
+
+window.addEventListener('scroll', (_event) => {
+  if (typeof (headings) != 'undefined' && headings != null && typeof (toc_links) != 'undefined' && toc_links != null) {
+    // Iterate backwards; highlight the first heading above the viewport top and break
+    find_active: {
+      for (let i = headings.length - 1; i >= 0; i--) {
+        if (headings[i].getBoundingClientRect().top - 50 <= 0) {
+          if (!toc_links[i].classList.contains("active")) {
+            toc_links.forEach((link, _index) => {
+              link.classList.remove("active");
+            });
+            toc_links[i].classList.add('active');
+          }
+          break find_active;
+        }
+      }
+      toc_links.forEach((link, _index) => {
+        link.classList.remove("active");
+      });
+    }
+  }
+});
 }
 </script>
 </body>
style.css
CHANGED
@@ -137,3 +137,7 @@ d-byline .byline {
 #title-plot {
 margin-top: 0px;
 }
+
+d-contents > nav a.active {
+  text-decoration: underline;
+}
|