hynky (HF staff) committed
Commit ea35903 · 2 Parent(s): ffc056a 5385888

Merge branch 'main' of hf.co:spaces/HuggingFaceFW/blogpost

Files changed (4)
  1. bibliography.bib +19 -0
  2. index.html +55 -50
  3. src/distill.js +0 -0
  4. style.css +13 -0
bibliography.bib CHANGED
@@ -190,4 +190,23 @@
  eprint={2205.10487},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
+ }
+ @article{llama3modelcard,
+   title={Llama 3 Model Card},
+   author={AI@Meta},
+   year={2024},
+   url={https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
+ }
+ @misc{jiang2024mixtral,
+   title={Mixtral of Experts},
+   author={Albert Q. Jiang and Alexandre Sablayrolles and Antoine Roux and Arthur Mensch and Blanche Savary and Chris Bamford and Devendra Singh Chaplot and Diego de las Casas and Emma Bou Hanna and Florian Bressand and Gianna Lengyel and Guillaume Bour and Guillaume Lample and Lélio Renard Lavaud and Lucile Saulnier and Marie-Anne Lachaux and Pierre Stock and Sandeep Subramanian and Sophia Yang and Szymon Antoniak and Teven Le Scao and Théophile Gervet and Thibaut Lavril and Thomas Wang and Timothée Lacroix and William El Sayed},
+   year={2024},
+   eprint={2401.04088},
+   archivePrefix={arXiv},
+   primaryClass={cs.LG}
  }
index.html CHANGED
@@ -1,7 +1,7 @@
  <!doctype html>

  <head>
- <script src="https://distill.pub/template.v2.js"></script>
+ <script src="src/distill.js"></script>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/mathjs/12.4.2/math.min.js" charset="utf-8"></script>
  <script src="https://cdn.plot.ly/plotly-2.32.0.min.js" charset="utf-8"></script>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/lodash.js/4.17.21/lodash.min.js" charset="utf-8"></script>
@@ -122,27 +122,19 @@
  <body>
  <d-front-matter>
  <script id='distill-front-matter' type="text/json">{
- "title": "FineWeb: 15T tokens of high quality web data",
- "description": "This blog covers the FineWeb recipe, why more deduplication is not always better and some interesting findings on the difference in quality of CommonCrawl dumps.",
+ "title": "🍷 FineWeb: decanting the web for the finest text data at scale",
+ "description": "This blog covers a discussion on processing and evaluating data quality at scale, the 🍷 FineWeb recipe (listing and explaining all of our design choices), and the process followed to create 📚 FineWeb-Edu.",
  "published": "May 28, 2024",
+ "affiliation": {"name": "HuggingFace"},
  "authors": [
  {
  "author":"Guilherme Penedo",
- "authorURL":"https://huggingface.co/guipenedo",
- "affiliations": [{"name": "HuggingFace"}]
+ "authorURL":"https://huggingface.co/guipenedo"
  },
  {
  "author":"Hynek Kydlíček",
  "authorURL":"https://huggingface.co/hynky"
  },
- {
- "author":"Leandro Werra",
- "authorURL":"https://huggingface.co/lvwerra"
- },
- {
- "author":"Thomas Wolf",
- "authorURL":"https://huggingface.co/thomwolf"
- },
  {
  "author":"Loubna Ben Allal",
  "authorURL":"https://huggingface.co/loubnabnl"
@@ -150,6 +142,18 @@
  {
  "author":"Anton Lozhkov",
  "authorURL":"https://huggingface.co/anton-l"
+ },
+ {
+ "author":"Colin Raffel",
+ "authorURL":"https://huggingface.co/craffel"
+ },
+ {
+ "author":"Leandro Werra",
+ "authorURL":"https://huggingface.co/lvwerra"
+ },
+ {
+ "author":"Thomas Wolf",
+ "authorURL":"https://huggingface.co/thomwolf"
  }
  ],
  "katex": {
@@ -174,18 +178,18 @@
  </d-contents>

  <p>We have recently released <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb"><strong>🍷 FineWeb</strong></a>, our new large scale
- (<strong>15T</strong> gpt2 tokens, <strong>44TB</strong> disk space) dataset of clean text sourced from the web for LLM pretraining. You can
+ (<strong>15T gpt2 tokens, 44TB disk space</strong>) dataset of clean text sourced from the web for LLM pretraining. You can
  download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">here</a>.</p>
- <p>[TODO: ADD MORE INTRODUCTION]</p>
- <p>We are also excited to announce the release of <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu"><strong>📚 FineWeb-Edu</strong></a>, a filtered version of FineWeb for educational content, available in two sizes: <strong>1.2 trillion and 4.5 trillion tokens</strong>. FineWeb-Edu outperforms all existing public web datasets, with notable improvements on MMLU, ARC, and OpenBookQA benchmarks. You can
+ <p>The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. However, the pretraining datasets for state-of-the-art open LLMs like Llama 3<d-cite bibtex-key="llama3modelcard"></d-cite> and Mixtral<d-cite bibtex-key="jiang2024mixtral"></d-cite> are not publicly available and very little is known about how they were created.</p>
+ <p>🍷 FineWeb, a 15-trillion token dataset derived from 96 <a href="https://commoncrawl.org/">CommonCrawl</a> snapshots, produces better-performing LLMs than other open pretraining datasets. To advance the understanding of how best to curate high-quality pretraining datasets, we carefully document and ablate all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies.</p>
+ <p>We are also excited to announce the release of <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu"><strong>📚 FineWeb-Edu</strong></a>, a version of 🍷 FineWeb that was filtered for educational content, available in two sizes: <strong>1.3 trillion (very high quality) and 5.4 trillion (high quality) tokens</strong>. 📚 FineWeb-Edu outperforms all existing public web datasets, with models pretrained on it showing notable improvements on knowledge- and reasoning-intensive benchmarks like MMLU, ARC, and OpenBookQA. You can
  download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu">here</a>.</p>

- <p>As 🍷FineWeb has gathered a lot of interest from the
+ <p>As 🍷 FineWeb has gathered a lot of interest from the
  community, we decided to further explain the steps involved in creating it, our processing decisions and
  some lessons learned along the way. Read on for all the juicy details on large text dataset creation!</p>
- <p><strong>TLDR:</strong> This blog covers the FineWeb
- recipe, why more deduplication is not always better and some interesting findings on the difference in
- quality of CommonCrawl dumps.</p>
+ <p><strong>TLDR:</strong> This blog covers a discussion on processing and evaluating data quality at scale, the 🍷 FineWeb
+ recipe (listing and explaining all of our design choices), and the process followed to create 📚 FineWeb-Edu.</p>

  <h2>General considerations on web data</h2>
  <h3>Sourcing the data</h3>
@@ -201,13 +205,13 @@
  <li>you use a public repository of crawled webpages, like the one maintained by
  the non-profit <a href="https://commoncrawl.org/">CommonCrawl</a></li>
  </ul>
- <p>For FineWeb, similarly to what was done for a large number
+ <p>For 🍷 FineWeb, similarly to what was done for a large number
  of other public datasets, we used <a href="https://commoncrawl.org/">CommonCrawl</a> as a starting point.
- They have been crawling the web since 2007 (long before LLMs were a thing) and release a new dump usually
+ They have been crawling the web since 2007 (long before LLMs became widespread) and release a new dump usually
  every 1 or 2 months, which can be freely downloaded. </p>
- <p>As an example, their latest crawl (2024-10) contains 3.16
- billion web pages, totaling 424.7 TiB of uncompressed HTML text content (the size changes from dump to dump). There
- are 95 dumps since 2013 and 3 dumps from 2008 to 2012, which are in a different (older) format.<d-footnote>We have not processed these 3 older dumps.</d-footnote> </p>
+ <p>As an example, their latest crawl (2024-18) contains 2.7
+ billion web pages, totaling 386 TiB of uncompressed HTML text content (the size changes from dump to dump). There
+ are 96 dumps since 2013 and 3 dumps from 2008 to 2012, which are in a different (older) format.<d-footnote>We have not processed these 3 older dumps.</d-footnote> </p>
  <h3>Processing at scale</h3>
  <p>Given the sheer size of the data involved, one of the main
  challenges we had to overcome was having a modular, scalable codebase that would allow us to quickly iterate
@@ -216,7 +220,7 @@
  <p>For this purpose, we developed <a
  href="https://github.com/huggingface/datatrove"><code>datatrove</code></a><d-cite bibtex-key="penedo2024datatrove"></d-cite>, an open-source data
  processing library that allowed us to seamlessly scale our filtering and deduplication setup to thousands of
- CPU cores. All the data processing steps involved in the creation of FineWeb used this <a
+ CPU cores. All the data processing steps involved in the creation of 🍷 FineWeb used this <a
  href="https://github.com/huggingface/datatrove">library</a>.</p>
  <h3>What is clean, good data?</h3>
  <p>This is probably the main question to keep in mind when
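A datatrove pipeline like the one referenced in the hunk above can be sketched roughly as follows. This is a minimal, hedged example: the class names and arguments follow datatrove's public examples, but exact signatures may differ between library versions, and the input/output paths are placeholders rather than the ones used for FineWeb.

```python
# Minimal sketch of a datatrove pipeline (class names and arguments follow the
# library's public examples; exact signatures may differ between versions).
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.extractors import Trafilatura
from datatrove.pipeline.filters import GopherQualityFilter, LanguageFilter
from datatrove.pipeline.readers import WarcReader
from datatrove.pipeline.writers.jsonl import JsonlWriter

pipeline = [
    # Read raw WARC files from a CommonCrawl dump (placeholder path/pattern).
    WarcReader("s3://commoncrawl/crawl-data/CC-MAIN-2023-50/segments/", glob_pattern="*/warc/*"),
    Trafilatura(),          # extract the main text from the HTML
    LanguageFilter(),       # keep English documents (default language)
    GopherQualityFilter(),  # heuristic quality filtering
    JsonlWriter("output/filtered/"),
]

# Shard the work across tasks/workers; a Slurm executor exists for clusters.
LocalPipelineExecutor(pipeline=pipeline, tasks=4, workers=4).run()
```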
@@ -324,7 +328,7 @@
  </li>
  </ul>
  <p>After applying this filtering to each of the text
- extracted dumps (there are currently 95 dumps) we obtained roughly 36 trillion tokens of data (when
+ extracted dumps (there are currently 96 dumps) we obtained roughly 36 trillion tokens of data (when
  tokenized with the <code>gpt2</code> tokenizer).</p>
  <h3>Deduplication</h3>
  <p>Deduplication is another important step, especially for web
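Token counts such as the 36-trillion figure above come from running the standard gpt2 tokenizer over every extracted document. A trivial illustrative sketch (the sample texts are placeholders):

```python
# Illustrative only: counting gpt2 tokens for a handful of documents
# (the `texts` list is a stand-in for extracted web documents).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
texts = ["First extracted document ...", "Second extracted document ..."]
total_tokens = sum(len(tokenizer(text)["input_ids"]) for text in texts)
print(f"{total_tokens} gpt2 tokens")
```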
@@ -361,7 +365,7 @@
  <p>It should also be noted that intra-document deduplication is already handled by our repetition filter, which removes documents with many repeated lines and paragraphs.</p>
  <h4>More deduplication is always better, right?</h4>
  <p>Our initial approach was to take the entire dataset (all
- 95 dumps) and deduplicate them as one big dataset using MinHash.</p>
+ 96 dumps) and deduplicate them as one big dataset using MinHash.</p>
  <p>We did this in an iterative manner: starting with the most
  recent dump (which at the time was 2023-50) and taking the oldest one last, we would deduplicate each dump
  not only against itself but also by removing any matches with duplicates from the previously processed
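For readers unfamiliar with MinHash deduplication, the toy sketch below shows the core idea behind the approach mentioned in the hunk above: hash word n-gram shingles, keep the minimum hash under many permutations as a signature, and treat documents whose signatures agree on many positions as near-duplicates. This is an illustration only, with arbitrary parameters, not the datatrove implementation.

```python
# Toy MinHash near-duplicate detection: not the production implementation,
# just the core idea (shingle -> hash -> keep per-permutation minima).
import hashlib
import random

NUM_HASHES = 128
random.seed(0)
# Random (a, b) pairs define NUM_HASHES different hash permutations.
PERMS = [(random.getrandbits(64) | 1, random.getrandbits(64)) for _ in range(NUM_HASHES)]
PRIME = (1 << 61) - 1

def shingles(text, n=5):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(text):
    hashed = [int.from_bytes(hashlib.sha1(s.encode()).digest()[:8], "big") for s in shingles(text)]
    return [min(((a * h + b) % PRIME) for h in hashed) for a, b in PERMS]

def estimated_jaccard(sig1, sig2):
    # Fraction of matching signature positions approximates shingle overlap.
    return sum(x == y for x, y in zip(sig1, sig2)) / NUM_HASHES

doc_a = "the quick brown fox jumps over the lazy dog and runs away"
doc_b = "the quick brown fox jumps over the lazy dog and runs off"
print(estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b)))
```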
@@ -485,18 +489,18 @@
  independently minhash deduped 20 trillion tokens of data by further deduplicating it (globally, over all crawls) with the following methods</p>
  <ul>
  <li>URL deduplication, where we only kept one document per normalized
- (lowercased) URL (71.5% of tokens removed, 5.6T left) — <em>FineWeb URL dedup</em></li>
+ (lowercased) URL (71.5% of tokens removed, 5.6T left) — <em>🍷 FineWeb URL dedup</em></li>
  </ul>
  <ul>
  <li>Line deduplication:
  <ul>
  <li>remove all but 1 (randomly chosen) occurrence of each duplicated line (77.8% of
- tokens dropped, 4.4T left) — <em>FineWeb line dedup</em></li>
+ tokens dropped, 4.4T left) — <em>🍷 FineWeb line dedup</em></li>
  </ul>
  <ul>
  <li>same as above, but only removing duplicate lines with at least 10
  words and dropping documents with fewer than 3 sentences after deduplication (85% of tokens
- dropped, 2.9T left) — <em>FineWeb line dedup w/ min words</em></li>
+ dropped, 2.9T left) — <em>🍷 FineWeb line dedup w/ min words</em></li>
  </ul>
  <ul>
  <li>remove all but 1 occurrence of each span of 3 duplicated lines
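To make the deduplication variants listed in the hunk above concrete, here is a toy sketch of the two simplest ones: keeping one document per normalized (lowercased) URL, and keeping only the first occurrence of each exact line across the corpus. It is illustrative only and omits the randomized-choice and minimum-word variants.

```python
# Toy illustration of URL dedup and exact line dedup (not the production code).
def url_dedup(docs):
    """Keep only one document per normalized (lowercased) URL."""
    seen_urls, kept = set(), []
    for doc in docs:
        url = doc["url"].lower()
        if url not in seen_urls:
            seen_urls.add(url)
            kept.append(doc)
    return kept

def line_dedup(docs):
    """Keep only the first occurrence of every exact line across the corpus."""
    seen_lines, kept = set(), []
    for doc in docs:
        remaining = []
        for line in doc["text"].splitlines():
            if line not in seen_lines:
                seen_lines.add(line)
                remaining.append(line)
        if remaining:
            kept.append({**doc, "text": "\n".join(remaining)})
    return kept

docs = [
    {"url": "https://Example.com/A", "text": "hello\nshared boilerplate"},
    {"url": "https://example.com/a", "text": "duplicate of the first"},
    {"url": "https://example.com/b", "text": "shared boilerplate\nunique line"},
]
print(line_dedup(url_dedup(docs)))
```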
@@ -529,7 +533,7 @@
  benchmark, one of the benchmarks in our “early signal” group with the strongest signal and highest
  signal-over-noise ratio. As such, it has stayed a common sub-set of typical LLM training, for instance in
  the relatively recent Llama1 model<d-cite bibtex-key="touvron2023llama"></d-cite>. We experimented with applying
- each of the different filters used in C4 to a baseline of the independently deduped FineWeb 2019-18 dump:</p>
+ each of the different filters used in C4 to a baseline of the independently deduped 🍷 FineWeb 2019-18 dump:</p>
  <div class="main-plot-container">
  <figure><img src="plots/c4_filters_hellaswag.png"/></figure>
  <div id="plot-c4_filters_hellaswag"></div>
@@ -614,7 +618,7 @@
  <div id="plot-custom-filters"></div>
  </div>
  <h2>The final dataset</h2>
- <p>The final FineWeb dataset comprises 15T tokens and
+ <p>The final 🍷 FineWeb dataset comprises 15T tokens and
  includes the following previously mentioned steps, in order, each providing a performance boost on our group
  of benchmark tasks:</p>
  <ul>
@@ -671,7 +675,7 @@
  <div id="plot-dataset_ablations"></div>
  </div>
  <p>Some histogram comparisons of C4, Dolma, RefinedWeb and
- FineWeb:</p>
+ 🍷 FineWeb:</p>
  <figure><img src="plots/Untitled%203.png"/></figure>
  <h2>📚 FineWeb-Edu</h2>
  <p>A new approach has recently emerged for filtering LLM training datasets: using synthetic data to develop classifiers for identifying educational content. This technique was used in the training of <a href="https://ai.meta.com/blog/meta-llama-3-meta-ai-responsibility/">Llama 3</a> and <a href="https://arxiv.org/abs/2404.14219">Phi3</a>, but its large-scale impact on web data filtering hasn't been fully explored or published.</p>
@@ -679,33 +683,34 @@
  <blockquote>Our training data consists of heavily filtered publicly available web data (according to the 'educational level') from various open internet sources, as well as synthetic LLM-generated data.</blockquote>
  <p>Similarly, the <a href="https://ai.meta.com/blog/meta-llama-3-meta-ai-responsibility/">Llama 3 blog post</a> notes:</p>
  <blockquote>We found that previous generations of Llama are good at identifying high-quality data, so we used Llama 2 to help build the text-quality classifiers that are powering Llama 3.</blockquote>
- <p>However, these classifiers and filtered datasets are not publicly available. To enhance FineWeb's quality, we developed an educational quality classifier using annotations generated by <a href="https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct">Llama3-70B-Instruct</a> to create FineWeb-Edu.</p>
+ <p>However, these classifiers and filtered datasets are not publicly available. To enhance 🍷 FineWeb's quality, we developed an educational quality classifier using annotations generated by <a href="https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct">Llama3-70B-Instruct</a> to create 📚 FineWeb-Edu.</p>
  <h3>Annotation</h3>
- <p>We used Llama3-70B-Instruct to annotate 500k samples from the FineWeb dataset, scoring each for their educational quality on a scale from 0 to 5.</p>
+ <p>We used Llama3-70B-Instruct to annotate 500k samples from the 🍷 FineWeb dataset, scoring each for their educational quality on a scale from 0 to 5.</p>
  <p>We explored various prompts and found that the additive scale by <a href="https://arxiv.org/pdf/2401.10020">Yuan et al.</a> worked best. This scale allows the LLM to reason about each additional point awarded, unlike the single-rating Likert scale which fits samples into predefined boxes. Then, to avoid the LLM favoring highly technical pages like arXiv abstracts and submissions, we focused on grade-school and middle-school level knowledge. By setting a threshold of 3 (on a scale of 0 to 5) during the filtering process, we were able to also retain some high-level educational pages.</p>
  <div style="text-align: center; margin: 20px 0;">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/61c141342aac764ce1654e43/fjZQ4izIj1rx1xQnBTKKr.png" alt="Prompt for LLM annotation" style="width: 90%; max-width: 800px; height: auto;">
- <figcaption style="font-style: italic; margin-top: 10px;">Prompt used for Llama3 annotations of the educational score</figcaption>
- </div>
- <p>We also experimented with different LLMs: <a href="https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct">Llama3-70B-Instruct</a>, <a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1">Mixtral-8x-7B-Instruct</a>, and <a href="https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1">Mixtral-8x22B-Instruct</a>. Llama3 and Mixtral-8x22B produced similar scores, while Mixtral-8x7B tended to be more generous, not fully adhering to the score scale. <a href="https://arxiv.org/abs/2404.18796">Verga et al.</a> suggest using multiple LLMs as juries. We tried averaging the scores from the three models, but this shifted the distribution to the right due to the higher scores from Mixtral-8x7B. Training on a dataset filtered with a classifier using jury annotations performed worse than using a classifier based on Llama3 annotations. We hypothesize that the jury-based approach retains more low-quality samples.</p>
- <div style="text-align: center; margin: 20px 0;">
- <img src="https://cdn-uploads.huggingface.co/production/uploads/61c141342aac764ce1654e43/dQskZA-4fsk8aR_8g9evJ.png" style="width: 80%; max-width: 700px; height: auto;"></figure>
+ <figcaption style="font-style: italic; margin-top: 10px;">Prompt used for Llama3 annotations of the educational score, also available <a href="https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier/blob/main/utils/prompt.txt">here</a>.</figcaption>
  </div>
+ <p>We also experimented with <a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1">Mixtral-8x-7B-Instruct</a> and <a href="https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1">Mixtral-8x22B-Instruct</a> and a jury of all three models following <a href="https://arxiv.org/abs/2404.18796">Verga et al.</a>, but found that Llama3 alone gave the most reliable results.</p>
  <h3>Classifier Training</h3>
- <p>We added a classification head with a single regression output to <a href="https://huggingface.co/Snowflake/snowflake-arctic-embed-m">Snowflake-arctic-embed</a> and trained it on 450,000 Llama3 annotations for 20 epochs with a learning rate of 3e-4, freezing the embedding and encoder layers. We saved the checkpoint with the highest F1 score on our validation set. After training, we rounded the scores to integers from 0 to 5. This approach resulted in the model achieving an F1 score of 82%, indicating robust performance in distinguishing high-quality educational content.</p>
+ <p>We added a classification head with a single regression output to <a href="https://huggingface.co/Snowflake/snowflake-arctic-embed-m">Snowflake-arctic-embed</a> and trained it on 450,000 Llama3 annotations for 20 epochs with a learning rate of 3e-4, freezing the embedding and encoder layers. We saved the checkpoint with the highest F1 score on our held-out validation set of 50,000 samples, treating Llama3 annotations as ground truth. After training, we rounded the scores to integers from 0 to 5.</p>
+ <p>We then converted the problem to a binary classification task by using a fixed threshold to determine if a file is educational. With a threshold of 3, the model achieved an F1 score of 82% on the validation set, indicating strong performance in distinguishing high-quality educational content.</p>
  <p>The classifier is available at: <a href="https://huggingface.co/HuggingFaceTB/snowflake_m_edu_reg">https://huggingface.co/HuggingFaceTB/snowflake_m_edu_reg</a>. The training and inference code is available on <a href="https://github.com/huggingface/cosmopedia/tree/edu-classifier/classification">GitHub</a>.</p>
  <p><strong>TODO: fill model card and move the github code to another folder</strong></p>
- <h3>Filtering</h3>
- <p>We applied the classifier to the 15T tokens of FineWeb, a process that required 6,000 H100 GPU hours. To build FineWeb-Edu, we filtered out samples with scores lower than 3. This removed 92% of the dataset, leaving us with 1.2T educational tokens. Here are the key highlights of the ablation results:</p>
- <ul>
- <li>FineWeb-Edu surpasses FineWeb and all other web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA.</li>
+ <h3>Filtering and results</h3>
+ <p>We applied the classifier to the 15T tokens of 🍷 FineWeb, a process that required 6,000 H100 GPU hours. We investigated the impact of using different thresholds for the filtering and found that threshold 3 gave the best results. The plot below shows the performance of each threshold compared to FineWeb on six different benchmarks; it uses a 1.82B model trained on 8B tokens.</p>
+ <p><strong>TODO: add the plot</strong></p>
+ <p>We then built 📚 FineWeb-Edu by filtering out samples with scores lower than 3. This removed 92% of the dataset, leaving us with 1.2T educational tokens. To evaluate the effectiveness of this filtering at a larger scale, we conducted an ablation using a 1.82B model trained on 350 billion tokens, similar to the FineWeb filtering ablation mentioned above:</p>
+ <p><strong>TODO: add the plot</strong></p>
+ <p>Here are the key highlights of the ablation results above:</p>
+ <ul>
+ <li>📚 FineWeb-Edu surpasses 🍷 FineWeb and all other open web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA.</li>
  <li>It achieves the same performance with significantly less data, requiring 10x fewer tokens compared to C4 and Dolma1.7 to match MMLU results.</li>
- <li>It gives strong performance boosts on benchmarks like MMLU and ARC without trying to overfit on them.</li>
  <li>This demonstrates the effectiveness of using classifiers trained on LLM annotations for large-scale data filtering.</li>
  </ul>
- <p>To keep more tokens, we also experimented with a less strict threshold of 2 instead of 3. This approach preserved 4.5T tokens and still outperformed the FineWeb dataset, with performance just slightly below that of threshold 3.</p>
- <p>We release these two datasets as FineWeb-Edu and FineWeb-edu-Large along with the classifier used for the filtering.</p>
- <p><strong>TODO: add ablation results and dataset links, and maybe FineWeb-edu-smol</strong></p>
+ <p>Given that a threshold of 2 also demonstrated strong performance while retaining more data, we are releasing an additional dataset filtered with this threshold, containing 5.4 trillion tokens. Additionally, for research purposes, we are providing the dataset filtered with a threshold of 4 with 300 billion tokens.</p>
+ <p>You can find the three datasets along with the classifier used for the filtering in this collection: TODO</p>
+ <p><strong>TODO: add dataset links and a collection</strong></p>
  <h2>Next steps</h2>
  <p>We want to continue improving FineWeb and will also
  release a technical report with more details soon.</p>
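The classifier training described in the hunk above (a single regression output on top of a frozen Snowflake-arctic-embed encoder, trained on Llama3 scores and binarized at a threshold of 3) can be sketched as follows. The model name and hyperparameters are taken from the text; the annotation file, column names, and metric wiring are placeholders, and the released training code linked in the hunk remains the reference.

```python
# Hedged sketch: regression head on a frozen Snowflake-arctic-embed-m encoder,
# trained on Llama3 educational-quality scores. File and column names
# ("llama3_annotations.jsonl", "text", "score") are placeholders.
import numpy as np
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "Snowflake/snowflake-arctic-embed-m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=1, problem_type="regression")

# Freeze embeddings and encoder; only the classification head is trained.
for param in model.base_model.parameters():
    param.requires_grad = False

ds = load_dataset("json", data_files="llama3_annotations.jsonl")["train"]
ds = ds.map(lambda x: {"labels": float(x["score"])})
ds = ds.map(lambda x: tokenizer(x["text"], truncation=True), batched=True)
ds = ds.train_test_split(test_size=0.1, seed=0)

def compute_metrics(eval_pred):
    preds, labels = eval_pred
    # Round regression outputs to integer scores, then binarize at threshold 3.
    preds = np.clip(np.round(preds.squeeze()), 0, 5)
    return {"f1": f1_score(labels >= 3, preds >= 3)}

args = TrainingArguments(
    output_dir="edu-classifier", learning_rate=3e-4, num_train_epochs=20,
    evaluation_strategy="epoch", save_strategy="epoch",
    load_best_model_at_end=True, metric_for_best_model="f1",
)
trainer = Trainer(model=model, args=args, train_dataset=ds["train"],
                  eval_dataset=ds["test"], tokenizer=tokenizer,
                  compute_metrics=compute_metrics)
trainer.train()
```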
 
src/distill.js ADDED
The diff for this file is too large to render. See raw diff
 
style.css CHANGED
@@ -120,3 +120,16 @@
  display: flex !important;
  }
  }
+
+ d-byline .byline {
+   grid-template-columns: 1fr;
+   grid-column: text;
+   font-size: 0.9rem;
+   line-height: 1.8em;
+ }
+
+ @media (min-width: 768px) {
+   d-byline .byline {
+     grid-template-columns: 5fr 1fr 1fr;
+   }
+ }