thomwolf (HF staff) committed
Commit 4e5005c · 1 Parent(s): 48424a8
Files changed (2)
  1. dist/index.html +42 -32
  2. src/index.html +42 -32
dist/index.html CHANGED
@@ -88,8 +88,8 @@
88
  <p><strong>TLDR:</strong> This blog covers a discussion on processing and evaluating data quality at scale, the 🍷 FineWeb
89
  recipe (listing and explaining all of our design choices), and the process followed to create its 📚 FineWeb-Edu subset.</p>
90
 
91
- <h2>General considerations on web data</h2>
92
- <h3>Sourcing the data</h3>
93
  <p>A common question regarding web datasets used
94
  to train LLMs is “where do they even get all that data?”. There are generally two options:</p>
95
  <ul>
@@ -222,7 +222,7 @@
222
  <div id="plot-wet_comparison"></div>
223
  </div>
224
 
225
- <h3>Base filtering</h3>
226
  <p>Filtering is an important part of the curation process. It consists of
227
  removing part of the data (which can mean removing words, lines, or even full documents) that lowers the performance of the model and is thus
228
  deemed to be “lower quality” in our eval-driven process of dataset crafting.</p>
@@ -245,7 +245,7 @@
245
  </ul>
246
  <p>After applying this filtering to each of the text
247
  extracted dumps (there are currently 96 dumps) we obtained roughly 36 trillion tokens of data<d-footnote>As everywhere in this report: this is the number of tokens when tokenized with the <code>gpt2</code> tokenizer</d-footnote>.</p>
248
- <h3>Deduplication</h3>
249
  <p>Deduplication is one of the most important steps when creating large web datasets for LLM pretraining. Methods to deduplicate datasets attempt to identify and remove redundant/repeated data from the dataset. </p>
250
  <h4>Why deduplicate?</h4>
251
  <p>The web has many aggregators, mirrors, templated pages or
@@ -431,7 +431,7 @@
431
  <div id="plot-dedup_attempts"></div>
432
  </div>
433
 
434
- <h3>Additional filtering</h3>
435
  <p>By this point we had reached the same performance as the previous work we attempted to reproduce and extend:
436
  RefinedWeb, using our base filtering and independent MinHash. Still, on our aggregate of tasks, another heavily filtered dataset, the C4 dataset<d-cite bibtex-key="raffel2023exploring"></d-cite>, showed stronger performance on some benchmarks of our evaluation suite.</p>
437
  <p>We therefore set out to find new filtering steps that
@@ -496,8 +496,8 @@
496
  metric (nb. of characters in duplicated lines / nb. characters), roughly doubled from the independent dedup
497
  (0.0053 for 2015-22 and 0.0058 for 2013-48), to the global dedup (0.011 for 2015-22 and 0.01 for 2013-48),
498
  indicating that the latter had higher inter-document repetition.</p>
499
- <p>Following the process listed above for these datasets yielded 17 candidate
500
- metric-threshold pairs. In the image below, you can see 3 of these histograms:</p>
501
  <div class="main-plot-container">
502
  <figure><img src="assets/images/stats.png"/></figure>
503
  <div id="plot-stats"></div>
@@ -507,9 +507,9 @@
507
  We then filtered with this threshold and found that the removed data had a higher amount of short lists or consisted of only document layout text ("Home", "Sign up", etc).
508
  </p>
509
 
510
- <p>We then assessed the effectiveness of these 17 newly created
511
- filters, by conducting <strong>28B tokens</strong> ablation runs on the <strong>2019-18 crawl</strong>. Out
512
- of all those runs, we identified three filters (the ones based on the histograms above) that demonstrated
513
  the most significant improvements on the aggregate score:</p>
514
  <ul>
515
  <li>Remove documents where the fraction of lines ending with punctuation ≤ 0.12
@@ -527,15 +527,16 @@
527
  </li>
528
  </ul>
529
  <ul>
530
- <li>When applying the 3 together, ~22% of tokens were removed.</li>
531
  </ul>
532
  <div class="main-plot-container">
533
  <figure><img src="assets/images/custom_filters.png"/></figure>
534
  <div id="plot-custom_filters"></div>
535
  </div>
536
- <p>These filters allowed us to further improve performance and to, notably, surpass the C4 dataset performance.</p>
 
537
  <h2>The final dataset</h2>
538
- <p>The final 🍷 FineWeb dataset comprises 15T tokens and
539
  includes the following previously mentioned steps, in order, each providing a performance boost on our group
540
  of benchmark tasks:</p>
541
  <ul>
@@ -554,35 +555,40 @@
554
  <figure><img src="assets/images/filtering_steps.png"/></figure>
555
  <div id="plot-all_filtering_steps"></div>
556
  </div>
557
- <p>We compared 🍷 FineWeb with the following datasets:</p>
 
558
  <ul>
559
  <li><a
560
- href="https://huggingface.co/datasets/tiiuae/falcon-refinedweb">RefinedWeb</a><d-cite bibtex-key="penedo2023refinedweb"></d-cite>
561
  </li>
562
  </ul>
563
  <ul>
564
- <li><a href="https://huggingface.co/datasets/allenai/c4">C4</a><d-cite bibtex-key="raffel2023exploring"></d-cite></li>
565
  </ul>
566
  <ul>
567
- <li><a href="https://huggingface.co/datasets/allenai/dolma">Dolma v1.6</a> (the
568
- CommonCrawl part) <d-cite bibtex-key="dolma"></d-cite>
569
  </li>
570
  </ul>
571
  <ul>
572
- <li><a href="https://huggingface.co/datasets/EleutherAI/pile">The Pile</a> <d-cite bibtex-key="gao2020pile"></d-cite></li>
573
  </ul>
574
  <ul>
575
  <li><a
576
- href="https://huggingface.co/datasets/cerebras/SlimPajama-627B">SlimPajama</a> <d-cite bibtex-key="cerebras2023slimpajama"></d-cite>
577
  </li>
578
  </ul>
579
  <ul>
580
  <li><a
581
- href="https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2">RedPajama2</a> <d-cite bibtex-key="together2023redpajama"></d-cite>
582
  (deduplicated)
583
  </li>
584
  </ul>
585
- <p>You will find these models on <a
 
 
 
 
586
  href="https://huggingface.co/collections/HuggingFaceFW/ablation-models-662457b0d213e8c14fe47f32">this
587
  collection</a>. We have uploaded checkpoints at every 1000 training steps. You will also find our full <a
588
  href="https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/eval_results.csv">evaluation
@@ -591,28 +597,32 @@
591
  <figure><img src="assets/images/dataset_ablations.png"/></figure>
592
  <div id="plot-dataset_ablations"></div>
593
  </div>
594
- <p>Large language models pretrained on 🍷 FineWeb, the largest publicly available clean LLM pretraining dataset, are better-performing than other open pretraining datasets.</p>
 
595
  <h2>📚 FineWeb-Edu</h2>
596
- <p>A new approach has recently emerged for filtering LLM training datasets: using synthetic data to develop classifiers for identifying educational content. This technique was used in the trainings of Llama 3<d-cite bibtex-key="llama3modelcard"></d-cite> and Phi3<d-cite bibtex-key="abdin2024phi"></d-cite> but its large-scale impact on web data filtering hasn't been fully explored or published.</p>
597
  <p>The popular Phi3 models were trained on 3.3 and 4.8 trillion tokens, with the paper<d-cite bibtex-key="abdin2024phi"></d-cite> stating:</p>
598
  <blockquote>Our training data consists of heavily filtered publicly available web data (according to the 'educational level') from various open internet sources, as well as synthetic LLM-generated data.</blockquote>
599
  <p>Similarly, the Llama 3 blog post<d-cite bibtex-key="meta2024responsible"></d-cite> notes:</p>
600
  <blockquote>We found that previous generations of Llama are good at identifying high-quality data, so we used Llama 2 to help build the text-quality classifiers that are powering Llama 3.</blockquote>
601
  <p>However, these classifiers and filtered datasets are not publicly available. To further enhance 🍷 FineWeb's quality, we developed an educational quality classifier using annotations generated by <a href="https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct">Llama-3-70B-Instruct</a> to create <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu"><strong>📚 FineWeb-Edu</strong></a>.</p>
602
- <h3>Annotation</h3>
603
- <p>We used Llama-3-70B-Instruct to annotate 500k samples from 🍷 FineWeb, scoring each for their educational quality on a scale from 0 to 5.</p>
604
- <p>We explored various prompts and found that the additive scale by Yuan et al.<d-cite bibtex-key="yuan2024self"></d-cite> worked best. This scale allows the LLM to reason about each additional point awarded, unlike the single-rating Likert scale which fits samples into predefined boxes. Then, to avoid the LLM favoring highly technical pages like arXiv abstracts and submissions, we focused on grade-school and middle-school level knowledge. By setting a threshold of 3 (on a scale of 0 to 5) during the filtering process, we were able to also retain some high-level educational pages.</p>
 
605
  <div style="text-align: center; margin: 20px 0;">
606
  <img src="https://cdn-uploads.huggingface.co/production/uploads/61c141342aac764ce1654e43/fjZQ4izIj1rx1xQnBTKKr.png" alt="Prompt for LLM annotation" style="width: 90%; max-width: 800px; height: auto;">
607
  <figcaption style="font-style: italic; margin-top: 10px;">Prompt used for Llama3 annotations of the educational score, also available <a href="https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier/blob/main/utils/prompt.txt">here</a>.</figcaption>
608
  </div>
609
- <p>We also experimented with <a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1">Mixtral-8x7B-Instruct</a> and <a href="https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1">Mixtral-8x22B-Instruct</a> and a jury of all three models<d-cite bibtex-key="verga2024replacing"></d-cite> but found that Llama3 alone gave the most reliable results.</p>
610
- <h3>Classifier Training</h3>
611
- <p>We added a classification head with a single regression output to <a href="https://huggingface.co/Snowflake/snowflake-arctic-embed-m">Snowflake-arctic-embed</a> and trained it on 450,000 Llama 3 annotations for 20 epochs with a learning rate of 3e-4, freezing the embedding and encoder layers. We saved the checkpoint with the highest F1 score on our held-out validation set of 45k samples, treating Llama 3 annotations as ground-truth. After training, we rounded the scores to integers from 0 to 5.</p>
612
- <p>We then converted the problem to a binary classification task by using a fixed threshold to determine if a file is educational. With a threshold of 3, the model achieved an F1 score of 82% on the validation set, indicating strong performance in distinguishing high-quality educational content.</p>
 
613
  <p>The classifier is available at: <a href="https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier">HuggingFaceFW/fineweb-edu-classifier</a>. The training and inference code is available on <a href="https://github.com/huggingface/cosmopedia/tree/main/classification">GitHub</a>.</p>
 
614
  <h3>Filtering and results</h3>
615
- <p>We applied the classifier to the 15T tokens of 🍷 FineWeb, a process that required 6,000 H100 GPU hours. We investigated the impact of using different thresholds for the filtering and found that threshold 3 gave the best overall results. Although using a threshold higher than 3 improves performance on knowledge and reasoning intensive benchmarks, it significantly degrades performance on HellaSwag and PIQA. The plot below shows the performance of each threshold compared to FineWeb on six different benchmarks; it uses a 1.82B model trained on 8B tokens.</p>
616
  <div class="main-plot-container">
617
  <figure>
618
  <img src="assets/images/edu-8k.png">
 
88
  <p><strong>TLDR:</strong> This blog covers a discussion on processing and evaluating data quality at scale, the 🍷 FineWeb
89
  recipe (listing and explaining all of our design choices), and the process followed to create its 📚 FineWeb-Edu subset.</p>
90
 
91
+ <h2>What's web data</h2>
92
+ <h3>Finding the data</h3>
93
  <p>A common question regarding web datasets used
94
  to train LLMs is “where do they even get all that data?”. There are generally two options:</p>
95
  <ul>
 
222
  <div id="plot-wet_comparison"></div>
223
  </div>
224
 
225
+ <h3>First steps of filtering</h3>
226
  <p>Filtering is an important part of the curation process. It consists of
227
  removing part of the data (which can mean removing words, lines, or even full documents) that lowers the performance of the model and is thus
228
  deemed to be “lower quality” in our eval-driven process of dataset crafting.</p>
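  <p>As an illustration, a document-level filter of this kind can be sketched in a few lines of Python. The rule and thresholds below are purely illustrative and are <em>not</em> the ones used for 🍷 FineWeb:</p>
  <pre><code class="language-python">
def keep_document(text: str, min_words: int = 50, max_symbol_ratio: float = 0.1) -> bool:
    """Toy quality filter: drop very short documents and documents dominated by
    non-alphanumeric symbols (illustrative rules and thresholds only)."""
    words = text.split()
    if len(words) < min_words:
        return False
    n_chars = max(len(text), 1)
    n_symbols = sum(1 for c in text if not (c.isalnum() or c.isspace() or c in ".,;:!?'\"-()"))
    return n_symbols / n_chars <= max_symbol_ratio
  </code></pre>
  <p>In the actual pipeline, such document-level predicates are applied to every extracted document of every dump before any further processing.</p>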
 
245
  </ul>
246
  <p>After applying this filtering to each of the text
247
  extracted dumps (there are currently 96 dumps) we obtained roughly 36 trillion tokens of data<d-footnote>As everywhere in this report: this is the number of tokens when tokenized with the <code>gpt2</code> tokenizer</d-footnote>.</p>
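  <p>For reference, token counts throughout this report are computed with the <code>gpt2</code> tokenizer, e.g.:</p>
  <pre><code class="language-python">
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
n_tokens = len(tokenizer("some extracted document text")["input_ids"])
  </code></pre>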
248
+ <h3>Deduplicating the data</h3>
249
  <p>Deduplication is one of the most important steps when creating large web datasets for LLM pretraining. Methods to deduplicate datasets attempt to identify and remove redundant/repeated data from the dataset. </p>
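  <p>To make this concrete, here is a minimal near-deduplication sketch based on MinHash, the technique used for 🍷 FineWeb's deduplication, written here with the <code>datasketch</code> library rather than our actual pipeline code; the shingle size, number of permutations and similarity threshold are illustrative:</p>
  <pre><code class="language-python">
from datasketch import MinHash, MinHashLSH

def minhash_signature(text: str, num_perm: int = 128, ngram: int = 5) -> MinHash:
    """Build a MinHash signature from the word n-grams (shingles) of a document."""
    words = text.lower().split()
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(words) - ngram + 1, 1)):
        m.update(" ".join(words[i:i + ngram]).encode("utf-8"))
    return m

docs = {
    "doc_a": "the quick brown fox jumps over the lazy dog near the river bank",
    "doc_b": "the quick brown fox jumps over the lazy dog near the river shore",
    "doc_c": "a completely unrelated page about cooking pasta at home tonight",
}

# Documents whose estimated Jaccard similarity exceeds the threshold share LSH
# buckets and are flagged as near-duplicates of an already indexed document.
lsh = MinHashLSH(threshold=0.7, num_perm=128)
kept = []
for name, text in docs.items():
    sig = minhash_signature(text)
    if lsh.query(sig):     # a near-duplicate is already in the index
        continue           # drop this document
    lsh.insert(name, sig)
    kept.append(name)

print(kept)  # typically ['doc_a', 'doc_c']: doc_b is dropped as a near-duplicate of doc_a
  </code></pre>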
250
  <h4>Why deduplicate?</h4>
251
  <p>The web has many aggregators, mirrors, templated pages or
 
431
  <div id="plot-dedup_attempts"></div>
432
  </div>
433
 
434
+ <h3>Filtering the data even more for quality</h3>
435
  <p>By this point we had reached the same performance as the previous work we attempted to reproduce and extend:
436
  RefinedWeb, using our base filtering and independent MinHash. Still, on our aggregate of tasks, another heavily filtered dataset, the C4 dataset<d-cite bibtex-key="raffel2023exploring"></d-cite>, showed stronger performance on some benchmarks of our evaluation suite.</p>
437
  <p>We therefore set out to find new filtering steps that
 
496
  metric (nb. of characters in duplicated lines / nb. characters), roughly doubled from the independent dedup
497
  (0.0053 for 2015-22 and 0.0058 for 2013-48), to the global dedup (0.011 for 2015-22 and 0.01 for 2013-48),
498
  indicating that the latter had higher inter-document repetition.</p>
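  <p>For reference, this duplicated-lines metric (number of characters in duplicated lines divided by the total number of characters) can be computed roughly as in the sketch below; counting every occurrence of a repeated line is an assumption here, and the exact convention in our pipeline may differ:</p>
  <pre><code class="language-python">
from collections import Counter

def dup_line_char_fraction(text: str) -> float:
    """Nb. of characters in duplicated lines / nb. of characters.
    Assumed convention: every occurrence of a repeated line counts as duplicated."""
    lines = [line for line in text.splitlines() if line.strip()]
    total_chars = sum(len(line) for line in lines)
    if total_chars == 0:
        return 0.0
    counts = Counter(lines)
    dup_chars = sum(len(line) * n for line, n in counts.items() if n > 1)
    return dup_chars / total_chars
  </code></pre>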
499
+ <p>Following the process listed above for these datasets yielded <strong>seventeen</strong> candidate
500
+ metric-threshold pairs. In the image below, you can see three of these histograms:</p>
501
  <div class="main-plot-container">
502
  <figure><img src="assets/images/stats.png"/></figure>
503
  <div id="plot-stats"></div>
 
507
  We then filtered with this threshold and found that the removed data had a higher amount of short lists or consisted of only document layout text ("Home", "Sign up", etc).
508
  </p>
509
 
510
+ <p>We then assessed the effectiveness of these seventeen newly created
511
+ filters by conducting several of our <em>28 billion token</em> ablation runs on the <em>2019-18 crawl</em>. Out
512
+ of all those runs, we identified <strong>three</strong> filters (the ones based on the histograms above) that demonstrated
513
  the most significant improvements on the aggregate score:</p>
514
  <ul>
515
  <li>Remove documents where the fraction of lines ending with punctuation ≤ 0.12
 
527
  </li>
528
  </ul>
529
  <ul>
530
+ <li>When applying the three together, ~22% of tokens were removed.</li>
531
  </ul>
532
  <div class="main-plot-container">
533
  <figure><img src="assets/images/custom_filters.png"/></figure>
534
  <div id="plot-custom_filters"></div>
535
  </div>
536
+ <p>These filters allowed us to further improve performance and, notably, to surpass the performance of the C4 dataset while providing a much larger dataset at the same time.</p>
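  <p>As a concrete illustration, the punctuation heuristic above can be implemented along the following lines; only the 0.12 threshold comes from our ablations, while the list of terminal punctuation marks is an illustrative choice:</p>
  <pre><code class="language-python">
TERMINAL_PUNCTUATION = (".", "!", "?", '"', "”")  # illustrative set of end-of-line marks

def punct_line_fraction(text: str) -> float:
    """Fraction of non-empty lines ending with a terminal punctuation mark."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    return sum(1 for line in lines if line.endswith(TERMINAL_PUNCTUATION)) / len(lines)

def keep_document(text: str, threshold: float = 0.12) -> bool:
    # "Remove documents where the fraction of lines ending with punctuation <= 0.12"
    return punct_line_fraction(text) > threshold
  </code></pre>
  <p>The two other winning filters follow the same pattern: each computes a single document-level statistic and compares it to the threshold selected from the histograms above.</p>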
537
+
538
  <h2>The final dataset</h2>
539
+ <p>The final <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">🍷 FineWeb</a> dataset comprises 15T tokens and
540
  includes the following previously mentioned steps, in order, each providing a performance boost on our group
541
  of benchmark tasks:</p>
542
  <ul>
 
555
  <figure><img src="assets/images/filtering_steps.png"/></figure>
556
  <div id="plot-all_filtering_steps"></div>
557
  </div>
558
+ <h3>Comparisons with other web-scale datasets</h3>
559
+ <p>We compared <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">🍷 FineWeb</a> with the following datasets, which are usually considered to be the highest quality openly accessible web-scale datasets (for each, we also indicate the approximate number of tokens in its public version):</p>
560
  <ul>
561
  <li><a
562
+ href="https://huggingface.co/datasets/tiiuae/falcon-refinedweb">RefinedWeb</a> (500B tokens)<d-cite bibtex-key="penedo2023refinedweb"></d-cite>
563
  </li>
564
  </ul>
565
  <ul>
566
+ <li><a href="https://huggingface.co/datasets/allenai/c4">C4</a> (172B tokens)<d-cite bibtex-key="raffel2023exploring"></d-cite></li>
567
  </ul>
568
  <ul>
569
+ <li><a href="https://huggingface.co/datasets/allenai/dolma">Dolma v1.6</a> (3T tokens) (the
570
+ CommonCrawl part) <d-cite bibtex-key="dolma"></d-cite> <d-footnote>There is a newer version of Dolma, v1.7, which is smaller</d-footnote>
571
  </li>
572
  </ul>
573
  <ul>
574
+ <li><a href="https://huggingface.co/datasets/EleutherAI/pile">The Pile</a> (340B tokens) <d-cite bibtex-key="gao2020pile"></d-cite></li>
575
  </ul>
576
  <ul>
577
  <li><a
578
+ href="https://huggingface.co/datasets/cerebras/SlimPajama-627B">SlimPajama</a> (627B tokens) <d-cite bibtex-key="cerebras2023slimpajama"></d-cite>
579
  </li>
580
  </ul>
581
  <ul>
582
  <li><a
583
+ href="https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2">RedPajama2</a> (20T tokens) <d-cite bibtex-key="together2023redpajama"></d-cite>
584
  (deduplicated)
585
  </li>
586
  </ul>
587
+ <ul>
588
+ <li>and our new <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">🍷 FineWeb</a> (15T tokens; this report)
589
+ </li>
590
+ </ul>
591
+ <p>You will find the ablation models (each trained on 350B tokens) openly accessible and gathered in <a
592
  href="https://huggingface.co/collections/HuggingFaceFW/ablation-models-662457b0d213e8c14fe47f32">this
593
  collection</a>. We have uploaded checkpoints at every 1000 training steps. You will also find our full <a
594
  href="https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/eval_results.csv">evaluation
 
597
  <figure><img src="assets/images/dataset_ablations.png"/></figure>
598
  <div id="plot-dataset_ablations"></div>
599
  </div>
600
+ <p>🍷 FineWeb is thus, to the best of our knowledge, the open dataset leading to the highest current model performances, while allowing training on several trillion openly accessible unique tokens.</p>
601
+
602
  <h2>📚 FineWeb-Edu</h2>
603
+ <p><a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu">📚 FineWeb-Edu</a> is an additional developement of FineWeb that we are excited to introduce in this tech report and openly release. FineWeb-Edu is based on a new approach that recently emerged for filtering LLM training datasets: using synthetic data to develop classifiers for identifying educational content. This technique was notably used in the trainings of Llama 3<d-cite bibtex-key="llama3modelcard"></d-cite> and Phi3<d-cite bibtex-key="abdin2024phi"></d-cite> but its large-scale impact on web data filtering hasn't been really published or fully explored in public yet in our opinion.</p>
604
  <p>The popular Phi3 models were trained on 3.3 and 4.8 trillion tokens, with the paper<d-cite bibtex-key="abdin2024phi"></d-cite> stating:</p>
605
  <blockquote>Our training data consists of heavily filtered publicly available web data (according to the 'educational level') from various open internet sources, as well as synthetic LLM-generated data.</blockquote>
606
  <p>Similarly, the Llama 3 blog post<d-cite bibtex-key="meta2024responsible"></d-cite> notes:</p>
607
  <blockquote>We found that previous generations of Llama are good at identifying high-quality data, so we used Llama 2 to help build the text-quality classifiers that are powering Llama 3.</blockquote>
608
  <p>However, these classifiers and filtered datasets are not publicly available. To further enhance 🍷 FineWeb's quality, we developed an educational quality classifier using annotations generated by <a href="https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct">Llama-3-70B-Instruct</a> to create <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu"><strong>📚 FineWeb-Edu</strong></a>.</p>
609
+
610
+ <h3>Annotating for educational quality at scale</h3>
611
+ <p>We used <a href="https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct">Llama-3-70B-Instruct</a> to annotate 500k samples from 🍷 FineWeb, scoring each for its educational quality on a scale from 0 to 5.</p>
612
+ <p>We explored various prompt formats to automatically extract an educational score using an LLM and found that the additive scale by Yuan et al.<d-cite bibtex-key="yuan2024self"></d-cite> worked best. This scale allows the LLM to reason about each additional point awarded, unlike the single-rating Likert scale which fits samples into predefined boxes. Then, to avoid the LLM favoring highly technical pages like arXiv abstracts and submissions, we focused on grade-school and middle-school level knowledge. By setting a threshold of 3 (on a scale of 0 to 5) during the filtering process, we were able to also retain some high-level educational pages.</p>
613
  <div style="text-align: center; margin: 20px 0;">
614
  <img src="https://cdn-uploads.huggingface.co/production/uploads/61c141342aac764ce1654e43/fjZQ4izIj1rx1xQnBTKKr.png" alt="Prompt for LLM annotation" style="width: 90%; max-width: 800px; height: auto;">
615
  <figcaption style="font-style: italic; margin-top: 10px;">Prompt used for Llama3 annotations of the educational score, also available <a href="https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier/blob/main/utils/prompt.txt">here</a>.</figcaption>
616
  </div>
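  <p>At scale, collecting these annotations boils down to sending every sample through the prompt above and parsing the score out of the model's answer. A minimal sketch, assuming an inference endpoint serving Llama-3-70B-Instruct reachable through <code>huggingface_hub</code>, that the prompt linked above is saved locally as <code>prompt.txt</code> with a <code>{text}</code> placeholder, and that the answer ends with a line of the form "Educational score: N":</p>
  <pre><code class="language-python">
import re
from huggingface_hub import InferenceClient

client = InferenceClient("meta-llama/Meta-Llama-3-70B-Instruct")
PROMPT_TEMPLATE = open("prompt.txt").read()  # the additive-scale prompt shown above

def educational_score(sample_text: str) -> int:
    """Grade one FineWeb sample with the LLM and parse its 0-5 score."""
    answer = client.chat_completion(
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.replace("{text}", sample_text)}],
        max_tokens=512,
    ).choices[0].message.content
    match = re.search(r"Educational score:\s*([0-5])", answer)
    return int(match.group(1)) if match else 0
  </code></pre>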
617
+ <p>In terms of which open-weight model to use for annotating the data, we experimented with several models, including <a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1">Mixtral-8x7B-Instruct</a>, <a href="https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1">Mixtral-8x22B-Instruct</a> and <a href="https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct">Llama-3-70B-Instruct</a>, as well as a jury gathering these three models<d-cite bibtex-key="verga2024replacing"></d-cite>. In our experiments, we found that using Llama3 alone gave the most reliable results.</p>
618
+
619
+ <h3>Training a classifier</h3>
620
+ <p>To scale our annotation to the trillions of tokens in FineWeb, we trained a classifier on 450k of the annotations produced by Llama-3-70B-Instruct. The model we used was a <a href="https://huggingface.co/Snowflake/snowflake-arctic-embed-m">Snowflake-arctic-embed</a> embedding model with a classification head with a single regression output added on top of it. We trained this model on the 450,000 Llama 3 annotations for 20 epochs with a learning rate of 3e-4, freezing the embedding and encoder layers. We saved the checkpoint with the highest F1 score on our held-out validation set of 45k samples, treating Llama 3 annotations as ground-truth. After training, we rounded the scores to integers from <code>0</code> to <code>5</code>.</p>
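  <p>A minimal sketch of this training setup with the <code>transformers</code> <code>Trainer</code> (data loading, batching, checkpoint selection by F1 and other details are simplified here; the full training code is linked below):</p>
  <pre><code class="language-python">
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Snowflake-arctic-embed encoder with a single-output regression head on top
model = AutoModelForSequenceClassification.from_pretrained(
    "Snowflake/snowflake-arctic-embed-m", num_labels=1, problem_type="regression")
tokenizer = AutoTokenizer.from_pretrained("Snowflake/snowflake-arctic-embed-m")
for param in model.base_model.parameters():  # freeze the embedding and encoder layers
    param.requires_grad = False

# annotations: {"text": ..., "score": ...} pairs produced by Llama-3-70B-Instruct
annotations = [{"text": "An introduction to photosynthesis for middle school ...", "score": 3.0}]

def tokenize(batch):
    enc = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)
    enc["labels"] = [float(s) for s in batch["score"]]
    return enc

train_ds = Dataset.from_list(annotations).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="edu-scorer", learning_rate=3e-4, num_train_epochs=20),
    train_dataset=train_ds,
)
trainer.train()
  </code></pre>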
621
+ <p>We then converted the problem to a binary classification task by using a fixed threshold to determine if a file is educational. With a threshold of <code>3</code>, the model achieved an F1 score of 82% on the validation set, indicating strong performance in distinguishing high-quality educational content.</p>
622
  <p>The classifier is available at: <a href="https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier">HuggingFaceFW/fineweb-edu-classifier</a>. The training and inference code is available on <a href="https://github.com/huggingface/cosmopedia/tree/main/classification">GitHub</a>.</p>
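  <p>For reference, scoring a document with the released classifier and applying the threshold of <code>3</code> discussed above looks roughly like this (assuming the standard <code>transformers</code> sequence-classification interface):</p>
  <pre><code class="language-python">
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceFW/fineweb-edu-classifier")
model = AutoModelForSequenceClassification.from_pretrained("HuggingFaceFW/fineweb-edu-classifier")

def edu_score(text: str) -> float:
    """Raw regression output of the classifier, roughly in the 0-5 range."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="longest")
    with torch.no_grad():
        return model(**inputs).logits.squeeze(-1).item()

def is_educational(text: str, threshold: int = 3) -> bool:
    # keep documents whose rounded score reaches the threshold used for 📚 FineWeb-Edu
    return int(round(edu_score(text))) >= threshold
  </code></pre>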
623
+
624
  <h3>Filtering and results</h3>
625
+ <p>We applied the classifier to the 15T tokens of 🍷 FineWeb, a process that required 6,000 H100 GPU hours. We investigated the impact of using different thresholds for the filtering and found that using a threshold of <code>3</code> gave the best overall results. Although using a threshold higher than <code>3</code> improves performance on knowledge- and reasoning-intensive benchmarks, it significantly degrades performance on HellaSwag and PIQA. The plot below shows the performance of each threshold compared to FineWeb on six different benchmarks; it uses a 1.82B model trained on 8B tokens.</p>
626
  <div class="main-plot-container">
627
  <figure>
628
  <img src="assets/images/edu-8k.png">
src/index.html CHANGED
(same changes as in dist/index.html above)