update
- dist/index.html +42 -32
- src/index.html +42 -32
dist/index.html
CHANGED
@@ -88,8 +88,8 @@
 <p><strong>TLDR:</strong> This blog covers a discussion on processing and evaluating data quality at scale, the 🍷 FineWeb
 recipe (listing and explaining all of our design choices), and the process followed to create its 📚 FineWeb-Edu subset.</p>

-<h2>
-<h3>
+<h2>What's web data</h2>
+<h3>Finding the data</h3>
 <p>A common question regarding web datasets used
 to train LLMs is “where do they even get all that data?”. There are generally two options:</p>
 <ul>
@@ -222,7 +222,7 @@
 <div id="plot-wet_comparison"></div>
 </div>

-<h3>
+<h3>First steps of filtering</h3>
 <p>Filtering is an important part of the curation process. It consists of
 removing part of the data (which can mean removing words, lines, or even full documents) that lowers the performance of the model and is thus
 deemed to be “lower quality” in our eval-driven process of dataset crafting.</p>
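To make the eval-driven filtering loop described above concrete, here is a minimal Python sketch of applying a set of document-level quality predicates to a stream of extracted documents. The predicates and thresholds below are placeholders for illustration, not the actual 🍷 FineWeb filters (the real pipeline is built with the datatrove processing library):

```python
from typing import Callable, Iterable, Iterator

# Placeholder document-level quality predicates (illustrative only, not the 🍷 FineWeb filters).
def min_length(text: str, min_words: int = 50) -> bool:
    """Drop very short documents."""
    return len(text.split()) >= min_words

def mostly_alphabetic(text: str, min_alpha_frac: float = 0.7) -> bool:
    """Drop documents dominated by symbols, markup remnants or other non-text content."""
    if not text:
        return False
    alpha = sum(c.isalpha() or c.isspace() for c in text)
    return alpha / len(text) >= min_alpha_frac

FILTERS: list[Callable[[str], bool]] = [min_length, mostly_alphabetic]

def filter_documents(docs: Iterable[str]) -> Iterator[str]:
    """Yield only the documents that pass every quality predicate."""
    for text in docs:
        if all(check(text) for check in FILTERS):
            yield text
```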
@@ -245,7 +245,7 @@
 </ul>
 <p>After applying this filtering to each of the text-extracted
 dumps (there are currently 96 dumps), we obtained roughly 36 trillion tokens of data<d-footnote>As everywhere in this report: this is the number of tokens when tokenized with the <code>gpt2</code> tokenizer</d-footnote>.</p>
-<h3>
+<h3>Deduplicating the data</h3>
 <p>Deduplication is one of the most important steps when creating large web datasets for LLM pretraining. Methods to deduplicate datasets attempt to identify and remove redundant/repeated data from the dataset.</p>
 <h4>Why deduplicate?</h4>
 <p>The web has many aggregators, mirrors, templated pages or
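The deduplication runs discussed in the report rely on MinHash. As an illustration of the underlying idea only (not the actual implementation; the shingle size, number of permutations and similarity threshold are assumptions), here is a small sketch using the datasketch library to drop near-duplicate documents with MinHash + LSH:

```python
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128      # number of hash permutations (assumption)
THRESHOLD = 0.8     # Jaccard similarity threshold for "near-duplicate" (assumption)
SHINGLE_SIZE = 5    # 5-gram word shingles (assumption)

def minhash(text: str) -> MinHash:
    """Build a MinHash signature from word shingles of a document."""
    words = text.lower().split()
    m = MinHash(num_perm=NUM_PERM)
    for i in range(max(len(words) - SHINGLE_SIZE + 1, 1)):
        shingle = " ".join(words[i:i + SHINGLE_SIZE])
        m.update(shingle.encode("utf-8"))
    return m

def deduplicate(docs: dict[str, str]) -> list[str]:
    """Return the ids of documents kept after near-duplicate removal."""
    lsh = MinHashLSH(threshold=THRESHOLD, num_perm=NUM_PERM)
    kept = []
    for doc_id, text in docs.items():
        m = minhash(text)
        if lsh.query(m):  # a sufficiently similar document was already kept
            continue
        lsh.insert(doc_id, m)
        kept.append(doc_id)
    return kept
```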
@@ -431,7 +431,7 @@
 <div id="plot-dedup_attempts"></div>
 </div>

-<h3>
+<h3>Filtering the data even more for quality</h3>
 <p>By this point we had reached the same performance as the previous work we attempted to reproduce and extend,
 RefinedWeb, using our base filtering and independent MinHash. Still, on our aggregate of tasks, another heavily filtered dataset, the C4 dataset<d-cite bibtex-key="raffel2023exploring"></d-cite>, showed stronger performance on some benchmarks of our evaluation suite.</p>
 <p>We therefore set out to find new filtering steps that
@@ -496,8 +496,8 @@
 metric (nb. of characters in duplicated lines / nb. characters) roughly doubled from the independent dedup
 (0.0053 for 2015-22 and 0.0058 for 2013-48) to the global dedup (0.011 for 2015-22 and 0.01 for 2013-48),
 indicating that the latter had higher inter-document repetition.</p>
-<p>Following the process listed above for these datasets yielded
-metric-threshold pairs. In the image below, you can see
+<p>Following the process listed above for these datasets yielded <strong>seventeen</strong> candidate
+metric-threshold pairs. In the image below, you can see three of these histograms:</p>
 <div class="main-plot-container">
 <figure><img src="assets/images/stats.png"/></figure>
 <div id="plot-stats"></div>
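As a concrete example of one of these metrics, the sketch below computes the "nb. of characters in duplicated lines / nb. characters" value for a single document; collecting it over two datasets lets you draw histograms like the ones referenced above. The exact line normalization used in the real pipeline is an assumption here:

```python
from collections import Counter

def duplicated_line_char_fraction(text: str) -> float:
    """Fraction of characters belonging to lines that appear more than once in the document."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    counts = Counter(lines)
    total_chars = sum(len(line) for line in lines)
    dup_chars = sum(len(line) for line in lines if counts[line] > 1)
    return dup_chars / total_chars

# Collect values over two datasets and compare their histograms, e.g. with numpy:
#   values = [duplicated_line_char_fraction(doc) for doc in docs]
#   hist, edges = numpy.histogram(values, bins=50, range=(0.0, 1.0))
```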
@@ -507,9 +507,9 @@
 We then filtered with this threshold and found that the removed data had a larger share of short lists or consisted only of document layout text ("Home", "Sign up", etc.).
 </p>

-<p>We then assessed the effectiveness of these
-filters, by conducting <
-of all those runs, we identified three filters (the ones based on the histograms above) that demonstrated
+<p>We then assessed the effectiveness of these seventeen newly created
+filters by conducting several of our <em>28 billion token</em> ablation runs on the <em>2019-18 crawl</em>. Out
+of all those runs, we identified <strong>three</strong> filters (the ones based on the histograms above) that demonstrated
 the most significant improvements on the aggregate score:</p>
 <ul>
 <li>Remove documents where the fraction of lines ending with punctuation ≤ 0.12
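For illustration, here is how the first of these heuristics can be implemented; what exactly counts as a line and as terminal punctuation in the actual pipeline is an assumption on our side:

```python
TERMINAL_PUNCTUATION = (".", "!", "?", '"', "”")  # assumed set of line-ending punctuation

def punctuation_line_fraction(text: str) -> float:
    """Fraction of non-empty lines that end with terminal punctuation."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    ending = sum(1 for line in lines if line.endswith(TERMINAL_PUNCTUATION))
    return ending / len(lines)

def keep_document(text: str, threshold: float = 0.12) -> bool:
    # Remove documents where the fraction of lines ending with punctuation is <= 0.12
    return punctuation_line_fraction(text) > threshold
```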
@@ -527,15 +527,16 @@
 </li>
 </ul>
 <ul>
-<li>When applying the
+<li>When applying the three filters together, ~22% of tokens were removed.</li>
 </ul>
 <div class="main-plot-container">
 <figure><img src="assets/images/custom_filters.png"/></figure>
 <div id="plot-custom_filters"></div>
 </div>
-<p>These filters allowed us to further improve performance and to, notably, surpass the C4 dataset performance.</p>
+<p>These filters allowed us to further improve performance and, notably, to surpass the C4 dataset performance while also providing a much larger dataset.</p>
+
 <h2>The final dataset</h2>
-<p>The final
+<p>The final <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">🍷 FineWeb</a> dataset comprises 15T tokens and
 includes the following previously mentioned steps, in order, each providing a performance boost on our group
 of benchmark tasks:</p>
 <ul>
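For readers who want to look at the released data, here is a minimal example of streaming 🍷 FineWeb with the 🤗 datasets library; the "sample-10BT" config name is an assumption about one of the published sampled subsets, so adjust it to the dump or sample you need:

```python
from datasets import load_dataset

# Stream a small published sample of 🍷 FineWeb instead of downloading all 15T tokens.
fw = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train", streaming=True)

for i, doc in enumerate(fw):
    # Each record carries the extracted text plus metadata such as the source dump.
    print(doc["text"][:200].replace("\n", " "), "...")
    if i == 2:
        break
```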
@@ -554,35 +555,40 @@
 <figure><img src="assets/images/filtering_steps.png"/></figure>
 <div id="plot-all_filtering_steps"></div>
 </div>
-<
+<h3>Comparisons with other web-scale datasets</h3>
+<p>We compared <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">🍷 FineWeb</a> with the following datasets, which are usually considered the highest-quality openly accessible web-scale datasets (for each, we also indicate the approximate number of tokens in the public version of the dataset):</p>
 <ul>
 <li><a
-href="https://huggingface.co/datasets/tiiuae/falcon-refinedweb">RefinedWeb</a
+href="https://huggingface.co/datasets/tiiuae/falcon-refinedweb">RefinedWeb</a> (500B tokens) <d-cite bibtex-key="penedo2023refinedweb"></d-cite>
 </li>
 </ul>
 <ul>
-<li><a href="https://huggingface.co/datasets/allenai/c4">C4</a
+<li><a href="https://huggingface.co/datasets/allenai/c4">C4</a> (172B tokens) <d-cite bibtex-key="raffel2023exploring"></d-cite></li>
 </ul>
 <ul>
-<li><a href="https://huggingface.co/datasets/allenai/dolma">Dolma v1.6</a> (the
-CommonCrawl part) <d-cite bibtex-key="dolma"></d-cite>
+<li><a href="https://huggingface.co/datasets/allenai/dolma">Dolma v1.6</a> (3T tokens, the
+CommonCrawl part) <d-cite bibtex-key="dolma"></d-cite> <d-footnote>There is a newer version of Dolma, v1.7, which is smaller.</d-footnote>
 </li>
 </ul>
 <ul>
-<li><a href="https://huggingface.co/datasets/EleutherAI/pile">The Pile</a> <d-cite bibtex-key="gao2020pile"></d-cite></li>
+<li><a href="https://huggingface.co/datasets/EleutherAI/pile">The Pile</a> (340B tokens) <d-cite bibtex-key="gao2020pile"></d-cite></li>
 </ul>
 <ul>
 <li><a
-href="https://huggingface.co/datasets/cerebras/SlimPajama-627B">SlimPajama</a> <d-cite bibtex-key="cerebras2023slimpajama"></d-cite>
+href="https://huggingface.co/datasets/cerebras/SlimPajama-627B">SlimPajama</a> (627B tokens) <d-cite bibtex-key="cerebras2023slimpajama"></d-cite>
 </li>
 </ul>
 <ul>
 <li><a
-href="https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2">RedPajama2</a> <d-cite bibtex-key="together2023redpajama"></d-cite>
+href="https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2">RedPajama2</a> (20T tokens) <d-cite bibtex-key="together2023redpajama"></d-cite>
 (deduplicated)
 </li>
 </ul>
-<
+<ul>
+<li>and our new <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">🍷 FineWeb</a> (15T tokens) (this report)
+</li>
+</ul>
+<p>You will find the ablation models trained on 350B tokens openly accessible and gathered in <a
 href="https://huggingface.co/collections/HuggingFaceFW/ablation-models-662457b0d213e8c14fe47f32">this
 collection</a>. We have uploaded checkpoints at every 1000 training steps. You will also find our full <a
 href="https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/eval_results.csv">evaluation
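If you want to re-plot these results yourself, here is a small, hedged example of fetching that evaluation CSV (assuming the file sits at the path linked above inside the dataset repository):

```python
import pandas as pd
from huggingface_hub import hf_hub_download

# Download eval_results.csv from the FineWeb dataset repo and load it with pandas.
path = hf_hub_download(
    repo_id="HuggingFaceFW/fineweb",
    filename="eval_results.csv",
    repo_type="dataset",
)
df = pd.read_csv(path)
print(df.head())
```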
@@ -591,28 +597,32 @@
 <figure><img src="assets/images/dataset_ablations.png"/></figure>
 <div id="plot-dataset_ablations"></div>
 </div>
-<p
+<p>🍷 FineWeb is thus, to the best of our knowledge, the dataset leading to the highest model performance to date while allowing training on several trillion openly accessible unique tokens.</p>
+
 <h2>📚 FineWeb-Edu</h2>
-<p>
+<p><a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu">📚 FineWeb-Edu</a> is an additional development of FineWeb that we are excited to introduce in this tech report and to openly release. FineWeb-Edu is based on an approach that recently emerged for filtering LLM training datasets: using synthetic data to develop classifiers for identifying educational content. This technique was notably used in the training of Llama 3<d-cite bibtex-key="llama3modelcard"></d-cite> and Phi3<d-cite bibtex-key="abdin2024phi"></d-cite>, but its large-scale impact on web data filtering has, in our opinion, not yet been fully explored or published.</p>
 <p>The popular Phi3 models were trained on 3.3 and 4.8 trillion tokens, with the paper<d-cite bibtex-key="abdin2024phi"></d-cite> stating:</p>
 <blockquote>Our training data consists of heavily filtered publicly available web data (according to the 'educational level') from various open internet sources, as well as synthetic LLM-generated data.</blockquote>
 <p>Similarly, the Llama 3 blog post<d-cite bibtex-key="meta2024responsible"></d-cite> notes:</p>
 <blockquote>We found that previous generations of Llama are good at identifying high-quality data, so we used Llama 2 to help build the text-quality classifiers that are powering Llama 3.</blockquote>
 <p>However, these classifiers and filtered datasets are not publicly available. To further enhance 🍷 FineWeb's quality, we developed an educational quality classifier using annotations generated by <a href="https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct">Llama-3-70B-Instruct</a> to create <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu"><strong>📚 FineWeb-Edu</strong></a>.</p>
-
-<
-<p>We
+
+<h3>Annotating for educational quality at scale</h3>
+<p>We used <a href="https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct">Llama-3-70B-Instruct</a> to annotate 500k samples from 🍷 FineWeb, scoring each for its educational quality on a scale from 0 to 5.</p>
+<p>We explored various prompt formats to automatically extract an educational score using an LLM and found that the additive scale by Yuan et al.<d-cite bibtex-key="yuan2024self"></d-cite> worked best. This scale allows the LLM to reason about each additional point awarded, unlike the single-rating Likert scale which fits samples into predefined boxes. Then, to avoid the LLM favoring highly technical pages like arXiv abstracts and submissions, we focused on grade-school and middle-school level knowledge. By setting a threshold of 3 (on a scale of 0 to 5) during the filtering process, we were able to also retain some high-level educational pages.</p>
 <div style="text-align: center; margin: 20px 0;">
 <img src="https://cdn-uploads.huggingface.co/production/uploads/61c141342aac764ce1654e43/fjZQ4izIj1rx1xQnBTKKr.png" alt="Prompt for LLM annotation" style="width: 90%; max-width: 800px; height: auto;">
 <figcaption style="font-style: italic; margin-top: 10px;">Prompt used for the Llama3 annotations of the educational score, also available <a href="https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier/blob/main/utils/prompt.txt">here</a>.</figcaption>
 </div>
-<p>
-
-<
-<p>
+<p>As for which open-weight model to use for annotating the data, we experimented with several models, including <a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1">Mixtral-8x7B-Instruct</a>, <a href="https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1">Mixtral-8x22B-Instruct</a> and <a href="https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct">Llama-3-70B-Instruct</a>, as well as a jury gathering these three models<d-cite bibtex-key="verga2024replacing"></d-cite>. In our experiments, we found that using Llama3 alone gave the most reliable results.</p>
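Below is a sketch of how such an annotation call can look with the huggingface_hub inference client. How the extract is spliced into the released prompt and how the final score line is formatted are assumptions here; adapt them to the actual prompt linked in the caption above:

```python
import re
from huggingface_hub import InferenceClient, hf_hub_download

# Grading prompt released with the classifier (path taken from the link in the caption above).
prompt = open(hf_hub_download("HuggingFaceFW/fineweb-edu-classifier", "utils/prompt.txt")).read()

client = InferenceClient("meta-llama/Meta-Llama-3-70B-Instruct")  # gated model, requires an HF token

def educational_score(extract: str) -> int | None:
    """Grade one web extract on the 0-5 additive scale."""
    completion = client.chat_completion(
        messages=[{"role": "user", "content": f"{prompt}\n\n{extract[:3000]}"}],  # splicing is an assumption
        max_tokens=400,
        temperature=0.0,
    )
    answer = completion.choices[0].message.content
    # Assumes the model ends its answer with a line like "Educational score: <total points>".
    match = re.search(r"Educational score:\s*([0-5])", answer)
    return int(match.group(1)) if match else None
```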
+
+<h3>Training a classifier</h3>
+<p>To scale our annotation to the trillions of tokens in FineWeb, we trained a classifier on 450,000 of these Llama3-70B annotations. The model we used was a <a href="https://huggingface.co/Snowflake/snowflake-arctic-embed-m">Snowflake-arctic-embed</a> embedding model with a classification head with a single regression output on top of it. We trained it for 20 epochs with a learning rate of 3e-4, freezing the embedding and encoder layers, and saved the checkpoint with the highest F1 score on our held-out validation set of 45k samples, treating the Llama 3 annotations as ground truth. After training, we rounded the scores to integers from <code>0</code> to <code>5</code>.</p>
+<p>We then converted the problem to a binary classification task by using a fixed threshold to determine whether a document is educational. With a threshold of <code>3</code>, the model achieved an F1 score of 82% on the validation set, indicating strong performance in distinguishing high-quality educational content.</p>
 <p>The classifier is available at: <a href="https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier">HuggingFaceFW/fineweb-edu-classifier</a>. The training and inference code is available on <a href="https://github.com/huggingface/cosmopedia/tree/main/classification">GitHub</a>.</p>
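The released training and inference code is linked just above; the following is only a compact sketch of the described setup, with assumed data layout (a CSV with "text" and "score" columns), batch size, sequence length and checkpoint-selection details:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

BASE = "Snowflake/snowflake-arctic-embed-m"

tokenizer = AutoTokenizer.from_pretrained(BASE)
# A single regression output on top of the embedding model, predicting the 0-5 score.
model = AutoModelForSequenceClassification.from_pretrained(
    BASE, num_labels=1, problem_type="regression"
)

# Freeze the embedding and encoder layers; only the classification head is trained.
for param in model.base_model.parameters():
    param.requires_grad = False

# Hypothetical annotation file with "text" and "score" columns (the real layout may differ).
ds = load_dataset("csv", data_files="llama3_annotations.csv")["train"]
ds = ds.train_test_split(test_size=0.1, seed=0)

def preprocess(batch):
    enc = tokenizer(batch["text"], truncation=True, max_length=512)
    enc["labels"] = [float(s) for s in batch["score"]]
    return enc

ds = ds.map(preprocess, batched=True, remove_columns=["text", "score"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="fineweb-edu-scorer",
        learning_rate=3e-4,
        num_train_epochs=20,
        per_device_train_batch_size=128,
    ),
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
    tokenizer=tokenizer,
)
trainer.train()
# The report selects the checkpoint with the best binary F1 (score >= 3) on the held-out set
# and rounds predictions to integers from 0 to 5; that selection step is omitted here.
```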
+
 <h3>Filtering and results</h3>
-<p>We applied the classifier to the 15T tokens of 🍷 FineWeb, a process that required 6,000 H100 GPU hours. We investigated the impact of using different thresholds for the filtering and found that threshold 3 gave the best overall results. Although using a threshold higher than 3 improves performance on knowledge and reasoning intensive benchmarks, it significantly degrades performance on HellaSwag and PIQA. The plot below shows the performance of each threshold compared to FineWeb on six different benchmarks; it uses a 1.82B model trained on 8B tokens.</p>
+<p>We applied the classifier to the 15T tokens of 🍷 FineWeb, a process that required 6,000 H100 GPU hours. We investigated the impact of using different thresholds for the filtering and found that a threshold of <code>3</code> gave the best overall results. Although using a threshold higher than <code>3</code> improves performance on knowledge- and reasoning-intensive benchmarks, it significantly degrades performance on HellaSwag and PIQA. The plot below shows the performance of each threshold compared to FineWeb on six different benchmarks; it uses a 1.82B model trained on 8B tokens.</p>
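Here is a minimal inference sketch with the released classifier, using the report's threshold of 3 to decide whether a document is kept; the rounding step is an assumption about how the regression output maps back to the integer scale:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "HuggingFaceFW/fineweb-edu-classifier"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.eval()

def edu_score(text: str) -> float:
    """Return the raw regression score (roughly on the 0-5 educational scale)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.squeeze(-1).item()

text = "Photosynthesis is the process by which plants convert light energy into chemical energy."
score = edu_score(text)
keep = round(score) >= 3  # FineWeb-Edu keeps documents at or above the threshold of 3
print(score, keep)
```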
 <div class="main-plot-container">
 <figure>
 <img src="assets/images/edu-8k.png">
src/index.html
CHANGED
(The changes to src/index.html are identical to those shown above for dist/index.html.)