final pass over the text
- dist/index.html +21 -21
- src/index.html +21 -21
dist/index.html
CHANGED
@@ -265,9 +265,9 @@
 fuzzy hash based deduplication technique that scales efficiently to many CPU-nodes and allows us to tune similarity thresholds (by controlling the number and size of buckets) as well as the length of the subsequences considered (by controlling the n-gram size). We chose to collect each document's 5-grams<d-footnote>Our units are "words", computed in the <a href="https://github.com/huggingface/datatrove/blob/e9963f69f1fbab1a61339bd1b497f6e138b9f47f/src/datatrove/pipeline/dedup/minhash.py#L196">MinHash processing function</a> with a <a href="https://github.com/huggingface/datatrove/blob/e9963f69f1fbab1a61339bd1b497f6e138b9f47f/src/datatrove/utils/word_tokenizers.py#L323">language-specific word tokenizer</a>.</d-footnote> and compute minhashes using
 112 hash functions in total, split into 14 buckets of 8 hashes each – targeting documents that are at least
 75% similar. Documents with the same 8 minhashes in any bucket are considered a duplicate of each other.</p>
-<p>This would mean that for two documents with a similarity (
+<p>This would mean that for two documents with a similarity (s)
 of 0.7, 0.75, 0.8 and 0.85, the probability that they would be identified as duplicates would be 56%, 77%,
-92% and 98.8% respectively (
+92% and 98.8% respectively (1-(1-s^8)^{14}). See the plot below for a match probability
 comparison between our setup with 112 hashes and the one from RefinedWeb, with 9000 hashes, divided into 450
 buckets of 20 hashes (that requires a substantially larger amount of compute resources, as each individual hash must be computed, stored and then compared with hashes from other documents):</p>
 <div class="main-plot-container">
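Aside (not part of the commit): a quick sanity check of the match probability formula quoted in the hunk above, P(match) = 1 - (1 - s^r)^b, for the FineWeb configuration (14 buckets of 8 hashes) and the RefinedWeb configuration (450 buckets of 20 hashes) mentioned in the text.

```python
# Illustrative sketch: probability that two documents with MinHash similarity s
# share all hashes in at least one bucket (the LSH banding bound referenced above).

def match_probability(s: float, hashes_per_bucket: int, num_buckets: int) -> float:
    """P(match) = 1 - (1 - s^r)^b, with r hashes per bucket and b buckets."""
    return 1.0 - (1.0 - s ** hashes_per_bucket) ** num_buckets

if __name__ == "__main__":
    for s in (0.70, 0.75, 0.80, 0.85):
        fineweb = match_probability(s, hashes_per_bucket=8, num_buckets=14)       # 112 hashes total
        refinedweb = match_probability(s, hashes_per_bucket=20, num_buckets=450)  # 9000 hashes total
        print(f"s={s:.2f}  FineWeb: {fineweb:6.1%}  RefinedWeb: {refinedweb:6.1%}")
    # The FineWeb column prints roughly 56%, 77%, 92% and 98.8%, matching the text.
```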
@@ -280,22 +280,22 @@
 <p>It should also be noted that intra-document deduplication is already handled by our repetition filter, which removes documents with many repeated lines and paragraphs.</p>

 <h4>More deduplication is always better, right?</h4>
-<p>
+<p>Initially, we were operating under the assumption that <em>more deduplication is always better</em>, so our first approach was to take the entire dataset (all
 90+ dumps) and deduplicate them together as one big dataset using MinHash.</p>
 <p>We did this in an iterative manner: starting with the most
 recent dump (which at the time was 2023-50) and proceeding chronologically until we reached the oldest crawl. We deduplicated each dump
 not only within itself, but removing any document matching any other documents in the previously processed
 dumps. </p>
 <p>For instance, for the second most recent dump (2023-40 at
-the time), we deduplicated it against the most recent one in addition to within itself. As a result, the older the dumps, the
+the time), we deduplicated it against the most recent one in addition to within itself. As a result, the older the dumps, the larger the number of dumps it was deduplicated against and the more data we removed from it (indeed, in the oldest dumps, the deduplication step removed more than 90% of the base filtered data).</p>
 <p>Deduplicating the dataset in this manner resulted in 4
 trillion tokens of data, but, quite surprisingly to us, when training on a randomly sampled 350 billion
-tokens subset, our ablation models showed no improvement over a model trained on the non deduplicated data, scoring far below its predecessor RefinedWeb on our aggregate of tasks (see graph below).</p>
+tokens subset, our ablation models showed next to no improvement over a model trained on the non deduplicated data, scoring far below its predecessor RefinedWeb on our aggregate of tasks (see graph below).</p>
 <div class="main-plot-container">
 <figure><img src="assets/images/dedup_all_dumps_bad.png"/></figure>
 <div id="plot-all_dumps_bad"></div>
 </div>
-<p>This
+<p>This challenged our assumption that more deduplication would inevitably result in higher benchmark scores, so we decided to take a closer look at one of the oldest dumps, dump 2013-48:</p>
 <ul>
 <li>pre deduplication, this dump had ~490 billion tokens</li>
 </ul>
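Aside (not part of the commit): a toy sketch of the iterative cross-dump scheme described in the hunk above, where each dump is deduplicated against itself and against every previously processed (more recent) dump. This is a brute-force stand-in for datatrove's optimized MinHash pipeline, not its actual API; the hashing and bucketing here are only meant to illustrate the control flow.

```python
import hashlib
from typing import Iterable, Iterator, List, Set, Tuple

def _ngrams(text: str, n: int = 5) -> List[str]:
    # Word 5-grams, mirroring the unit of comparison described in the text.
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))]

def minhash_bucket_keys(text: str, num_buckets: int = 14, hashes_per_bucket: int = 8) -> Set[Tuple]:
    """One key per bucket: the tuple of that bucket's 8 minhashes over the document's 5-grams."""
    grams = _ngrams(text)
    keys = set()
    for b in range(num_buckets):
        minhashes = []
        for h in range(hashes_per_bucket):
            seed = f"{b}:{h}:".encode()  # a distinct (slow) hash function per bucket slot
            minhashes.append(min(int(hashlib.md5(seed + g.encode()).hexdigest(), 16) for g in grams))
        keys.add((b, tuple(minhashes)))
    return keys

def dedup_across_dumps(dumps_newest_first: Iterable[Iterable[str]]) -> Iterator[List[str]]:
    seen: Set[Tuple] = set()             # bucket keys from all previously processed dumps
    for dump in dumps_newest_first:      # 2023-50 first, then 2023-40, ...
        kept = []
        for doc in dump:
            keys = minhash_bucket_keys(doc)
            if keys & seen:              # shares a full bucket with an earlier document
                continue                 # -> treated as a duplicate and dropped
            seen |= keys
            kept.append(doc)
        yield kept                       # older dumps are matched against more data, so more is removed
```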
@@ -326,7 +326,7 @@
 removed<d-footnote>Note that these ablation models are trained only on data from this dump so it's considered independently of all the other dumps.</d-footnote>. This is also confirmed by visual inspection: <em>originally kept
 data</em> contains far more ads, lists of keywords and generally badly formatted text than <em>originally removed data</em>.</p>
 <h4>Taking a step back: individual dump dedup</h4>
-<p>We decided to
+<p>We decided to experiment with an alternative approach: we deduplicated
 each dump with MinHash individually (independently of the other dumps). This resulted in 20 trillion
 tokens of data.</p>
 <p>When training on a random sample from this dataset we see
@@ -362,7 +362,7 @@
 </ul>
 <ul>
 <li>each dump has been perfectly individually deduplicated (every single
-document
+document is unique in this dump)
 </li>
 </ul>
 <ul>
@@ -399,8 +399,8 @@
 removed.</p>

 <h4>Other (failed) global approaches</h4>
-<p>To build on top of our newly found method (independently deduplicating each dump). We attempted to
-independently minhash deduped 20 trillion tokens of data (
+<p>To build on top of our newly found method (independently deduplicating each dump). We attempted to improve the performance by further deduplicating the
+independently minhash deduped 20 trillion tokens of data with alternative global (over all dumps) deduplication methods. We explored the following approaches:</p>
 <ul>
 <li>URL deduplication, where we only kept one document per normalized
 (lowercased) URL (71.5% of tokens removed, 5.6T left) – <em>FineWeb URL dedup</em></li>
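Aside (not part of the commit): a minimal sketch of the "FineWeb URL dedup" variant listed in the hunk above, keeping only the first document seen for each normalized (lowercased) URL. The `url` field name is an assumption for illustration; this is not the datatrove implementation.

```python
from typing import Dict, Iterable, Iterator

def url_dedup(docs: Iterable[Dict]) -> Iterator[Dict]:
    """Keep one document per normalized URL, dropping all later documents with the same URL."""
    seen_urls = set()
    for doc in docs:                      # each doc is assumed to carry a "url" metadata field
        url = doc["url"].strip().lower()  # normalization here is just lowercasing, per the text
        if url in seen_urls:
            continue                      # a document with this URL was already kept
        seen_urls.add(url)
        yield doc
```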
@@ -431,7 +431,7 @@
 <div id="plot-dedup_attempts"></div>
 </div>

-<h3>
+<h3>Additional quality filtering</h3>
 <p>By this point we had reached the same performance of the previous work we attempted to reproduce and extend:
 RefinedWeb, using our base filtering and independent MinHash. Still, on our aggregate of tasks, another heavily filtered dataset, the C4 dataset<d-cite bibtex-key="raffel2023exploring"></d-cite>, still showed stronger performances on some benchmarks of our evaluation suite.</p>
 <p>We therefore set out to find new filtering steps that
@@ -454,7 +454,7 @@
 <ul>
 <li>applying “All filters” (drop lines not ending on punctuation marks,
 mentioning javascript and cookie notices + drop documents outside length thresholds, containing “lorem
-ipsum” or a curly bracket, <code>{</code>) allows us to match C4’s HellaSwag performance ("All
+ipsum” or a curly bracket, <code>{</code>) allows us to match C4’s HellaSwag performance ("All filters" vs "C4" curves, respectively).
 </li>
 </ul>
 <ul>
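Aside (not part of the commit): a hedged sketch of the C4-style "All filters" summarized in the hunk above. The terminal punctuation set, length thresholds and keyword list are placeholder assumptions for illustration, not the values used in the FineWeb/datatrove pipeline.

```python
TERMINAL_PUNCTUATION = (".", "!", "?", '"', "'")
MIN_WORDS, MAX_WORDS = 50, 100_000  # placeholder document length thresholds

def apply_all_filters(text: str) -> str | None:
    """Return the filtered document text, or None if the whole document is dropped."""
    lower = text.lower()
    if "lorem ipsum" in lower or "{" in text:
        return None                                   # placeholder / code-like page, drop document
    kept_lines = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped.endswith(TERMINAL_PUNCTUATION):
            continue                                  # drop lines not ending on punctuation
        if "javascript" in stripped.lower() or "cookie" in stripped.lower():
            continue                                  # drop javascript / cookie-notice lines
        kept_lines.append(line)
    filtered = "\n".join(kept_lines)
    if not (MIN_WORDS <= len(filtered.split()) <= MAX_WORDS):
        return None                                   # document outside length thresholds
    return filtered
```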
@@ -535,7 +535,7 @@
 </div>
 <p>These filters allowed us to further improve performance and to, notably, surpass the C4 dataset performance while providing a much larger dataset at the same time.</p>

-<h3>The final FineWeb dataset</h3>
+<h3>The final 🍷 FineWeb dataset</h3>
 <p>The final <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">🍷 FineWeb</a> dataset comprises 15T tokens and
 includes the following previously mentioned steps, in order, each providing a performance boost on our group
 of benchmark tasks:</p>
@@ -556,7 +556,7 @@
 <div id="plot-all_filtering_steps"></div>
 </div>
 <h4>Comparisons with other web-scale datasets</h4>
-<p>We compared <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">🍷 FineWeb</a> with the following datasets that are usually considered the highest quality web-scale datasets
+<p>We compared <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">🍷 FineWeb</a> with the following datasets that are usually considered the highest quality openly accessible web-scale datasets (we also indicate for each the approximate number of tokens in the public version of the dataset):</p>
 <ul>
 <li><a
 href="https://huggingface.co/datasets/tiiuae/falcon-refinedweb">RefinedWeb</a> (500B tokens)<d-cite bibtex-key="penedo2023refinedweb"></d-cite>
@@ -597,7 +597,7 @@
 <figure><img src="assets/images/dataset_ablations.png"/></figure>
 <div id="plot-dataset_ablations"></div>
 </div>
-<p>🍷 FineWeb is thus –
+<p>🍷 FineWeb is thus – to the best of our knowledge – the open dataset leading to the current highest model performances while allowing to train on several trillion tokens.</p>

 <h2>📚 FineWeb-Edu</h2>

@@ -605,7 +605,7 @@
 <img src="assets/images/dataset_comparisons_agg_fw_edu.png"/>
 <figcaption style="font-style: italic; margin-top: 10px;">📚 FineWeb-Edu outperforms 🍷 FineWeb and all other open web datasets on our group of evaluation tasks.</figcaption>
 </figure>
-<p><a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu">📚 FineWeb-Edu</a> is an additional
+<p><a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu">📚 FineWeb-Edu</a> is an additional development of FineWeb that we are excited to introduce in this tech report and openly release. 📚 FineWeb-Edu is based on a new approach that has recently emerged for filtering LLM training datasets: using synthetic data to develop classifiers for identifying educational content. This technique was notably used in the trainings of Llama 3<d-cite bibtex-key="llama3modelcard"></d-cite> and Phi3<d-cite bibtex-key="abdin2024phi"></d-cite>, but its large-scale impact on web data filtering has, in our opinion, thur far not been publicly explored to its full potential.</p>
 <p>The popular Phi3 models were trained on 3.3 and 4.8 trillion tokens, with the paper<d-cite bibtex-key="abdin2024phi"></d-cite> stating:</p>
 <blockquote>Our training data consists of heavily filtered publicly available web data (according to the 'educational level') from various open internet sources, as well as synthetic LLM-generated data.</blockquote>
 <p>Similarly, Llama 3 blog post<d-cite bibtex-key="meta2024responsible"></d-cite> notes:</p>
@@ -614,15 +614,15 @@

 <h3>Annotating for educational quality at scale</h3>
 <p>We used <a href="https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct">Llama-3-70B-Instruct</a> to annotate 500k samples from 🍷 FineWeb, scoring each for their educational quality on a scale from 0 to 5.</p>
-<p>We explored various prompt
+<p>We explored various prompt formats to automatically extract an educational score using an LLM and found that the additive scale by Yuan et al.<d-cite bibtex-key="yuan2024self"></d-cite> worked best. This scale allows the LLM to reason about each additional point awarded, unlike the single-rating Likert scale which fits samples into predefined boxes. Then, to avoid the LLM favoring highly technical pages like arXiv abstracts and submissions, we focused on grade-school and middle-school level knowledge. By setting a threshold of 3 (on a scale of 0 to 5) during the filtering process, we were able to also retain some high-level educational pages.</p>
 <div style="text-align: center; margin: 20px 0;">
 <img src="https://cdn-uploads.huggingface.co/production/uploads/61c141342aac764ce1654e43/fjZQ4izIj1rx1xQnBTKKr.png" alt="Prompt for LLM annotation" style="width: 90%; max-width: 800px; height: auto;">
 <figcaption style="font-style: italic; margin-top: 10px;">Prompt used for Llama3 annotations of the educational score, also available <a href="https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier/blob/main/utils/prompt.txt">here</a>.</figcaption>
 </div>
-<p>In terms of open-weight
+<p>In terms of open-weight models to use for annotating the data, we experimented with several models including <a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1">Mixtral-8x7B-Instruct</a> and <a href="https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1">Mixtral-8x22B-Instruct</a>, <a href="https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct">Llama-3-70B-Instruct</a> as well as a jury gathering the scores from these three models<d-cite bibtex-key="verga2024replacing"></d-cite>. In our experiments we found that using Llama3 alone gave the most reliable results.</p>

 <h3>Training a classifier</h3>
-<p>To scale our
+<p>To scale our annotations to the trillions of tokens in FineWeb, we used the Llama3-70B annotations to train a small classifier. The model we used was a <a href="https://huggingface.co/Snowflake/snowflake-arctic-embed-m">Snowflake-arctic-embed</a> embedding model with a classification head with a single regression output on top of it. We trained this model on the 450,000 Llama 3 annotations for 20 epochs with a learning rate of 3e-4, freezing the embedding and encoder layers. We saved the checkpoint with the highest F1 score on our held-out validation set of 45k samples, treating Llama 3 annotations as ground-truth. After training, we rounded the scores to integers from <code>0</code> to <code>5</code>.</p>
 <p>We then converted the problem to a binary classification task by using a fixed threshold to determine if a file is educational. With a threshold of <code>3</code>, the model achieved an F1 score of 82% on the validation set, indicating strong performance in distinguishing high-quality educational content.</p>
 <p>The classifier is available at: <a href="https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier">HuggingFaceFW/fineweb-edu-classifier</a>. The training and inference code is available on <a href="https://github.com/huggingface/cosmopedia/tree/main/classification">GitHub</a>.</p>

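Aside (not part of the commit): a rough sketch of the classifier setup described in the hunk above (a single regression output on top of a frozen Snowflake-arctic-embed encoder, with scores later thresholded at 3). Hyperparameters are taken from the text where given; everything else (batching, helper names) is a placeholder, and the released training code in the cosmopedia repo linked above is the reference implementation.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "Snowflake/snowflake-arctic-embed-m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=1, problem_type="regression"  # single regression output, MSE loss
)
for param in model.base_model.parameters():  # freeze the embedding and encoder layers
    param.requires_grad = False

optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=3e-4)

def regression_step(texts: list[str], llama_scores: list[float]) -> float:
    """One training step against the Llama-3 educational scores (0 to 5)."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    labels = torch.tensor(llama_scores, dtype=torch.float)
    out = model(**batch, labels=labels)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

@torch.no_grad()
def predict_int_score(text: str) -> int:
    batch = tokenizer(text, truncation=True, return_tensors="pt")
    score = model(**batch).logits.squeeze().item()
    return int(min(max(round(score), 0), 5))  # round and clip to the 0-5 scale

def is_educational(text: str, threshold: int = 3) -> bool:
    return predict_int_score(text) >= threshold
```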
@@ -692,8 +692,8 @@
 <p>We expect to continue seeing increasing quantities of synthetic data on new CC crawls. However, while for relatively small trainings this data does not seem to harm performance (and might actually improve it), it is not clear that this holds for much larger trainings.</p>

 <h2>Conclusion and looking forward</h2>
-<p>Through our open science efforts we hope to
-<p>In
+<p>Through our open science efforts we hope to keep shining a light on the black box that is the training of high performance large language models as well as to give every model trainer the ability to create state-of-the-art LLMs. We are excited to continue iterating on FineWeb and to release increasingly better filtered subsets of web data, in a fully open and reproducible manner.</p>
+<p>In the short term, we are looking forward to applying the learnings from (English) FineWeb to other languages. While English currently dominates the LLM landscape, we believe that making high quality web data in other languages as accessible as possible would be incredibly impactful.</p>
 <p>In a nutshell: the future is bright and exciting for studying the science of creating datasets at scale and in the open 🤗.</p>
 </d-article>

src/index.html
CHANGED
(The changes to src/index.html are identical to those shown above for dist/index.html.)