guipenedo (HF staff) committed
Commit 3985998 • 1 Parent(s): 9a51a2c

final pass over the text

Files changed (2)
  1. dist/index.html +21 -21
  2. src/index.html +21 -21
dist/index.html CHANGED
@@ -265,9 +265,9 @@
265
  fuzzy hash based deduplication technique that scales efficiently to many CPU-nodes and allows us to tune similarity thresholds (by controlling the number and size of buckets) as well as the length of the subsequences considered (by controlling the n-gram size). We chose to collect each document's 5-grams<d-footnote>Our units are "words", computed in the <a href="https://github.com/huggingface/datatrove/blob/e9963f69f1fbab1a61339bd1b497f6e138b9f47f/src/datatrove/pipeline/dedup/minhash.py#L196">MinHash processing function</a> with a <a href="https://github.com/huggingface/datatrove/blob/e9963f69f1fbab1a61339bd1b497f6e138b9f47f/src/datatrove/utils/word_tokenizers.py#L323">language-specific word tokenizer</a>.</d-footnote> and compute minhashes using
266
  112 hash functions in total, split into 14 buckets of 8 hashes each – targeting documents that are at least
267
  75% similar. Documents with the same 8 minhashes in any bucket are considered duplicates of each other.</p>
268
- <p>This would mean that for two documents with a similarity ($$s$$)
269
  of 0.7, 0.75, 0.8 and 0.85, the probability that they would be identified as duplicates would be 56%, 77%,
270
- 92% and 98.8% respectively ($$1-(1-s^8)^{14}$$). See the plot below for a match probability
271
  comparison between our setup with 112 hashes and the one from RefinedWeb, with 9000 hashes, divided into 450
272
  buckets of 20 hashes (that requires a substantially larger amount of compute resources, as each individual hash must be computed, stored and then compared with hashes from other documents):</p>
273
  <div class="main-plot-container">
@@ -280,22 +280,22 @@
280
  <p>It should also be noted that intra-document deduplication is already handled by our repetition filter, which removes documents with many repeated lines and paragraphs.</p>
281
 
282
  <h4>More deduplication is always better, right?</h4>
283
- <p>We started the project with the assumption that <em>more deduplication is always better</em>, so our initial approach was to take the entire dataset (all
284
  90+ dumps) and deduplicate them together as one big dataset using MinHash.</p>
285
  <p>We did this in an iterative manner: starting with the most
286
  recent dump (which at the time was 2023-50) and proceeding chronologically until we reached the oldest crawl. We deduplicated each dump
287
  not only within itself, but also removed any document matching any other document in the previously processed
288
  dumps. </p>
289
  <p>For instance, for the second most recent dump (2023-40 at
290
- the time), we deduplicated it against the most recent one in addition to within itself. As a result, the older the dumps, the higher the number of dumps it was deduplicated against and the more we removed data from it (indeed, in the oldest dumps we removed more than 90% of the data in the deduplication step).</p>
291
  <p>Deduplicating the dataset in this manner resulted in 4
292
  trillion tokens of data, but, quite surprisingly to us, when training on a randomly sampled 350 billion
293
- tokens subset, our ablation models showed no improvement over a model trained on the non deduplicated data, scoring far below its predecessor RefinedWeb on our aggregate of tasks (see graph below).</p>
294
  <div class="main-plot-container">
295
  <figure><img src="assets/images/dedup_all_dumps_bad.png"/></figure>
296
  <div id="plot-all_dumps_bad"></div>
297
  </div>
298
- <p>This was challenging our assumption that more deduplication was always better so we decided to take a closer look at one of the oldest dumps, dump 2013-48:</p>
299
  <ul>
300
  <li>pre deduplication, this dump had ~490 billion tokens</li>
301
  </ul>
@@ -326,7 +326,7 @@
326
  removed<d-footnote>Note that these ablation models are trained only on data from this dump so it's considered independently of all the other dumps.</d-footnote>. This is also confirmed by visual inspection: <em>originally kept
327
  data</em> contains far more ads, lists of keywords and generally badly formatted text than <em>originally removed data</em>.</p>
328
  <h4>Taking a step back: individual dump dedup</h4>
329
- <p>We decided to experimence with alternative approaches: we deduplicated
330
  each dump with MinHash individually (independently of the other dumps). This resulted in 20 trillion
331
  tokens of data.</p>
332
  <p>When training on a random sample from this dataset we see
@@ -362,7 +362,7 @@
362
  </ul>
363
  <ul>
364
  <li>each dump has been perfectly individually deduplicated (every single
365
- document in a is unique in this dump)
366
  </li>
367
  </ul>
368
  <ul>
@@ -399,8 +399,8 @@
399
  removed.</p>
400
 
401
  <h4>Other (failed) global approaches</h4>
402
- <p>To build on top of our newly found method (independently deduplicating each dump). We attempted to further improve the performance further deduplicating the
403
- independently minhash deduped 20 trillion tokens of data (globally, over all dumps). We explored the following methods:</p>
404
  <ul>
405
  <li>URL deduplication, where we only kept one document per normalized
406
  (lowercased) URL (71.5% of tokens removed, 5.6T left) – <em>FineWeb URL dedup</em></li>
@@ -431,7 +431,7 @@
431
  <div id="plot-dedup_attempts"></div>
432
  </div>
433
 
434
- <h3>Filtering the data even more for quality</h3>
435
  <p>By this point we had reached the same performance as the previous work we attempted to reproduce and extend:
436
  RefinedWeb, using our base filtering and independent MinHash. Still, on our aggregate of tasks, another heavily filtered dataset, the C4 dataset<d-cite bibtex-key="raffel2023exploring"></d-cite>, showed stronger performance on some benchmarks of our evaluation suite.</p>
437
  <p>We therefore set out to find new filtering steps that
@@ -454,7 +454,7 @@
454
  <ul>
455
  <li>applying “All filters” (drop lines not ending on punctuation marks,
456
  mentioning javascript and cookie notices + drop documents outside length thresholds, containing “lorem
457
- ipsum” or a curly bracket, <code>{</code>) allows us to match C4’s HellaSwag performance ("All filter" versus "C4" curves).
458
  </li>
459
  </ul>
460
  <ul>
@@ -535,7 +535,7 @@
535
  </div>
536
  <p>These filters allowed us to further improve performance and to, notably, surpass the C4 dataset performance while providing a much larger dataset at the same time.</p>
537
 
538
- <h3>The final FineWeb dataset</h3>
539
  <p>The final <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">🍷 FineWeb</a> dataset comprises 15T tokens and
540
  includes the following previously mentioned steps, in order, each providing a performance boost on our group
541
  of benchmark tasks:</p>
@@ -556,7 +556,7 @@
556
  <div id="plot-all_filtering_steps"></div>
557
  </div>
558
  <h4>Comparisons with other web-scale datasets</h4>
559
- <p>We compared <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">🍷 FineWeb</a> with the following datasets that are usually considered the highest quality web-scale datasets openly accessible (we also indicate for each the approximate number of tokens in the public version of the dataset):</p>
560
  <ul>
561
  <li><a
562
  href="https://huggingface.co/datasets/tiiuae/falcon-refinedweb">RefinedWeb</a> (500B tokens)<d-cite bibtex-key="penedo2023refinedweb"></d-cite>
@@ -597,7 +597,7 @@
597
  <figure><img src="assets/images/dataset_ablations.png"/></figure>
598
  <div id="plot-dataset_ablations"></div>
599
  </div>
600
- <p>🍷 FineWeb is thus –up to our knowledge– the dataset leading to the current highest model performances while allowing to train on several trillion of openly accessible unique tokens.</p>
601
 
602
  <h2>📚 FineWeb-Edu</h2>
603
 
@@ -605,7 +605,7 @@
605
  <img src="assets/images/dataset_comparisons_agg_fw_edu.png"/>
606
  <figcaption style="font-style: italic; margin-top: 10px;">📚 FineWeb-Edu outperforms 🍷 FineWeb and all other open web datasets on our group of evaluation tasks.</figcaption>
607
  </figure>
608
- <p><a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu">📚 FineWeb-Edu</a> is an additional developement of FineWeb that we are excited to introduce in this tech report and openly release. FineWeb-Edu is based on a new approach that recently emerged for filtering LLM training datasets: using synthetic data to develop classifiers for identifying educational content. This technique was notably used in the trainings of Llama 3<d-cite bibtex-key="llama3modelcard"></d-cite> and Phi3<d-cite bibtex-key="abdin2024phi"></d-cite> but its large-scale impact on web data filtering hasn't been really published or fully explored in public yet in our opinion.</p>
609
  <p>The popular Phi3 models were trained on 3.3 and 4.8 trillion tokens, with the paper<d-cite bibtex-key="abdin2024phi"></d-cite> stating:</p>
610
  <blockquote>Our training data consists of heavily filtered publicly available web data (according to the 'educational level') from various open internet sources, as well as synthetic LLM-generated data.</blockquote>
611
  <p>Similarly, the Llama 3 blog post<d-cite bibtex-key="meta2024responsible"></d-cite> notes:</p>
@@ -614,15 +614,15 @@
614
 
615
  <h3>Annotating for educational quality at scale</h3>
616
  <p>We used <a href="https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct">Llama-3-70B-Instruct</a> to annotate 500k samples from 🍷 FineWeb, scoring each for their educational quality on a scale from 0 to 5.</p>
617
- <p>We explored various prompt format to automatically extract an educational score using an LLM and found that the additive scale by Yuan et al.<d-cite bibtex-key="yuan2024self"></d-cite> worked best. This scale allows the LLM to reason about each additional point awarded, unlike the single-rating Likert scale which fits samples into predefined boxes. Then, to avoid the LLM favoring highly technical pages like arXiv abstracts and submissions, we focused on grade-school and middle-school level knowledge. By setting a threshold of 3 (on a scale of 0 to 5) during the filtering process, we were able to also retain some high-level educational pages.</p>
618
  <div style="text-align: center; margin: 20px 0;">
619
  <img src="https://cdn-uploads.huggingface.co/production/uploads/61c141342aac764ce1654e43/fjZQ4izIj1rx1xQnBTKKr.png" alt="Prompt for LLM annotation" style="width: 90%; max-width: 800px; height: auto;">
620
  <figcaption style="font-style: italic; margin-top: 10px;">Prompt used for Llama3 annotations of the educational score, also available <a href="https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier/blob/main/utils/prompt.txt">here</a>.</figcaption>
621
  </div>
622
- <p>In terms of open-weight model to use for annotating the data, we experimented with several models including <a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1">Mixtral-8x7B-Instruct</a> and <a href="https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1">Mixtral-8x22B-Instruct</a>, <a href="https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct">Llama-3-70B-Instruct</a> as well as a jury gathering these three models<d-cite bibtex-key="verga2024replacing"></d-cite>. In our experimentations, we found that using Llama3 alone gave the most reliable results.</p>
623
 
624
  <h3>Training a classifier</h3>
625
- <p>To scale our annotation to the trillion tokens of FineWeb, we trained a classifier from the 450k annotation of our Llama3-70B model. The model we used was a <a href="https://huggingface.co/Snowflake/snowflake-arctic-embed-m">Snowflake-arctic-embed</a> embedding model with a classification head with a single regression output on top of it. We trained this model on 450,000 Llama 3 annotations for 20 epochs with a learning rate of 3e-4, freezing the embedding and encoder layers. We saved the checkpoint with the highest F1 score on our held-out validation set of 45k samples, treating Llama 3 annotations as ground-truth. After training, we rounded the scores to integers from <code>0</code> to <code>5</code>.</p>
626
  <p>We then converted the problem to a binary classification task by using a fixed threshold to determine if a file is educational. With a threshold of <code>3</code>, the model achieved an F1 score of 82% on the validation set, indicating strong performance in distinguishing high-quality educational content.</p>
627
  <p>The classifier is available at: <a href="https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier">HuggingFaceFW/fineweb-edu-classifier</a>. The training and inference code is available on <a href="https://github.com/huggingface/cosmopedia/tree/main/classification">GitHub</a>.</p>
628
 
@@ -692,8 +692,8 @@
692
  <p>We expect to continue seeing increasing quantities of synthetic data on new CC crawls. However, while for relatively small trainings this data does not seem to harm performance (and might actually improve it), it is not clear that this holds for much larger trainings.</p>
693
 
694
  <h2>Conclusion and looking forward</h2>
695
- <p>Through our open science efforts we hope to open more and more the black box around training high performance large language models as well as give every model trainer the ability to create state-of-the-art LLMs. We're excited to continue iterating on FineWeb and increasingly better filtered subsets of web data, in a fully open and reproducible manner.</p>
696
- <p>In particular in the short term, while English currently dominates the large language model landscape, we're looking forward to applying the learnings we make in this project to make high quality training data available in other languages as well and as accessible as possible.</p>
697
  <p>In a nutshell: the future is bright and exciting for studying the science of creating datasets at scale and in the open 🤗.</p>
698
  </d-article>
699
 
 
265
  fuzzy hash based deduplication technique that scales efficiently to many CPU-nodes and allows us to tune similarity thresholds (by controlling the number and size of buckets) as well as the length of the subsequences considered (by controlling the n-gram size). We chose to collect each document's 5-grams<d-footnote>Our units are "words", computed in the <a href="https://github.com/huggingface/datatrove/blob/e9963f69f1fbab1a61339bd1b497f6e138b9f47f/src/datatrove/pipeline/dedup/minhash.py#L196">MinHash processing function</a> with a <a href="https://github.com/huggingface/datatrove/blob/e9963f69f1fbab1a61339bd1b497f6e138b9f47f/src/datatrove/utils/word_tokenizers.py#L323">language-specific word tokenizer</a>.</d-footnote> and compute minhashes using
266
  112 hash functions in total, split into 14 buckets of 8 hashes each – targeting documents that are at least
267
  75% similar. Documents with the same 8 minhashes in any bucket are considered duplicates of each other.</p>
268
+ <p>This would mean that for two documents with a similarity (s)
269
  of 0.7, 0.75, 0.8 and 0.85, the probability that they would be identified as duplicates would be 56%, 77%,
270
+ 92% and 98.8% respectively (1-(1-s^8)^{14}). See the plot below for a match probability
271
  comparison between our setup with 112 hashes and the one from RefinedWeb, with 9000 hashes, divided into 450
272
  buckets of 20 hashes (that requires a substantially larger amount of compute resources, as each individual hash must be computed, stored and then compared with hashes from other documents):</p>
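These percentages follow directly from the bucketing scheme: with r buckets of b hashes each, two documents of true similarity s share all b minhashes in at least one bucket with probability 1 - (1 - s^b)^r. Below is a minimal plain-Python check of the figures quoted above, comparing our 14x8 configuration with RefinedWeb's 450x20 one (this is just the closed-form formula, not part of the datatrove pipeline):

```python
# Probability that two documents with Jaccard similarity s share all
# `hashes_per_bucket` minhashes in at least one of `buckets` buckets.
def match_probability(s: float, buckets: int, hashes_per_bucket: int) -> float:
    return 1 - (1 - s ** hashes_per_bucket) ** buckets

for s in (0.70, 0.75, 0.80, 0.85):
    ours = match_probability(s, buckets=14, hashes_per_bucket=8)          # 112 hashes
    refinedweb = match_probability(s, buckets=450, hashes_per_bucket=20)  # 9000 hashes
    print(f"s={s:.2f}  FineWeb 14x8: {ours:.1%}  RefinedWeb 450x20: {refinedweb:.1%}")
# The 14x8 column reproduces the 56%, 77%, 92% and 98.8% figures quoted above.
```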
273
  <div class="main-plot-container">
 
280
  <p>It should also be noted that intra-document deduplication is already handled by our repetition filter, which removes documents with many repeated lines and paragraphs.</p>
281
 
282
  <h4>More deduplication is always better, right?</h4>
283
+ <p>Initially, we were operating under the assumption that <em>more deduplication is always better</em>, so our first approach was to take the entire dataset (all
284
  90+ dumps) and deduplicate them together as one big dataset using MinHash.</p>
285
  <p>We did this in an iterative manner: starting with the most
286
  recent dump (which at the time was 2023-50) and proceeding chronologically until we reached the oldest crawl. We deduplicated each dump
287
  not only within itself, but also removed any document matching any other document in the previously processed
288
  dumps. </p>
289
  <p>For instance, for the second most recent dump (2023-40 at
290
+ the time), we deduplicated it against the most recent one in addition to within itself. As a result, the older the dumps, the larger the number of dumps it was deduplicated against and the more data we removed from it (indeed, in the oldest dumps, the deduplication step removed more than 90% of the base filtered data).</p>
291
  <p>Deduplicating the dataset in this manner resulted in 4
292
  trillion tokens of data, but, quite surprisingly to us, when training on a randomly sampled 350 billion
293
+ tokens subset, our ablation models showed next to no improvement over a model trained on the non deduplicated data, scoring far below its predecessor RefinedWeb on our aggregate of tasks (see graph below).</p>
294
  <div class="main-plot-container">
295
  <figure><img src="assets/images/dedup_all_dumps_bad.png"/></figure>
296
  <div id="plot-all_dumps_bad"></div>
297
  </div>
298
+ <p>This challenged our assumption that more deduplication would inevitably result in higher benchmark scores, so we decided to take a closer look at one of the oldest dumps, dump 2013-48:</p>
299
  <ul>
300
  <li>pre deduplication, this dump had ~490 billion tokens</li>
301
  </ul>
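As an illustration of the iterative cross-dump scheme described above, here is a toy sketch in plain Python. It is not the datatrove implementation: `bucket_signatures` is a hypothetical helper that returns a document's 14 bucket signatures (each a tuple of 8 minhashes), and dumps are assumed to be iterated from the most recent crawl to the oldest.

```python
# Toy illustration of the iterative cross-dump MinHash deduplication described
# above (not the actual datatrove pipeline).
def deduplicate_across_dumps(dumps, bucket_signatures):
    """dumps: iterable of (dump_id, documents), most recent crawl first."""
    seen = set()          # bucket signatures from this and all previously processed dumps
    kept_per_dump = {}
    for dump_id, documents in dumps:
        kept = []
        for doc in documents:
            sigs = bucket_signatures(doc)      # 14 tuples of 8 minhashes each
            if any(sig in seen for sig in sigs):
                continue                       # duplicate of an earlier document
            seen.update(sigs)
            kept.append(doc)
        kept_per_dump[dump_id] = kept
    return kept_per_dump

# The per-dump variant discussed further below simply resets `seen` for each dump,
# so documents are only compared against others from the same crawl.
```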
 
326
  removed<d-footnote>Note that these ablation models are trained only on data from this dump so it's considered independently of all the other dumps.</d-footnote>. This is also confirmed by visual inspection: <em>originally kept
327
  data</em> contains far more ads, lists of keywords and generally badly formatted text than <em>originally removed data</em>.</p>
328
  <h4>Taking a step back: individual dump dedup</h4>
329
+ <p>We decided to experiment with an alternative approach: we deduplicated
330
  each dump with MinHash individually (independently of the other dumps). This resulted in 20 trillion
331
  tokens of data.</p>
332
  <p>When training on a random sample from this dataset we see
 
362
  </ul>
363
  <ul>
364
  <li>each dump has been perfectly individually deduplicated (every single
365
+ document is unique in this dump)
366
  </li>
367
  </ul>
368
  <ul>
 
399
  removed.</p>
400
 
401
  <h4>Other (failed) global approaches</h4>
402
+ <p>To build on top of our newly found method (independently deduplicating each dump), we attempted to improve the performance by further deduplicating the
403
+ independently minhash deduped 20 trillion tokens of data with alternative global (over all dumps) deduplication methods. We explored the following approaches:</p>
404
  <ul>
405
  <li>URL deduplication, where we only kept one document per normalized
406
  (lowercased) URL (71.5% of tokens removed, 5.6T left) – <em>FineWeb URL dedup</em></li>
 
431
  <div id="plot-dedup_attempts"></div>
432
  </div>
433
 
434
+ <h3>Additional quality filtering</h3>
435
  <p>By this point we had reached the same performance as the previous work we attempted to reproduce and extend:
436
  RefinedWeb, using our base filtering and independent MinHash. Still, on our aggregate of tasks, another heavily filtered dataset, the C4 dataset<d-cite bibtex-key="raffel2023exploring"></d-cite>, showed stronger performance on some benchmarks of our evaluation suite.</p>
437
  <p>We therefore set out to find new filtering steps that
 
454
  <ul>
455
  <li>applying β€œAll filters” (drop lines not ending on punctuation marks,
456
  mentioning javascript and cookie notices + drop documents outside length thresholds, containing β€œlorem
457
+ ipsum” or a curly bracket, <code>{</code>) allows us to match C4’s HellaSwag performance ("All filters" vs "C4" curves, respectively).
458
  </li>
459
  </ul>
460
  <ul>
 
535
  </div>
536
  <p>These filters allowed us to further improve performance and to, notably, surpass the C4 dataset performance while providing a much larger dataset at the same time.</p>
537
 
538
+ <h3>The final 🍷 FineWeb dataset</h3>
539
  <p>The final <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">🍷 FineWeb</a> dataset comprises 15T tokens and
540
  includes the following previously mentioned steps, in order, each providing a performance boost on our group
541
  of benchmark tasks:</p>
 
556
  <div id="plot-all_filtering_steps"></div>
557
  </div>
558
  <h4>Comparisons with other web-scale datasets</h4>
559
+ <p>We compared <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">🍷 FineWeb</a> with the following datasets that are usually considered the highest quality openly accessible web-scale datasets (we also indicate for each the approximate number of tokens in the public version of the dataset):</p>
560
  <ul>
561
  <li><a
562
  href="https://huggingface.co/datasets/tiiuae/falcon-refinedweb">RefinedWeb</a> (500B tokens)<d-cite bibtex-key="penedo2023refinedweb"></d-cite>
 
597
  <figure><img src="assets/images/dataset_ablations.png"/></figure>
598
  <div id="plot-dataset_ablations"></div>
599
  </div>
600
+ <p>🍷 FineWeb is thus – to the best of our knowledge – the open dataset leading to the highest current model performance, while allowing training on several trillion tokens.</p>
601
 
602
  <h2>📚 FineWeb-Edu</h2>
603
 
 
605
  <img src="assets/images/dataset_comparisons_agg_fw_edu.png"/>
606
  <figcaption style="font-style: italic; margin-top: 10px;">📚 FineWeb-Edu outperforms 🍷 FineWeb and all other open web datasets on our group of evaluation tasks.</figcaption>
607
  </figure>
608
+ <p><a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu">📚 FineWeb-Edu</a> is an additional development of FineWeb that we are excited to introduce in this tech report and openly release. 📚 FineWeb-Edu is based on a new approach that has recently emerged for filtering LLM training datasets: using synthetic data to develop classifiers for identifying educational content. This technique was notably used in the trainings of Llama 3<d-cite bibtex-key="llama3modelcard"></d-cite> and Phi3<d-cite bibtex-key="abdin2024phi"></d-cite>, but its large-scale impact on web data filtering has, in our opinion, thus far not been publicly explored to its full potential.</p>
609
  <p>The popular Phi3 models were trained on 3.3 and 4.8 trillion tokens, with the paper<d-cite bibtex-key="abdin2024phi"></d-cite> stating:</p>
610
  <blockquote>Our training data consists of heavily filtered publicly available web data (according to the 'educational level') from various open internet sources, as well as synthetic LLM-generated data.</blockquote>
611
  <p>Similarly, the Llama 3 blog post<d-cite bibtex-key="meta2024responsible"></d-cite> notes:</p>
 
614
 
615
  <h3>Annotating for educational quality at scale</h3>
616
  <p>We used <a href="https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct">Llama-3-70B-Instruct</a> to annotate 500k samples from 🍷 FineWeb, scoring each for their educational quality on a scale from 0 to 5.</p>
617
+ <p>We explored various prompt formats to automatically extract an educational score using an LLM and found that the additive scale by Yuan et al.<d-cite bibtex-key="yuan2024self"></d-cite> worked best. This scale allows the LLM to reason about each additional point awarded, unlike the single-rating Likert scale which fits samples into predefined boxes. Then, to avoid the LLM favoring highly technical pages like arXiv abstracts and submissions, we focused on grade-school and middle-school level knowledge. By setting a threshold of 3 (on a scale of 0 to 5) during the filtering process, we were able to also retain some high-level educational pages.</p>
618
  <div style="text-align: center; margin: 20px 0;">
619
  <img src="https://cdn-uploads.huggingface.co/production/uploads/61c141342aac764ce1654e43/fjZQ4izIj1rx1xQnBTKKr.png" alt="Prompt for LLM annotation" style="width: 90%; max-width: 800px; height: auto;">
620
  <figcaption style="font-style: italic; margin-top: 10px;">Prompt used for Llama3 annotations of the educational score, also available <a href="https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier/blob/main/utils/prompt.txt">here</a>.</figcaption>
621
  </div>
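As a sketch of how such scoring can be scripted against an open-weight model, the snippet below uses `huggingface_hub.InferenceClient`. The prompt is an abridged placeholder (the full prompt is linked in the caption above) and the score extraction is simplified; the actual annotation was run at a much larger scale:

```python
# Hedged sketch of LLM-based educational scoring; the prompt is an abridged
# placeholder and not the exact prompt used for FineWeb-Edu annotations.
import re
from huggingface_hub import InferenceClient

client = InferenceClient("meta-llama/Meta-Llama-3-70B-Instruct")

PROMPT = (
    "Below is an extract from a web page. Evaluate its educational value for "
    "grade-school to middle-school students using an additive 0-5 scale, "
    "justifying each point, and end your answer with 'Educational score: <total>'.\n\n"
    "Extract:\n{extract}"
)

def annotate(extract: str) -> int | None:
    completion = client.chat_completion(
        messages=[{"role": "user", "content": PROMPT.format(extract=extract[:3000])}],
        max_tokens=512,
    )
    answer = completion.choices[0].message.content
    match = re.search(r"Educational score:\s*([0-5])", answer)
    return int(match.group(1)) if match else None  # None if the model ignored the format
```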
622
+ <p>In terms of open-weight models to use for annotating the data, we experimented with several models including <a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1">Mixtral-8x7B-Instruct</a> and <a href="https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1">Mixtral-8x22B-Instruct</a>, <a href="https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct">Llama-3-70B-Instruct</a> as well as a jury gathering the scores from these three models<d-cite bibtex-key="verga2024replacing"></d-cite>. In our experiments we found that using Llama3 alone gave the most reliable results.</p>
623
 
624
  <h3>Training a classifier</h3>
625
+ <p>To scale our annotations to the trillions of tokens in FineWeb, we used the Llama3-70B annotations to train a small classifier. The model we used was a <a href="https://huggingface.co/Snowflake/snowflake-arctic-embed-m">Snowflake-arctic-embed</a> embedding model with a classification head with a single regression output on top of it. We trained this model on the 450,000 Llama 3 annotations for 20 epochs with a learning rate of 3e-4, freezing the embedding and encoder layers. We saved the checkpoint with the highest F1 score on our held-out validation set of 45k samples, treating Llama 3 annotations as ground-truth. After training, we rounded the scores to integers from <code>0</code> to <code>5</code>.</p>
626
  <p>We then converted the problem to a binary classification task by using a fixed threshold to determine if a file is educational. With a threshold of <code>3</code>, the model achieved an F1 score of 82% on the validation set, indicating strong performance in distinguishing high-quality educational content.</p>
627
  <p>The classifier is available at: <a href="https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier">HuggingFaceFW/fineweb-edu-classifier</a>. The training and inference code is available on <a href="https://github.com/huggingface/cosmopedia/tree/main/classification">GitHub</a>.</p>
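A minimal sketch of this classifier setup, assuming the standard `transformers` sequence-classification head on top of the embedding model (the actual training code lives in the cosmopedia repository linked above):

```python
# Sketch of the FineWeb-Edu classifier setup: a frozen Snowflake-arctic-embed
# encoder with a single regression output, trained on the Llama 3 scores.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "Snowflake/snowflake-arctic-embed-m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=1, problem_type="regression"
)

# Freeze the embedding and encoder layers; only the classification head is trained.
for param in model.base_model.parameters():
    param.requires_grad = False

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=3e-4
)

def predict_score(text: str) -> int:
    """Regression output rounded to an integer educational score in [0, 5]."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        raw = model(**inputs).logits.squeeze().item()
    return int(min(5, max(0, round(raw))))

def is_educational(text: str) -> bool:
    """Binary decision corresponding to the threshold-of-3 filtering."""
    return predict_score(text) >= 3
```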
628
 
 
692
  <p>We expect to continue seeing increasing quantities of synthetic data on new CC crawls. However, while for relatively small trainings this data does not seem to harm performance (and might actually improve it), it is not clear that this holds for much larger trainings.</p>
693
 
694
  <h2>Conclusion and looking forward</h2>
695
+ <p>Through our open science efforts we hope to keep shining a light on the black box that is the training of high performance large language models as well as to give every model trainer the ability to create state-of-the-art LLMs. We are excited to continue iterating on FineWeb and to release increasingly better filtered subsets of web data, in a fully open and reproducible manner.</p>
696
+ <p>In the short term, we are looking forward to applying the learnings from (English) FineWeb to other languages. While English currently dominates the LLM landscape, we believe that making high quality web data in other languages as accessible as possible would be incredibly impactful.</p>
697
  <p>In a nutshell: the future is bright and exciting for studying the science of creating datasets at scale and in the open 🤗.</p>
698
  </d-article>
699
 
src/index.html CHANGED
(identical changes to dist/index.html above)