hynky (HF staff) committed
Commit ea35903 · 2 Parent(s): ffc056a 5385888

Merge branch 'main' of hf.co:spaces/HuggingFaceFW/blogpost

Files changed (4)
  1. bibliography.bib +19 -0
  2. index.html +55 -50
  3. src/distill.js +0 -0
  4. style.css +13 -0
bibliography.bib CHANGED
@@ -190,4 +190,23 @@
  eprint={2205.10487},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
+ }
+ @article{llama3modelcard,
+   title={Llama 3 Model Card},
+   author={AI@Meta},
+   year={2024},
+   url={https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
+ }
+ @misc{jiang2024mixtral,
+   title={Mixtral of Experts},
+   author={Albert Q. Jiang and Alexandre Sablayrolles and Antoine Roux and Arthur Mensch and Blanche Savary and Chris Bamford and Devendra Singh Chaplot and Diego de las Casas and Emma Bou Hanna and Florian Bressand and Gianna Lengyel and Guillaume Bour and Guillaume Lample and Lélio Renard Lavaud and Lucile Saulnier and Marie-Anne Lachaux and Pierre Stock and Sandeep Subramanian and Sophia Yang and Szymon Antoniak and Teven Le Scao and Théophile Gervet and Thibaut Lavril and Thomas Wang and Timothée Lacroix and William El Sayed},
+   year={2024},
+   eprint={2401.04088},
+   archivePrefix={arXiv},
+   primaryClass={cs.LG}
  }
index.html CHANGED
@@ -1,7 +1,7 @@
  <!doctype html>

  <head>
- <script src="https://distill.pub/template.v2.js"></script>
+ <script src="src/distill.js"></script>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/mathjs/12.4.2/math.min.js" charset="utf-8"></script>
  <script src="https://cdn.plot.ly/plotly-2.32.0.min.js" charset="utf-8"></script>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/lodash.js/4.17.21/lodash.min.js" charset="utf-8"></script>
@@ -122,27 +122,19 @@
  <body>
  <d-front-matter>
  <script id='distill-front-matter' type="text/json">{
- "title": "FineWeb: 15T tokens of high quality web data",
- "description": "This blog covers the FineWeb recipe, why more deduplication is not always better and some interesting findings on the difference in quality of CommonCrawl dumps.",
+ "title": "🍷 FineWeb: decanting the web for the finest text data at scale",
+ "description": "This blog covers a discussion on processing and evaluating data quality at scale, the 🍷 FineWeb recipe (listing and explaining all of our design choices), and the process followed to create 📚 FineWeb-Edu.",
  "published": "May 28, 2024",
+ "affiliation": {"name": "HuggingFace"},
  "authors": [
  {
  "author":"Guilherme Penedo",
- "authorURL":"https://huggingface.co/guipenedo",
- "affiliations": [{"name": "HuggingFace"}]
+ "authorURL":"https://huggingface.co/guipenedo"
  },
  {
  "author":"Hynek Kydlíček",
  "authorURL":"https://huggingface.co/hynky"
  },
- {
- "author":"Leandro Werra",
- "authorURL":"https://huggingface.co/lvwerra"
- },
- {
- "author":"Thomas Wolf",
- "authorURL":"https://huggingface.co/thomwolf"
- },
  {
  "author":"Loubna Ben Allal",
  "authorURL":"https://huggingface.co/loubnabnl"
@@ -150,6 +142,18 @@
  {
  "author":"Anton Lozhkov",
  "authorURL":"https://huggingface.co/anton-l"
+ },
+ {
+ "author":"Colin Raffel",
+ "authorURL":"https://huggingface.co/craffel"
+ },
+ {
+ "author":"Leandro Werra",
+ "authorURL":"https://huggingface.co/lvwerra"
+ },
+ {
+ "author":"Thomas Wolf",
+ "authorURL":"https://huggingface.co/thomwolf"
  }
  ],
  "katex": {
@@ -174,18 +178,18 @@
  </d-contents>

  <p>We have recently released <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb"><strong>🍷 FineWeb</strong></a>, our new large scale
- (<strong>15T</strong> gpt2 tokens, <strong>44TB</strong> disk space) dataset of clean text sourced from the web for LLM pretraining. You can
+ (<strong>15T gpt2 tokens, 44TB disk space</strong>) dataset of clean text sourced from the web for LLM pretraining. You can
  download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">here</a>.</p>
- <p>[TODO: ADD MORE INTRODUCTION]</p>
- <p>We are also excited to announce the release of <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu"><strong>📚 FineWeb-Edu</strong></a>, a filtered version of FineWeb for educational content, available in two sizes: <strong>1.2 trillion and 4.5 trillion tokens</strong>. FineWeb-Edu outperforms all existing public web datasets, with notable improvements on MMLU, ARC, and OpenBookQA benchmarks. You can
+ <p>The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. However, the pretraining datasets for state-of-the-art open LLMs like Llama 3<d-cite bibtex-key="llama3modelcard"></d-cite> and Mixtral<d-cite bibtex-key="jiang2024mixtral"></d-cite> are not publicly available and very little is known about how they were created.</p>
+ <p>🍷 FineWeb, a 15-trillion token dataset derived from 96 <a href="https://commoncrawl.org/">CommonCrawl</a> snapshots, produces better-performing LLMs than other open pretraining datasets. To advance the understanding of how best to curate high-quality pretraining datasets, we carefully document and ablate all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies.</p>
+ <p>We are also excited to announce the release of <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu"><strong>📚 FineWeb-Edu</strong></a>, a version of 🍷 FineWeb that was filtered for educational content, available in two sizes: <strong>1.3 trillion (very high quality) and 5.4 trillion (high quality) tokens</strong>. 📚 FineWeb-Edu outperforms all existing public web datasets, with models pretrained on it showing notable improvements on knowledge- and reasoning-intensive benchmarks like MMLU, ARC, and OpenBookQA. You can
  download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu">here</a>.</p>

- <p>As 🍷FineWeb has gathered a lot of interest from the
+ <p>As 🍷 FineWeb has gathered a lot of interest from the
  community, we decided to further explain the steps involved in creating it, our processing decisions and
  some lessons learned along the way. Read on for all the juicy details on large text dataset creation!</p>
- <p><strong>TLDR:</strong> This blog covers the FineWeb
- recipe, why more deduplication is not always better and some interesting findings on the difference in
- quality of CommonCrawl dumps.</p>
+ <p><strong>TLDR:</strong> This blog covers a discussion on processing and evaluating data quality at scale, the 🍷 FineWeb
+ recipe (listing and explaining all of our design choices), and the process followed to create 📚 FineWeb-Edu.</p>

  <h2>General considerations on web data</h2>
  <h3>Sourcing the data</h3>
@@ -201,13 +205,13 @@
  <li>you use a public repository of crawled webpages, like the one maintained by
  the non-profit <a href="https://commoncrawl.org/">CommonCrawl</a></li>
  </ul>
- <p>For FineWeb, similarly to what was done for a large number
+ <p>For 🍷 FineWeb, similarly to what was done for a large number
  of other public datasets, we used <a href="https://commoncrawl.org/">CommonCrawl</a> as a starting point.
- They have been crawling the web since 2007 (long before LLMs were a thing) and release a new dump usually
+ They have been crawling the web since 2007 (long before LLMs became widespread) and release a new dump usually
  every 1 or 2 months, which can be freely downloaded. </p>
- <p>As an example, their latest crawl (2024-10) contains 3.16
- billion web pages, totaling 424.7 TiB of uncompressed HTML text content (the size changes from dump to dump). There
- are 95 dumps since 2013 and 3 dumps from 2008 to 2012, which are in a different (older) format.<d-footnote>We have not processed these 3 older dumps.</d-footnote> </p>
+ <p>As an example, their latest crawl (2024-18) contains 2.7
+ billion web pages, totaling 386 TiB of uncompressed HTML text content (the size changes from dump to dump). There
+ are 96 dumps since 2013 and 3 dumps from 2008 to 2012, which are in a different (older) format.<d-footnote>We have not processed these 3 older dumps.</d-footnote> </p>
  <h3>Processing at scale</h3>
  <p>Given the sheer size of the data involved, one of the main
  challenges we had to overcome was having a modular, scalable codebase that would allow us to quickly iterate
@@ -216,7 +220,7 @@
  <p>For this purpose, we developed <a
  href="https://github.com/huggingface/datatrove"><code>datatrove</code></a><d-cite bibtex-key="penedo2024datatrove"></d-cite>, an open-source data
  processing library that allowed us to seamlessly scale our filtering and deduplication setup to thousands of
- CPU cores. All the data processing steps involved in the creation of FineWeb used this <a
+ CPU cores. All the data processing steps involved in the creation of 🍷 FineWeb used this <a
  href="https://github.com/huggingface/datatrove">library</a>.</p>
  <h3>What is clean, good data?</h3>
  <p>This is probably the main question to keep in mind when
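A datatrove pipeline like the one referenced in the hunk above can be sketched roughly as follows. This is a minimal, hedged example: the class names and arguments follow datatrove's public examples, but exact signatures may differ between library versions, and the input/output paths are placeholders rather than the ones used for FineWeb.

```python
# Minimal sketch of a datatrove pipeline (class names and arguments follow the
# library's public examples; exact signatures may differ between versions).
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.extractors import Trafilatura
from datatrove.pipeline.filters import GopherQualityFilter, LanguageFilter
from datatrove.pipeline.readers import WarcReader
from datatrove.pipeline.writers.jsonl import JsonlWriter

pipeline = [
    # Read raw WARC files from a CommonCrawl dump (placeholder path/pattern).
    WarcReader("s3://commoncrawl/crawl-data/CC-MAIN-2023-50/segments/", glob_pattern="*/warc/*"),
    Trafilatura(),          # extract the main text from the HTML
    LanguageFilter(),       # keep English documents (default language)
    GopherQualityFilter(),  # heuristic quality filtering
    JsonlWriter("output/filtered/"),
]

# Shard the work across tasks/workers; a Slurm executor exists for clusters.
LocalPipelineExecutor(pipeline=pipeline, tasks=4, workers=4).run()
```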
@@ -324,7 +328,7 @@
  </li>
  </ul>
  <p>After applying this filtering to each of the text
- extracted dumps (there are currently 95 dumps) we obtained roughly 36 trillion tokens of data (when
+ extracted dumps (there are currently 96 dumps) we obtained roughly 36 trillion tokens of data (when
  tokenized with the <code>gpt2</code> tokenizer).</p>
  <h3>Deduplication</h3>
  <p>Deduplication is another important step, especially for web
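Token counts such as the 36-trillion figure above come from running the standard gpt2 tokenizer over every extracted document. A trivial illustrative sketch (the sample texts are placeholders):

```python
# Illustrative only: counting gpt2 tokens for a handful of documents
# (the `texts` list is a stand-in for extracted web documents).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
texts = ["First extracted document ...", "Second extracted document ..."]
total_tokens = sum(len(tokenizer(text)["input_ids"]) for text in texts)
print(f"{total_tokens} gpt2 tokens")
```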
@@ -361,7 +365,7 @@
  <p>It should also be noted that intra-document deduplication is already handled by our repetition filter, which removes documents with many repeated lines and paragraphs.</p>
  <h4>More deduplication is always better, right?</h4>
  <p>Our initial approach was to take the entire dataset (all
- 95 dumps) and deduplicate them as one big dataset using MinHash.</p>
+ 96 dumps) and deduplicate them as one big dataset using MinHash.</p>
  <p>We did this in an iterative manner: starting with the most
  recent dump (which at the time was 2023-50) and taking the oldest one last, we would deduplicate each dump
  not only against itself but also by removing any matches with duplicates from the previously processed
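For readers unfamiliar with MinHash deduplication, the toy sketch below shows the core idea behind the approach mentioned in the hunk above: hash word n-gram shingles, keep the minimum hash under many permutations as a signature, and treat documents whose signatures agree on many positions as near-duplicates. This is an illustration only, with arbitrary parameters, not the datatrove implementation.

```python
# Toy MinHash near-duplicate detection: not the production implementation,
# just the core idea (shingle -> hash -> keep per-permutation minima).
import hashlib
import random

NUM_HASHES = 128
random.seed(0)
# Random (a, b) pairs define NUM_HASHES different hash permutations.
PERMS = [(random.getrandbits(64) | 1, random.getrandbits(64)) for _ in range(NUM_HASHES)]
PRIME = (1 << 61) - 1

def shingles(text, n=5):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(text):
    hashed = [int.from_bytes(hashlib.sha1(s.encode()).digest()[:8], "big") for s in shingles(text)]
    return [min(((a * h + b) % PRIME) for h in hashed) for a, b in PERMS]

def estimated_jaccard(sig1, sig2):
    # Fraction of matching signature positions approximates shingle overlap.
    return sum(x == y for x, y in zip(sig1, sig2)) / NUM_HASHES

doc_a = "the quick brown fox jumps over the lazy dog and runs away"
doc_b = "the quick brown fox jumps over the lazy dog and runs off"
print(estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b)))
```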
@@ -485,18 +489,18 @@
  independently minhash deduped 20 trillion tokens of data by further deduplicating it (globally, over all crawls) with the following methods</p>
  <ul>
  <li>URL deduplication, where we only kept one document per normalized
- (lowercased) URL (71.5% of tokens removed, 5.6T left) — <em>FineWeb URL dedup</em></li>
+ (lowercased) URL (71.5% of tokens removed, 5.6T left) — <em>🍷 FineWeb URL dedup</em></li>
  </ul>
  <ul>
  <li>Line deduplication:
  <ul>
  <li>remove all but 1 (randomly chosen) occurrence of each duplicated line (77.8% of
- tokens dropped, 4.4T left) — <em>FineWeb line dedup</em></li>
+ tokens dropped, 4.4T left) — <em>🍷 FineWeb line dedup</em></li>
  </ul>
  <ul>
  <li>same as above, but only removing duplicate lines with at least 10
  words and dropping documents with fewer than 3 sentences after deduplication (85% of tokens
- dropped, 2.9T left) — <em>FineWeb line dedup w/ min words</em></li>
+ dropped, 2.9T left) — <em>🍷 FineWeb line dedup w/ min words</em></li>
  </ul>
  <ul>
  <li>remove all but 1 occurrence of each span of 3 duplicated lines
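To make the deduplication variants listed in the hunk above concrete, here is a toy sketch of the two simplest ones: keeping one document per normalized (lowercased) URL, and keeping only the first occurrence of each exact line across the corpus. It is illustrative only and omits the randomized-choice and minimum-word variants.

```python
# Toy illustration of URL dedup and exact line dedup (not the production code).
def url_dedup(docs):
    """Keep only one document per normalized (lowercased) URL."""
    seen_urls, kept = set(), []
    for doc in docs:
        url = doc["url"].lower()
        if url not in seen_urls:
            seen_urls.add(url)
            kept.append(doc)
    return kept

def line_dedup(docs):
    """Keep only the first occurrence of every exact line across the corpus."""
    seen_lines, kept = set(), []
    for doc in docs:
        remaining = []
        for line in doc["text"].splitlines():
            if line not in seen_lines:
                seen_lines.add(line)
                remaining.append(line)
        if remaining:
            kept.append({**doc, "text": "\n".join(remaining)})
    return kept

docs = [
    {"url": "https://Example.com/A", "text": "hello\nshared boilerplate"},
    {"url": "https://example.com/a", "text": "duplicate of the first"},
    {"url": "https://example.com/b", "text": "shared boilerplate\nunique line"},
]
print(line_dedup(url_dedup(docs)))
```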
@@ -529,7 +533,7 @@
  benchmark, one of the benchmarks in our “early signal” group with the strongest signal and highest
  signal-over-noise ratio. As such, it has stayed a common sub-set of typical LLM training, for instance in
  the relatively recent Llama1 model<d-cite bibtex-key="touvron2023llama"></d-cite>. We experimented with applying
- each of the different filters used in C4 to a baseline of the independently deduped FineWeb 2019-18 dump:</p>
+ each of the different filters used in C4 to a baseline of the independently deduped 🍷 FineWeb 2019-18 dump:</p>
  <div class="main-plot-container">
  <figure><img src="plots/c4_filters_hellaswag.png"/></figure>
  <div id="plot-c4_filters_hellaswag"></div>
@@ -614,7 +618,7 @@
  <div id="plot-custom-filters"></div>
  </div>
  <h2>The final dataset</h2>
- <p>The final FineWeb dataset comprises 15T tokens and
+ <p>The final 🍷 FineWeb dataset comprises 15T tokens and
  includes the following previously mentioned steps, in order, each providing a performance boost on our group
  of benchmark tasks:</p>
  <ul>
@@ -671,7 +675,7 @@
  <div id="plot-dataset_ablations"></div>
  </div>
  <p>Some histogram comparisons of C4, Dolma, RefinedWeb and
- FineWeb:</p>
+ 🍷 FineWeb:</p>
  <figure><img src="plots/Untitled%203.png"/></figure>
  <h2>📚 FineWeb-Edu</h2>
  <p>A new approach has recently emerged for filtering LLM training datasets: using synthetic data to develop classifiers for identifying educational content. This technique was used in the training of <a href="https://ai.meta.com/blog/meta-llama-3-meta-ai-responsibility/">Llama 3</a> and <a href="https://arxiv.org/abs/2404.14219">Phi3</a>, but its large-scale impact on web data filtering hasn't been fully explored or published.</p>
@@ -679,33 +683,34 @@
  <blockquote>Our training data consists of heavily filtered publicly available web data (according to the 'educational level') from various open internet sources, as well as synthetic LLM-generated data.</blockquote>
  <p>Similarly, the <a href="https://ai.meta.com/blog/meta-llama-3-meta-ai-responsibility/">Llama 3 blog post</a> notes:</p>
  <blockquote>We found that previous generations of Llama are good at identifying high-quality data, so we used Llama 2 to help build the text-quality classifiers that are powering Llama 3.</blockquote>
- <p>However, these classifiers and filtered datasets are not publicly available. To enhance FineWeb's quality, we developed an educational quality classifier using annotations generated by <a href="https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct">Llama3-70B-Instruct</a> to create FineWeb-Edu.</p>
+ <p>However, these classifiers and filtered datasets are not publicly available. To enhance 🍷 FineWeb's quality, we developed an educational quality classifier using annotations generated by <a href="https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct">Llama3-70B-Instruct</a> to create 📚 FineWeb-Edu.</p>
  <h3>Annotation</h3>
- <p>We used Llama3-70B-Instruct to annotate 500k samples from the FineWeb dataset, scoring each for their educational quality on a scale from 0 to 5.</p>
+ <p>We used Llama3-70B-Instruct to annotate 500k samples from the 🍷 FineWeb dataset, scoring each for their educational quality on a scale from 0 to 5.</p>
  <p>We explored various prompts and found that the additive scale by <a href="https://arxiv.org/pdf/2401.10020">Yuan et al.</a> worked best. This scale allows the LLM to reason about each additional point awarded, unlike the single-rating Likert scale which fits samples into predefined boxes. Then, to avoid the LLM favoring highly technical pages like arXiv abstracts and submissions, we focused on grade-school and middle-school level knowledge. By setting a threshold of 3 (on a scale of 0 to 5) during the filtering process, we were able to also retain some high-level educational pages.</p>
  <div style="text-align: center; margin: 20px 0;">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/61c141342aac764ce1654e43/fjZQ4izIj1rx1xQnBTKKr.png" alt="Prompt for LLM annotation" style="width: 90%; max-width: 800px; height: auto;">
- <figcaption style="font-style: italic; margin-top: 10px;">Prompt used for Llama3 annotations of the educational score</figcaption>
- </div>
- <p>We also experimented with different LLMs: <a href="https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct">Llama3-70B-Instruct</a>, <a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1">Mixtral-8x-7B-Instruct</a>, and <a href="https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1">Mixtral-8x22B-Instruct</a>. Llama3 and Mixtral-8x22B produced similar scores, while Mixtral-8x7B tended to be more generous, not fully adhering to the score scale. <a href="https://arxiv.org/abs/2404.18796">Verga et al.</a> suggest using multiple LLMs as juries. We tried averaging the scores from the three models, but this shifted the distribution to the right due to the higher scores from Mixtral-8x7B. Training on a dataset filtered with a classifier using jury annotations performed worse than using a classifier based on Llama3 annotations. We hypothesize that the jury-based approach retains more low-quality samples.</p>
- <div style="text-align: center; margin: 20px 0;">
- <img src="https://cdn-uploads.huggingface.co/production/uploads/61c141342aac764ce1654e43/dQskZA-4fsk8aR_8g9evJ.png" style="width: 80%; max-width: 700px; height: auto;"></figure>
+ <figcaption style="font-style: italic; margin-top: 10px;">Prompt used for Llama3 annotations of the educational score, also available <a href="https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier/blob/main/utils/prompt.txt">here</a>.</figcaption>
  </div>
+ <p>We also experimented with <a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1">Mixtral-8x-7B-Instruct</a> and <a href="https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1">Mixtral-8x22B-Instruct</a> and a jury of all three models following <a href="https://arxiv.org/abs/2404.18796">Verga et al.</a>, but found that Llama3 alone gave the most reliable results.</p>
  <h3>Classifier Training</h3>
- <p>We added a classification head with a single regression output to <a href="https://huggingface.co/Snowflake/snowflake-arctic-embed-m">Snowflake-arctic-embed</a> and trained it on 450,000 Llama3 annotations for 20 epochs with a learning rate of 3e-4, freezing the embedding and encoder layers. We saved the checkpoint with the highest F1 score on our validation set. After training, we rounded the scores to integers from 0 to 5. This approach resulted in the model achieving an F1 score of 82%, indicating robust performance in distinguishing high-quality educational content.</p>
+ <p>We added a classification head with a single regression output to <a href="https://huggingface.co/Snowflake/snowflake-arctic-embed-m">Snowflake-arctic-embed</a> and trained it on 450,000 Llama3 annotations for 20 epochs with a learning rate of 3e-4, freezing the embedding and encoder layers. We saved the checkpoint with the highest F1 score on our held-out validation set of 50,000 samples, treating Llama3 annotations as ground truth. After training, we rounded the scores to integers from 0 to 5.</p>
+ <p>We then converted the problem to a binary classification task by using a fixed threshold to determine if a file is educational. With a threshold of 3, the model achieved an F1 score of 82% on the validation set, indicating strong performance in distinguishing high-quality educational content.</p>
  <p>The classifier is available at: <a href="https://huggingface.co/HuggingFaceTB/snowflake_m_edu_reg">https://huggingface.co/HuggingFaceTB/snowflake_m_edu_reg</a>. The training and inference code is available on <a href="https://github.com/huggingface/cosmopedia/tree/edu-classifier/classification">GitHub</a>.</p>
  <p><strong>TODO: fill model card and move the github code to another folder</strong></p>
- <h3>Filtering</h3>
- <p>We applied the classifier to the 15T tokens of FineWeb, a process that required 6,000 H100 GPU hours. To build FineWeb-Edu, we filtered out samples with scores lower than 3. This removed 92% of the dataset, leaving us with 1.2T educational tokens. Here are the key highlights of the ablation results:</p>
- <ul>
- <li>FineWeb-Edu surpasses FineWeb and all other web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA.</li>
+ <h3>Filtering and results</h3>
+ <p>We applied the classifier to the 15T tokens of 🍷 FineWeb, a process that required 6,000 H100 GPU hours. We investigated the impact of using different thresholds for the filtering and found that threshold 3 gave the best results. The plot below shows the performance of each threshold compared to FineWeb on six different benchmarks; it uses a 1.82B model trained on 8B tokens.</p>
+ <p><strong>TODO: add the plot</strong></p>
+ <p>We then built 📚 FineWeb-Edu by filtering out samples with scores lower than 3. This removed 92% of the dataset, leaving us with 1.2T educational tokens. To evaluate the effectiveness of this filtering at a larger scale, we conducted an ablation using a 1.82B model trained on 350 billion tokens, similar to the FineWeb filtering ablation mentioned above:</p>
+ <p><strong>TODO: add the plot</strong></p>
+ <p>Here are the key highlights of the ablation results above:</p>
+ <ul>
+ <li>📚 FineWeb-Edu surpasses 🍷 FineWeb and all other open web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA.</li>
  <li>It achieves the same performance with significantly less data, requiring 10x fewer tokens compared to C4 and Dolma1.7 to match MMLU results.</li>
- <li>It gives strong performance boosts on benchmarks like MMLU and ARC without trying to overfit on them.</li>
  <li>This demonstrates the effectiveness of using classifiers trained on LLM annotations for large-scale data filtering.</li>
  </ul>
- <p>To keep more tokens, we also experimented with a less strict threshold of 2 instead of 3. This approach preserved 4.5T tokens and still outperformed the FineWeb dataset, with performance just slightly below that of threshold 3.</p>
- <p>We release these two datasets as FineWeb-Edu and FineWeb-edu-Large along with the classifier used for the filtering.</p>
- <p><strong>TODO: add ablation results and dataset links, and maybe FineWeb-edu-smol</strong></p>
+ <p>Given that a threshold of 2 also demonstrated strong performance while retaining more data, we are releasing an additional dataset filtered with this threshold, containing 5.4 trillion tokens. Additionally, for research purposes, we are providing the dataset filtered with a threshold of 4 with 300 billion tokens.</p>
+ <p>You can find the three datasets along with the classifier used for the filtering in this collection: TODO</p>
+ <p><strong>TODO: add dataset links and a collection</strong></p>
  <h2>Next steps</h2>
  <p>We want to continue improving FineWeb and will also
  release a technical report with more details soon.</p>
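The classifier training described in the hunk above (a single regression output on top of a frozen Snowflake-arctic-embed encoder, trained on Llama3 scores and binarized at a threshold of 3) can be sketched as follows. The model name and hyperparameters are taken from the text; the annotation file, column names, and metric wiring are placeholders, and the released training code linked in the hunk remains the reference.

```python
# Hedged sketch: regression head on a frozen Snowflake-arctic-embed-m encoder,
# trained on Llama3 educational-quality scores. File and column names
# ("llama3_annotations.jsonl", "text", "score") are placeholders.
import numpy as np
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "Snowflake/snowflake-arctic-embed-m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=1, problem_type="regression")

# Freeze embeddings and encoder; only the classification head is trained.
for param in model.base_model.parameters():
    param.requires_grad = False

ds = load_dataset("json", data_files="llama3_annotations.jsonl")["train"]
ds = ds.map(lambda x: {"labels": float(x["score"])})
ds = ds.map(lambda x: tokenizer(x["text"], truncation=True), batched=True)
ds = ds.train_test_split(test_size=0.1, seed=0)

def compute_metrics(eval_pred):
    preds, labels = eval_pred
    # Round regression outputs to integer scores, then binarize at threshold 3.
    preds = np.clip(np.round(preds.squeeze()), 0, 5)
    return {"f1": f1_score(labels >= 3, preds >= 3)}

args = TrainingArguments(
    output_dir="edu-classifier", learning_rate=3e-4, num_train_epochs=20,
    evaluation_strategy="epoch", save_strategy="epoch",
    load_best_model_at_end=True, metric_for_best_model="f1",
)
trainer = Trainer(model=model, args=args, train_dataset=ds["train"],
                  eval_dataset=ds["test"], tokenizer=tokenizer,
                  compute_metrics=compute_metrics)
trainer.train()
```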
 
src/distill.js ADDED
The diff for this file is too large to render. See raw diff
 
style.css CHANGED
@@ -120,3 +120,16 @@
  display: flex !important;
  }
  }
+
+ d-byline .byline {
+   grid-template-columns: 1fr;
+   grid-column: text;
+   font-size: 0.9rem;
+   line-height: 1.8em;
+ }
+
+ @media (min-width: 768px) {
+   d-byline .byline {
+     grid-template-columns: 5fr 1fr 1fr;
+   }
+ }