guipenedo (HF staff) committed on
Commit 7c87bf2 · 1 Parent(s): 016974c

formatting changes

Files changed (4):
  1. bibliography.bib +19 -0
  2. index.html +44 -40
  3. src/distill.js +0 -0
  4. style.css +11 -2
bibliography.bib CHANGED
@@ -190,4 +190,23 @@
   eprint={2205.10487},
   archivePrefix={arXiv},
   primaryClass={cs.LG}
+}
+@article{llama3modelcard,
+  title={Llama 3 Model Card},
+  author={AI@Meta},
+  year={2024},
+  url={https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
+}
+@misc{jiang2024mixtral,
+  title={Mixtral of Experts},
+  author={Albert Q. Jiang and Alexandre Sablayrolles and Antoine Roux and Arthur Mensch and Blanche Savary and Chris Bamford and Devendra Singh Chaplot and Diego de las Casas and Emma Bou Hanna and Florian Bressand and Gianna Lengyel and Guillaume Bour and Guillaume Lample and Lélio Renard Lavaud and Lucile Saulnier and Marie-Anne Lachaux and Pierre Stock and Sandeep Subramanian and Sophia Yang and Szymon Antoniak and Teven Le Scao and Théophile Gervet and Thibaut Lavril and Thomas Wang and Timothée Lacroix and William El Sayed},
+  year={2024},
+  eprint={2401.04088},
+  archivePrefix={arXiv},
+  primaryClass={cs.LG}
 }
index.html CHANGED
@@ -1,7 +1,7 @@
 <!doctype html>
 
 <head>
-    <script src="https://distill.pub/template.v2.js"></script>
+    <script src="src/distill.js"></script>
     <script src="https://cdnjs.cloudflare.com/ajax/libs/mathjs/12.4.2/math.min.js" charset="utf-8"></script>
     <script src="https://cdn.plot.ly/plotly-2.32.0.min.js" charset="utf-8"></script>
     <script src="https://cdnjs.cloudflare.com/ajax/libs/lodash.js/4.17.21/lodash.min.js" charset="utf-8"></script>
@@ -120,27 +120,19 @@
 <body>
     <d-front-matter>
     <script id='distill-front-matter' type="text/json">{
-    "title": "FineWeb: 15T tokens of high quality web data",
-    "description": "This blog covers the FineWeb recipe, why more deduplication is not always better and some interesting findings on the difference in quality of CommonCrawl dumps.",
+    "title": "🍷 FineWeb: decanting the web for the finest text data at scale",
+    "description": "This blog covers a discussion on processing and evaluating data quality at scale, the 🍷 FineWeb recipe (listing and explaining all of our design choices), and the process followed to create 📚 FineWeb-Edu.",
     "published": "May 28, 2024",
+    "affiliation": {"name": "HuggingFace"},
     "authors": [
       {
         "author":"Guilherme Penedo",
-        "authorURL":"https://huggingface.co/guipenedo",
-        "affiliations": [{"name": "HuggingFace"}]
+        "authorURL":"https://huggingface.co/guipenedo"
       },
       {
         "author":"Hynek Kydlíček",
         "authorURL":"https://huggingface.co/hynky"
       },
-      {
-        "author":"Leandro Werra",
-        "authorURL":"https://huggingface.co/lvwerra"
-      },
-      {
-        "author":"Thomas Wolf",
-        "authorURL":"https://huggingface.co/thomwolf"
-      },
       {
         "author":"Loubna Ben Allal",
         "authorURL":"https://huggingface.co/loubnabnl"
@@ -148,6 +140,18 @@
       {
         "author":"Anton Lozhkov",
         "authorURL":"https://huggingface.co/anton-l"
+      },
+      {
+        "author":"Colin Raffel",
+        "authorURL":"https://huggingface.co/craffel"
+      },
+      {
+        "author":"Leandro Werra",
+        "authorURL":"https://huggingface.co/lvwerra"
+      },
+      {
+        "author":"Thomas Wolf",
+        "authorURL":"https://huggingface.co/thomwolf"
       }
     ],
     "katex": {
@@ -169,18 +173,18 @@
 </d-contents>
 
 <p>We have recently released <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb"><strong>🍷 FineWeb</strong></a>, our new large scale
-    (<strong>15T</strong> gpt2 tokens, <strong>44TB</strong> disk space) dataset of clean text sourced from the web for LLM pretraining. You can
+    (<strong>15T gpt2 tokens, 44TB disk space</strong>) dataset of clean text sourced from the web for LLM pretraining. You can
     download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">here</a>.</p>
-    <p>[TODO: ADD MORE INTRODUCTION]</p>
-    <p>We are also excited to announce the release of <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu"><strong>📚 FineWeb-Edu</strong></a>, a filtered version of FineWeb for educational content, available in two sizes: <strong>1.2 trillion and 4.5 trillion tokens</strong>. FineWeb-Edu outperforms all existing public web datasets, with notable improvements on MMLU, ARC, and OpenBookQA benchmarks. You can
+    <p>The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. However, the pretraining datasets for state-of-the-art open LLMs like Llama 3<d-cite bibtex-key="llama3modelcard"></d-cite> and Mixtral<d-cite bibtex-key="jiang2024mixtral"></d-cite> are not publicly available and very little is known about how they were created.</p>
+    <p>🍷 FineWeb, a 15-trillion token dataset derived from 96 <a href="https://commoncrawl.org/">CommonCrawl</a> snapshots, produces better-performing LLMs than other open pretraining datasets. To advance the understanding of how best to curate high-quality pretraining datasets, we carefully document and ablate all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies.</p>
+    <p>We are also excited to announce the release of <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu"><strong>📚 FineWeb-Edu</strong></a>, a version of 🍷 FineWeb that was filtered for educational content, available in two sizes: <strong>1.3 trillion (very high quality) and 5.4 trillion (high quality) tokens</strong>. 📚 FineWeb-Edu outperforms all existing public web datasets, with models pretrained on it showing notable improvements on knowledge- and reasoning-intensive benchmarks like MMLU, ARC, and OpenBookQA. You can
     download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu">here</a>.</p>
 
-    <p>As 🍷FineWeb has gathered a lot of interest from the
+    <p>As 🍷 FineWeb has gathered a lot of interest from the
     community, we decided to further explain the steps involved in creating it, our processing decisions and
     some lessons learned along the way. Read on for all the juicy details on large text dataset creation!</p>
-    <p><strong>TLDR:</strong> This blog covers the FineWeb
-    recipe, why more deduplication is not always better and some interesting findings on the difference in
-    quality of CommonCrawl dumps.</p>
+    <p><strong>TLDR:</strong> This blog covers a discussion on processing and evaluating data quality at scale, the 🍷 FineWeb
+    recipe (listing and explaining all of our design choices), and the process followed to create 📚 FineWeb-Edu.</p>
 
 <h2>General considerations on web data</h2>
 <h3>Sourcing the data</h3>
@@ -196,13 +200,13 @@
     <li>you use a public repository of crawled webpages, like the one maintained by
     the non-profit <a href="https://commoncrawl.org/">CommonCrawl</a></li>
 </ul>
-    <p>For FineWeb, similarly to what was done for a large number
+    <p>For 🍷 FineWeb, similarly to what was done for a large number
     of other public datasets, we used <a href="https://commoncrawl.org/">CommonCrawl</a> as a starting point.
-    They have been crawling the web since 2007 (long before LLMs were a thing) and release a new dump usually
+    They have been crawling the web since 2007 (long before LLMs became widespread) and release a new dump usually
     every 1 or 2 months, which can be freely downloaded. </p>
-    <p>As an example, their latest crawl (2024-10) contains 3.16
-    billion web pages, totaling 424.7 TiB of uncompressed HTML text content (the size changes from dump to dump). There
-    are 95 dumps since 2013 and 3 dumps from 2008 to 2012, which are in a different (older) format.<d-footnote>We have not processed these 3 older dumps.</d-footnote> </p>
+    <p>As an example, their latest crawl (2024-18) contains 2.7
+    billion web pages, totaling 386 TiB of uncompressed HTML text content (the size changes from dump to dump). There
+    are 96 dumps since 2013 and 3 dumps from 2008 to 2012, which are in a different (older) format.<d-footnote>We have not processed these 3 older dumps.</d-footnote> </p>
 <h3>Processing at scale</h3>
 <p>Given the sheer size of the data involved, one of the main
 challenges we had to overcome was having a modular, scalable codebase that would allow us to quickly iterate
@@ -211,7 +215,7 @@
 <p>For this purpose, we developed <a
 href="https://github.com/huggingface/datatrove"><code>datatrove</code></a><d-cite bibtex-key="penedo2024datatrove"></d-cite>, an open-source data
 processing library that allowed us to seamlessly scale our filtering and deduplication setup to thousands of
-    CPU cores. All the data processing steps involved in the creation of FineWeb used this <a
+    CPU cores. All the data processing steps involved in the creation of 🍷 FineWeb used this <a
 href="https://github.com/huggingface/datatrove">library</a>.</p>
 <h3>What is clean, good data?</h3>
 <p>This is probably the main question to keep in mind when
@@ -319,7 +323,7 @@
     </li>
 </ul>
 <p>After applying this filtering to each of the text
-    extracted dumps (there are currently 95 dumps) we obtained roughly 36 trillion tokens of data (when
+    extracted dumps (there are currently 96 dumps) we obtained roughly 36 trillion tokens of data (when
     tokenized with the <code>gpt2</code> tokenizer).</p>
 <h3>Deduplication</h3>
 <p>Deduplication is another important step, specially for web
@@ -356,7 +360,7 @@
 <p>It should also be noted that intra-document deduplication is already handled by our repetition filter, which removes documents with many repeated lines and paragraphs.</p>
 <h4>More deduplication is always better, right?</h4>
 <p>Our initial approach was to take the entire dataset (all
-    95 dumps) and deduplicate them as one big dataset using MinHash.</p>
+    96 dumps) and deduplicate them as one big dataset using MinHash.</p>
 <p>We did this in an iterative manner: starting with the most
 recent dump (which at the time was 2023-50) and taking the oldest one last, we would deduplicate each dump
 not only against itself but also by removing any matches with duplicates from the previously processed
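
As a rough illustration of the iterative MinHash deduplication described in the hunk above, here is a minimal sketch assuming the `datasketch` library; the shingle size, number of permutations, and similarity threshold are illustrative guesses, not the actual datatrove configuration used for FineWeb.

```python
# Illustrative sketch only -- not the datatrove implementation used for FineWeb.
# Assumes the `datasketch` library; shingle size, num_perm and threshold are guesses.
from datasketch import MinHash, MinHashLSH


def shingles(text: str, n: int = 5):
    """Yield word n-grams ("shingles") used as MinHash features."""
    words = text.split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])


def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for s in shingles(text):
        m.update(s.encode("utf-8"))
    return m


def dedup_dumps(dumps):
    """dumps: iterable of (dump_name, [(doc_id, text), ...]), most recent dump first.

    A single LSH index is kept across dumps, so each dump is deduplicated against
    itself *and* against every previously processed (more recent) dump, mirroring
    the iterative global approach described above.
    """
    lsh = MinHashLSH(threshold=0.75, num_perm=128)  # fuzzy-match threshold (assumed)
    kept = []
    for dump_name, docs in dumps:
        for doc_id, text in docs:
            m = minhash(text)
            if lsh.query(m):  # near-duplicate of a document already kept
                continue
            lsh.insert(f"{dump_name}/{doc_id}", m)
            kept.append((dump_name, doc_id))
    return kept
```

The LSH index keeps the lookup for near-duplicates roughly constant-time per document, which is what makes this kind of fuzzy deduplication feasible at crawl scale.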
@@ -480,18 +484,18 @@
 independently minhash deduped 20 trillion tokens of data by further deduplicating it (globally, over all crawls) with the following methods</p>
 <ul>
     <li>URL deduplication, where we only kept one document per normalized
-    (lowercased) URL (71.5% of tokens removed, 5.6T left) — <em>FineWeb URL dedup</em></li>
+    (lowercased) URL (71.5% of tokens removed, 5.6T left) — <em>🍷 FineWeb URL dedup</em></li>
 </ul>
 <ul>
     <li>Line deduplication:
     <ul>
         <li>remove all but 1 (randomly chosen) occurrence of each duplicated line (77.8% of
-    tokens dropped, 4.4T left) — <em>FineWeb line dedup</em></li>
+    tokens dropped, 4.4T left) — <em>🍷 FineWeb line dedup</em></li>
     </ul>
     <ul>
         <li>same as above, but only removing duplicate lines with at least 10
         words and dropping documents with fewer than 3 sentences after deduplication (85% of tokens
-    dropped, 2.9T left) — <em>FineWeb line dedup w/ min words</em></li>
+    dropped, 2.9T left) — <em>🍷 FineWeb line dedup w/ min words</em></li>
     </ul>
     <ul>
         <li>remove all but 1 occurrence of each span of 3 duplicated lines
@@ -524,7 +528,7 @@
 benchmark, one of the benchmarks in our “early signal” group with the stronger signal and highest
 signal-over-noise ratio. As such, it has stayed a common sub-set of typical LLM training, for instance in
 the relatively recent Llama1 model<d-cite bibtex-key="touvron2023llama"></d-cite>. We experimented applying
-each of the different filters used in C4 to a baseline of the independently deduped FineWeb 2019-18 dump:</p>
+each of the different filters used in C4 to a baseline of the independently deduped 🍷 FineWeb 2019-18 dump:</p>
 <div class="main-plot-container">
     <figure><img src="plots/c4_filters_hellaswag.png"/></figure>
     <div id="plot-c4_filters_hellaswag"></div>
@@ -609,7 +613,7 @@
     <div id="plot-custom-filters"></div>
 </div>
 <h2>The final dataset</h2>
-<p>The final FineWeb dataset comprises 15T tokens and
+<p>The final 🍷 FineWeb dataset comprises 15T tokens and
 includes the following previously mentioned steps, in order, each providing a performance boost on our group
 of benchmark tasks:</p>
 <ul>
@@ -666,7 +670,7 @@
     <div id="plot-dataset_ablations"></div>
 </div>
 <p>Some histogram comparisons of C4, Dolma, RefinedWeb and
-FineWeb:</p>
+🍷 FineWeb:</p>
 <figure><img src="plots/Untitled%203.png"/></figure>
 <h2>📚 FineWeb-Edu</h2>
 <p>A new approach has recently emerged for filtering LLM training datasets: using synthetic data to develop classifiers for identifying educational content. This technique was used in the trainings of <a href="https://ai.meta.com/blog/meta-llama-3-meta-ai-responsibility/">LLama3</a> and <a href="https://arxiv.org/abs/2404.14219">Phi3</a>, but its large-scale impact on web data filtering hasn't been fully explored or published.</p>
@@ -674,9 +678,9 @@
 <blockquote>Our training data consists of heavily filtered publicly available web data (according to the 'educational level') from various open internet sources, as well as synthetic LLM-generated data.</blockquote>
 <p>Similarly, <a href="https://ai.meta.com/blog/meta-llama-3-meta-ai-responsibility/">LLama3 blog post</a> notes:</p>
 <blockquote>We found that previous generations of Llama are good at identifying high-quality data, so we used Llama 2 to help build the text-quality classifiers that are powering Llama 3.</blockquote>
-<p>However, these classifiers and filtered datasets are not publicly available. To enhance FineWeb's quality, we developed an educational quality classifier using annotations generated by <a href="https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct">Llama3-70B-Instruct</a> to create FineWeb-Edu.</p>
+<p>However, these classifiers and filtered datasets are not publicly available. To enhance 🍷 FineWeb's quality, we developed an educational quality classifier using annotations generated by <a href="https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct">Llama3-70B-Instruct</a> to create 📚 FineWeb-Edu.</p>
 <h3>Annotation</h3>
-<p>We used Llama3-70B-Instruct to annotate 500k samples from the FineWeb dataset, scoring each for their educational quality on a scale from 0 to 5.</p>
+<p>We used Llama3-70B-Instruct to annotate 500k samples from the 🍷 FineWeb dataset, scoring each for their educational quality on a scale from 0 to 5.</p>
 <p>We explored various prompts and found that the additive scale by <a href="https://arxiv.org/pdf/2401.10020">Yuan et al.</a> worked best. This scale allows the LLM to reason about each additional point awarded, unlike the single-rating Likert scale which fits samples into predefined boxes. Then, to avoid the LLM favoring highly technical pages like arXiv abstracts and submissions, we focused on grade-school and middle-school level knowledge. By setting a threshold of 3 (on a scale of 0 to 5) during the filtering process, we were able to also retain some high-level educational pages.</p>
 <div style="text-align: center; margin: 20px 0;">
     <img src="https://cdn-uploads.huggingface.co/production/uploads/61c141342aac764ce1654e43/fjZQ4izIj1rx1xQnBTKKr.png" alt="Prompt for LLM annotation" style="width: 90%; max-width: 800px; height: auto;">
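
As a rough illustration of the annotation step in the hunk above, the sketch below assumes a recent version of `huggingface_hub` exposing `InferenceClient.chat_completion`; the prompt wording and the score-parsing logic are placeholders, not the actual prompt shown in the linked image or the pipeline used for FineWeb-Edu.

```python
# Illustrative sketch only -- the real annotation prompt and pipeline are not
# reproduced here; prompt text and score parsing below are placeholders.
import re
from huggingface_hub import InferenceClient

client = InferenceClient("meta-llama/Meta-Llama-3-70B-Instruct")


def score_educational_quality(text: str) -> int:
    """Ask the LLM for an additive 0-5 educational-quality score (assumed prompt)."""
    prompt = (
        "Evaluate the following web page extract for its educational value "
        "using an additive 0-5 scale, then answer with 'Educational score: <n>'.\n\n"
        f"Extract:\n{text[:3000]}"
    )
    out = client.chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    answer = out.choices[0].message.content
    match = re.search(r"Educational score:\s*([0-5])", answer)
    return int(match.group(1)) if match else 0
```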
@@ -689,15 +693,15 @@
 <p>The classifier is available at: <a href="https://huggingface.co/HuggingFaceTB/snowflake_m_edu_reg">https://huggingface.co/HuggingFaceTB/snowflake_m_edu_reg</a>. The training and inference code is available on <a href="https://github.com/huggingface/cosmopedia/tree/edu-classifier/classification">GitHub</a>.</p>
 <p><strong>TODO: fill model card and move the github code to another folder</strong></p>
 <h3>Filtering</h3>
-<p>We applied the classifier to the 15T tokens of FineWeb, a process that required 6,000 H100 GPU hours. To build FineWeb-Edu, we filtered out samples with scores lower than 3. This removed 92% of the dataset, leaving us with 1.2T educational tokens. Here are the key highlights of the ablation results:</p>
+<p>We applied the classifier to the 15T tokens of 🍷 FineWeb, a process that required 6,000 H100 GPU hours. To build 📚 FineWeb-Edu, we filtered out samples with scores lower than 3. This removed 92% of the dataset, leaving us with 1.2T educational tokens. Here are the key highlights of the ablation results:</p>
 <ul>
-<li>FineWeb-Edu surpasses FineWeb and all other web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA.</li>
+<li>📚 FineWeb-Edu surpasses 🍷 FineWeb and all other open web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA.</li>
 <li>It achieves the same performance with significantly less data, requiring 10x fewer tokens compared to C4 and Dolma1.7 to match MMLU results.</li>
 <li>It gives strong performance boosts on benchmarks like MMLU and ARC without trying to overfit on them.</li>
 <li>This demonstrates the effectiveness of using classifiers trained on LLM annotations for large-scale data filtering.</li>
 </ul>
-<p>To keep more tokens, we also experimented with a less strict threshold of 2 instead of 3. This approach preserved 4.5T tokens and still outperformed the FineWeb dataset, with performance just slightly below that of threshold 3.</p>
-<p>We release these two datasets as FineWeb-Edu and FineWeb-edu-Large along with the classifier used for the filtering.</p>
+<p>To keep more tokens, we also experimented with a less strict threshold of 2 instead of 3. This approach preserved 4.5T tokens and still outperformed the 🍷 FineWeb dataset, with performance just slightly below that of threshold 3.</p>
+<p>We release these two datasets as 📚 FineWeb-Edu and 📚 FineWeb-edu-Large along with the classifier used for the filtering.</p>
 <p><strong>TODO: add ablation results and dataset links, and maybe FineWeb-edu-smol</strong></p>
 <h2>Next steps</h2>
 <p>We want to continue improving FineWeb and will also
src/distill.js ADDED
The diff for this file is too large to render. See raw diff
 
style.css CHANGED
@@ -121,6 +121,15 @@
 }
 }
 
+d-byline .byline {
+  grid-template-columns: 1fr;
+  grid-column: text;
+  font-size: 0.9rem;
+  line-height: 1.8em;
+}
 
-
-
+@media (min-width: 768px) {
+  d-byline .byline {
+    grid-template-columns: 5fr 1fr 1fr;
+  }
+}