thomwolf HF Staff committed on
Commit
48424a8
·
1 Parent(s): 0df1973
Files changed (2)
  1. dist/index.html +51 -55
  2. src/index.html +48 -53
dist/index.html CHANGED
@@ -108,7 +108,7 @@
108
  releases a new crawl containing 200 to 400 TiB of textual content obtained via automatic web crawling usually
109
  every 1 or 2 months. </p>
110
  <p>As an example, the latest CC crawl (April 2024) contains 2.7
111
- billion web pages, totaling 386 TiB of uncompressed HTML text content<d-footnote>Note that the size changes from crawl to crawl</d-footnote>.
112
  Ninety-six crawls have been released since 2013 and 3 crawls from 2008 to 2012, which are in a different (older) format.
113
  <d-footnote>We have not processed these 3 older crawls.</d-footnote> </p>
114
 
@@ -150,7 +150,7 @@
150
  scores.</p>
151
  <p>Our ablation models were trained using <a
152
  href="https://github.com/huggingface/nanotron"><code>nanotron</code></a> with this config [<strong>TODO:
153
- INSERT SIMPLIFIED NANOTRON CONFIG HERE</strong>]. Our ablation models have 1.82B parameters (including embeddings), used the Llama
154
  architecture with a 2048 sequence length, a global batch size of ~2 million tokens, and the GPT2 tokenizer. For most
155
  ablations we trained on ~28B tokens (roughly the Chinchilla<d-cite bibtex-key="hoffmann2022training"></d-cite> optimal training size for this
156
  model size). To confirm relative performance improvements after each step of filtering we conducted longer training runs on 350 billion tokens as mentioned further below.</p>
@@ -161,7 +161,7 @@
161
  <ul>
162
  <li>small variance between runs trained on different samplings of the same
163
  dataset: we want our runs on a subset of the data to be representative of the whole dataset, and the
164
- resulting scores to have as little sensitivity to exact data choice as possible (apart from larger ablations that we are concerned with)
165
  </li>
166
  </ul>
167
  <ul>
@@ -204,10 +204,10 @@
204
  full page HTML and request metadata. <strong>WET</strong> (WARC Encapsulated Text) files provide a text only
205
  version of those websites.</p>
206
  <p>A large number of datasets take the WET files as their
207
- starting point. In our experience the default text extraction (extracting the main text of a webpage from
208
- its HTML) used to create these WET files is suboptimal and there are a variety of open-source libraries that
209
- provide better text extraction (by, namely, keeping less boilerplate content/navigation menus). We extracted
210
- the text content from the WARC files using the trafilatura library<d-cite bibtex-key="barbaresi-2021-trafilatura"></d-cite>, which from visual inspection of the results provided good quality extraction when compared to other libraries.</p><aside>You can also find a benchmark on text extraction libraries <a href="https://github.com/scrapinghub/article-extraction-benchmark/blob/master/README.rst">here</a>.</aside>
211
  <p>To validate this decision, we processed the 2019-18 dump
212
  directly using the WET files and with text extracted from WARC files using trafilatura<d-footnote>We used trafilatura default options with <code>favour_precision=True</code>.</d-footnote>. We applied the same
213
  processing to each one (our base filtering+minhash, detailed below) and trained two models. While the
@@ -221,10 +221,11 @@
221
  <figure><img src="assets/images/wet_comparison.png"/></figure>
222
  <div id="plot-wet_comparison"></div>
223
  </div>
 
224
  <h3>Base filtering</h3>
225
- <p>Filtering is an important part of the curation process. It
226
- removes part of the data (be it words, lines, or full documents) that would harm performance and is thus
227
- deemed to be “lower quality”.</p>
228
  <p>As a basis for our filtering we used part of the setup
229
  from RefinedWeb<d-cite bibtex-key="penedo2023refinedweb"></d-cite>. Namely, we:</p>
230
  <ul>
@@ -243,26 +244,25 @@
243
  </li>
244
  </ul>
245
  <p>After applying this filtering to each of the text
246
- extracted dumps (there are currently 96 dumps) we obtained roughly 36 trillion tokens of data (when
247
- tokenized with the <code>gpt2</code> tokenizer).</p>
248
  <h3>Deduplication</h3>
249
  <p>Deduplication is one of the most important steps when creating large web datasets for LLM pretraining. Methods to deduplicate datasets attempt to identify and remove redundant/repeated data from the dataset. </p>
250
  <h4>Why deduplicate?</h4>
251
  <p>The web has many aggregators, mirrors, templated pages or
252
- just otherwise repeated content spread over different domains and webpages. Often, these duplicated pages
253
- can be introduced by the crawler itself, when different links point to the same page. </p>
254
- <p>Removing these duplicates (deduplicating) has been linked to an improvement in model performance<d-cite bibtex-key="lee2022deduplicating"></d-cite> and a reduction in memorization of pretraining data<d-cite bibtex-key="carlini2023quantifying"></d-cite>, which might
255
- allow for better generalization. Additionally, the performance uplift obtained through deduplication can also be tied to increased training
256
- efficiency: by removing duplicated content, for the same number of training tokens, a model will have seen
257
- more diverse data.<d-cite bibtex-key="muennighoff2023scaling"></d-cite><d-cite bibtex-key="hernandez2022scaling"></d-cite></p>
258
  <p>There are different ways to identify and even define
259
  duplicated data. Common approaches rely on hashing techniques to speed up the process, or on building
260
  efficient data structures to index the data (like suffix arrays). Methods can also be “fuzzy”, by using some
261
  similarity metric to mark documents as duplicates, or “exact” by checking for exact matches between two
262
- documents (or lines, paragraphs, or whatever other granularity level being used).</p>
 
263
  <h4>Our deduplication parameters</h4>
264
- <p>Similarly to RefinedWeb<d-cite bibtex-key="penedo2023refinedweb"></d-cite>, we decided to apply MinHash, a
265
- fuzzy hash based deduplication technique that scales well and allows us to tune similarity thresholds (by changing the number and size of buckets) and the granularity of the matches (by changing the n-gram size). We chose to compute minhashes on each document’s 5-grams, using
266
  112 hash functions in total, split into 14 buckets of 8 hashes each — targeting documents that are at least
267
  75% similar. Documents with the same 8 minhashes in any bucket are considered a duplicate of each other.</p>
268
  <p>This would mean that for two documents with a similarity ($$s$$)
@@ -278,28 +278,24 @@
278
  allows for a steeper, more well defined cut off (documents with real similarity near the threshold are more likely to be correctly identified), we believe the compute and storage savings are a reasonable
279
  trade off.</p>
280
  <p>It should also be noted that intra-document deduplication is already handled by our repetition filter, which removes documents with many repeated lines and paragraphs.</p>
 
281
  <h4>More deduplication is always better, right?</h4>
282
- <p>Our initial approach was to take the entire dataset (all
283
  90+ dumps) and deduplicate them together as one big dataset using MinHash.</p>
284
  <p>We did this in an iterative manner: starting with the most
285
- recent dump (which at the time was 2023-50) and proceeding chronologically until the oldest one, we would deduplicate each dump
286
- not only within itself, but we would also remove any matches with documents from the previously processed (more recent)
287
  dumps. </p>
288
  <p>For instance, for the second most recent dump (2023-40 at
289
- the time), we deduplicated it against the most recent one in addition to within itself. In particular, the oldest
290
- dump was deduplicated against all other dumps. As a result, more data was removed from the oldest dumps (last
291
- to be deduplicated) than from the most recent ones.</p>
292
  <p>Deduplicating the dataset in this manner resulted in 4
293
- trillion tokens of data, but, quite surprisingly for us, when training on a randomly sampled 350 billion
294
- tokens subset, the model showed no improvement over one trained on the non deduplicated data (see orange and
295
- green curves below), scoring far below its predecessor RefinedWeb on our aggregate of tasks.</p>
296
  <div class="main-plot-container">
297
  <figure><img src="assets/images/dedup_all_dumps_bad.png"/></figure>
298
  <div id="plot-all_dumps_bad"></div>
299
  </div>
300
- <p>This was quite puzzling as our intuition regarding web
301
- data was that more deduplication would always result in improved performance. We decided to take a closer
302
- look at one of the oldest dumps, dump 2013-48:</p>
303
  <ul>
304
  <li>pre deduplication, this dump had ~490 billion tokens</li>
305
  </ul>
@@ -326,23 +322,22 @@
326
  <figure><img src="assets/images/removed_data_cross_dedup.png"/></figure>
327
  <div id="plot-removed_data_dedup"></div>
328
  </div>
329
- <p>These results show that, for this older dump from which we had
330
- removed over 90% of the original data, the data that was kept was actually <em>worse</em> than the data
331
- removed (considered independently of all the other dumps). This is also confirmed by visual inspection: <em>originally kept
332
  data</em> contains far more ads, lists of keywords and generally badly formatted text than <em>originally removed data</em>.</p>
333
  <h4>Taking a step back: individual dump dedup</h4>
334
- <p>We then tried an alternative approach: we deduplicated
335
- each dump with MinHash individually (without considering the other dumps). This resulted in 20 trillion
336
  tokens of data.</p>
337
  <p>When training on a random sample from this dataset we see
338
- that it now matches RefinedWeb’s performance (blue and red curves below):</p>
339
  <div class="main-plot-container">
340
  <figure><img src="assets/images/cross_ind_unfiltered_comparison.png"/></figure>
341
  <div id="plot-ind_dedup_better"></div>
342
  </div>
343
  <p>We hypothesize that the main improvement gained from
344
  deduplication is the removal of very large clusters that are present in every single dump (you will find
345
- some examples of these clusters on the RefinedWeb paper, each containing <em>hundreds of thousands</em> of
346
  documents) and that further deduplication for clusters with a low number of duplicates (less than ~100 i.e. the number
347
  of dumps) actually harms performance: data that does not find a duplicate match in any other dump might
348
  actually be worse quality/more out of distribution (as evidenced by the results on the 2013-48 data). </p>
@@ -353,7 +348,8 @@
353
  improves, this effect may not be as prevalent, since the filtering might be able to remove some of this
354
  lower quality data. We also experimented with applying different, and often “lighter”, deduplication
355
  approaches on top of the individually deduplicated dumps. You can read about them further below.</p>
356
- <h4>A note on measuring the effect of deduplication</h4>
 
357
  <p>Given the nature of deduplication, its effect is not
358
  always very visible in a smaller slice of the dataset (such as 28B tokens, the size we used for our
359
  filtering ablations). Furthermore, one must consider the fact that there are specific effects at play when
@@ -366,7 +362,7 @@
366
  </ul>
367
  <ul>
368
  <li>each dump has been perfectly individually deduplicated (every single
369
- document in it is unique)
370
  </li>
371
  </ul>
372
  <ul>
@@ -401,9 +397,10 @@
401
  documents duplicated up to 8 times. This simulation illustrates the inherent difficulties associated with
402
  measuring deduplication impact on the training of LLMs, once the biggest duplicate clusters have been
403
  removed.</p>
 
404
  <h4>Other (failed) global approaches</h4>
405
- <p>We attempted to improve the performance of the
406
- independently minhash deduped 20 trillion tokens of data by further deduplicating it (globally, over all dumps) with the following methods:</p>
407
  <ul>
408
  <li>URL deduplication, where we only kept one document per normalized
409
  (lowercased) URL (71.5% of tokens removed, 5.6T left) — <em>FineWeb URL dedup</em></li>
@@ -433,10 +430,10 @@
433
  <figure><img src="assets/images/dedup_attempts.png"/></figure>
434
  <div id="plot-dedup_attempts"></div>
435
  </div>
 
436
  <h3>Additional filtering</h3>
437
- <p>By this point we had reached the same performance as
438
- RefinedWeb with base filtering + independent MinHash, but on our aggregate of tasks, another heavily filtered dataset, the C4 dataset<d-cite bibtex-key="raffel2023exploring"></d-cite>, still showed stronger performance (with
439
- the caveat that it is a relatively small dataset for current web-scale standards).</p>
440
  <p>We therefore set out to find new filtering steps that
441
  would, at first, allow us to match the performance of C4 and, at a second stage, surpass it. A natural starting point
442
  was to look into the processing of C4 itself.</p>
@@ -457,8 +454,7 @@
457
  <ul>
458
  <li>applying “All filters” (drop lines not ending on punctuation marks,
459
  mentioning javascript and cookie notices + drop documents outside length thresholds, containing “lorem
460
- ipsum” or a curly bracket, <code>{</code>) allows us to match C4’s HellaSwag performance (purple versus
461
- pink curves).
462
  </li>
463
  </ul>
464
  <ul>
@@ -486,14 +482,13 @@
486
  the next section.</p>
487
  <h4>A statistical approach to develop heuristic filters</h4>
488
  <p>To develop new heuristic filters and select their thresholds we devised a systematic process:</p>
489
- <ol><li>we started by collecting a very large list of high level statistics (over <strong>50</strong>) ranging from common document-level
490
- metrics (e.g. number of lines, avg. line/word length, etc) to inter-document repetition metrics (MassiveText
491
- inspired), on both a high quality and a lower quality web dataset;</li>
492
  <li>we selected the metrics for which the Wasserstein distance between the two distributions (of the metric computed on each dataset) was larger;</li>
493
  <li>we inspected the histograms of the two distributions and empirically chose a threshold that would make the lower quality dataset more closely resemble the higher quality one on this metric;</li>
494
  <li>we validated the resulting filter (metric-threshold pair) by using it on a reference dataset and running small ablations.</li>
495
  </ol>
496
- <p>Due to our assumption that global MinHash greatly upsamples lower quality data in the oldest dumps, we computed metrics on both the independently
497
  MinHashed and the (worse quality) global MinHashed versions of the 2013-48 and 2015-22 crawls (two older crawls). We then compared the
498
  statistics at a macro level, by looking at the distribution of these metrics for each one.</p>
499
  <p>Perhaps not too surprisingly given our findings for deduplication, we found significant
@@ -611,7 +606,7 @@
611
  <img src="https://cdn-uploads.huggingface.co/production/uploads/61c141342aac764ce1654e43/fjZQ4izIj1rx1xQnBTKKr.png" alt="Prompt for LLM annotation" style="width: 90%; max-width: 800px; height: auto;">
612
  <figcaption style="font-style: italic; margin-top: 10px;">Prompt used for Llama3 annotations of the educational score, also available <a href="https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier/blob/main/utils/prompt.txt">here</a>.</figcaption>
613
  </div>
614
- <p>We also experimented with <a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1">Mixtral-8x-7B-Instruct</a> and <a href="https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1">Mixtral-8x22B-Instruct</a> and a jury of all three models<d-cite bibtex-key="verga2024replacing"></d-cite> but found that Llama3 alone gave the most reliable results.</p>
615
  <h3>Classifier Training</h3>
616
  <p>We added a classification head with a single regression output to <a href="https://huggingface.co/Snowflake/snowflake-arctic-embed-m">Snowflake-arctic-embed</a> and trained it on 450,000 Llama 3 annotations for 20 epochs with a learning rate of 3e-4, freezing the embedding and encoder layers. We saved the checkpoint with the highest F1 score on our held-out validation set of 45k samples, treating Llama 3 annotations as ground-truth. After training, we rounded the scores to integers from 0 to 5.</p>
617
  <p>We then converted the problem to a binary classification task by using a fixed threshold to determine if a file is educational. With a threshold of 3, the model achieved an F1 score of 82% on the validation set, indicating strong performance in distinguishing high-quality educational content.</p>
@@ -624,7 +619,8 @@
624
  </figure>
625
  <div id="plot-edu-8k"></div>
626
  </div>
627
- <p>We then built 📚 FineWeb-Edu by filtering out samples with scores lower than 3. This removed 92% of the dataset, leaving us with 1.3 trillion educational tokens. To evaluate the effectiveness of this filtering at a larger scale, we conducted an ablation using a 1.82B model trained on 350 billion tokens, similar to the FineWeb filtering ablation mentioned above:</p>
 
628
  <div class="main-plot-container">
629
  <figure>
630
  <img src="assets/images/edu-100k.png">
 
108
  releases a new crawl containing 200 to 400 TiB of textual content obtained via automatic web crawling usually
109
  every 1 or 2 months. </p>
110
  <p>As an example, the latest CC crawl (April 2024) contains 2.7
111
+ billion web pages, totaling 386 TiB of uncompressed HTML text content<d-footnote>Note that the size changes from crawl to crawl. Note also that we use "dump" and "crawl" interchangeably in this report.</d-footnote>.
112
  Ninety-six crawls have been released since 2013 and 3 crawls from 2008 to 2012, which are in a different (older) format.
113
  <d-footnote>We have not processed these 3 older crawls.</d-footnote> </p>
114
 
 
150
  scores.</p>
151
  <p>Our ablation models were trained using <a
152
  href="https://github.com/huggingface/nanotron"><code>nanotron</code></a> with this config [<strong>TODO:
153
+ INSERT SIMPLIFIED NANOTRON CONFIG HERE</strong>]. Our "ablation models" have 1.82B parameters (including embeddings), used the Llama
154
  architecture with a 2048 sequence length, a global batch size of ~2 million tokens, and the GPT2 tokenizer. For most
155
  ablations we trained on ~28B tokens (roughly the Chinchilla<d-cite bibtex-key="hoffmann2022training"></d-cite> optimal training size for this
156
  model size). To confirm relative performance improvements after each step of filtering we conducted longer training runs on 350 billion tokens as mentioned further below.</p>
 
161
  <ul>
162
  <li>small variance between runs trained on different samplings of the same
163
  dataset: we want our runs on a subset of the data to be representative of the whole dataset, and the
164
+ resulting scores to be, as much as possible, less sensitive to the exact choice of data points than to the filtering ablations we are studying.
165
  </li>
166
  </ul>
167
  <ul>
 
204
  full page HTML and request metadata. <strong>WET</strong> (WARC Encapsulated Text) files provide a text only
205
  version of those websites.</p>
206
  <p>A large number of datasets take the WET files as their
207
+ starting point. In our experience the default text extraction used by Common Crawl to create these WET files is suboptimal for the goals of LLM pretraining<d-footnote>In particular we suspect that it keeps too much boilerplate content and navigation menus.</d-footnote> and there are a variety of open-source libraries that
208
+ provide better text extraction. We extracted
209
+ the text content from the WARC files using the trafilatura library<d-cite bibtex-key="barbaresi-2021-trafilatura"></d-cite>, which from visual inspection of the results provided good quality extraction when compared to other libraries.</p>
210
+ <aside>You can find a benchmark comparing several text extraction libraries <a href="https://github.com/scrapinghub/article-extraction-benchmark/blob/master/README.rst">here</a>.</aside>
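<p>To make this step more concrete, here is a minimal sketch of this kind of extraction (an illustration, not the exact pipeline used to build FineWeb): it reads response records from a WARC file with the <code>warcio</code> library (our choice for this sketch, any WARC reader works) and runs trafilatura on the raw HTML with its precision-favouring option (<code>favor_precision</code>).</p>
<pre><code class="language-python">
# Sketch only: extract main text from the HTML of WARC response records with trafilatura.
# Assumes `pip install warcio trafilatura`; not the exact FineWeb processing code.
import trafilatura
from warcio.archiveiterator import ArchiveIterator

def extract_texts(warc_path):
    with open(warc_path, "rb") as stream:  # works for .warc and .warc.gz
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            # decode roughly for the sketch; real pipelines handle charsets more carefully
            html = record.content_stream().read().decode("utf-8", errors="replace")
            # favor_precision trades a little recall for less boilerplate/navigation text
            text = trafilatura.extract(html, favor_precision=True)
            if text:
                yield record.rec_headers.get_header("WARC-Target-URI"), text
</code></pre>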
211
  <p>To validate this decision, we processed the 2019-18 dump
212
  directly using the WET files and with text extracted from WARC files using trafilatura<d-footnote>We used trafilatura default options with <code>favour_precision=True</code>.</d-footnote>. We applied the same
213
  processing to each one (our base filtering+minhash, detailed below) and trained two models. While the
 
221
  <figure><img src="assets/images/wet_comparison.png"/></figure>
222
  <div id="plot-wet_comparison"></div>
223
  </div>
224
+
225
  <h3>Base filtering</h3>
226
+ <p>Filtering is an important part of the curation process. It consists of
227
+ removing parts of the data (words, lines, or even full documents) that lower the performance of the model and are thus
228
+ deemed “lower quality” in our eval-driven process of dataset crafting.</p>
229
  <p>As a basis for our filtering we used part of the setup
230
  from RefinedWeb<d-cite bibtex-key="penedo2023refinedweb"></d-cite>. Namely, we:</p>
231
  <ul>
 
244
  </li>
245
  </ul>
246
  <p>After applying this filtering to each of the text
247
+ extracted dumps (there are currently 96 dumps) we obtained roughly 36 trillion tokens of data<d-footnote>As everywhere in this report, this is the number of tokens obtained with the <code>gpt2</code> tokenizer.</d-footnote>.</p>
 
248
  <h3>Deduplication</h3>
249
  <p>Deduplication is one of the most important steps when creating large web datasets for LLM pretraining. Methods to deduplicate datasets attempt to identify and remove redundant/repeated data from the dataset. </p>
250
  <h4>Why deduplicate?</h4>
251
  <p>The web has many aggregators, mirrors, templated pages or
252
+ just otherwise repeated content spread over different domains and webpages. Sometimes, these duplicated pages
253
+ can even be introduced by the crawler itself, when different links point to the same page. </p>
254
+ <p>Removing these duplicates (deduplicating) has been correlated with improvements in model performance<d-cite bibtex-key="lee2022deduplicating"></d-cite> and a reduction in memorization of pretraining data<d-cite bibtex-key="carlini2023quantifying"></d-cite>, which might
255
+ allow for better generalization. Additionally, the performance uplift obtained through deduplication can be equated to increased training
256
+ efficiency: by removing duplicated content, a model can reach the same performance level with fewer training iterations – or equivalently, for a given number of training tokens, a model will have seen more diverse data.<d-cite bibtex-key="muennighoff2023scaling"></d-cite><d-cite bibtex-key="hernandez2022scaling"></d-cite></p>
 
257
  <p>There are different ways to identify and even define
258
  duplicated data. Common approaches rely on hashing techniques to speed up the process, or on building
259
  efficient data structures to index the data (like suffix arrays). Methods can also be “fuzzy”, by using some
260
  similarity metric to mark documents as duplicates, or “exact” by checking for exact matches between two
261
+ documents (or lines, paragraphs, or whatever other granularity level is being used)<d-footnote>Note that here, even when we discuss "fuzzy" deduplication, we are only employing methods that operate on character/word matches, i.e. surface-level text. A more complex notion is "semantic" deduplication: comparing/removing texts which relate to the same concepts but use, for instance, synonyms or paraphrasing. We don't discuss these topics here, but note that they can be important in the field of large-scale synthetic data generation, for instance (see our <a href="https://huggingface.co/blog/cosmopedia">Cosmopedia release</a> on this topic).</d-footnote>.</p>
262
+
263
  <h4>Our deduplication parameters</h4>
264
+ <p>Following RefinedWeb<d-cite bibtex-key="penedo2023refinedweb"></d-cite>, we decided to apply MinHash, a
265
+ fuzzy hash-based deduplication technique that scales efficiently to many CPU nodes and allows us to tune similarity thresholds (by controlling the number and size of buckets) as well as the length of the sequences considered (by controlling the n-gram size). We chose to work on 5-grams<d-footnote>Our units are "words", computed in the <a href="https://github.com/huggingface/datatrove/blob/e9963f69f1fbab1a61339bd1b497f6e138b9f47f/src/datatrove/pipeline/dedup/minhash.py#L196">MinHash processing function</a> with a <a href="https://github.com/huggingface/datatrove/blob/e9963f69f1fbab1a61339bd1b497f6e138b9f47f/src/datatrove/utils/word_tokenizers.py#L323">language-specific word tokenizer</a>.</d-footnote> and compute minhashes using
266
  112 hash functions in total, split into 14 buckets of 8 hashes each — targeting documents that are at least
267
  75% similar. Documents with the same 8 minhashes in any bucket are considered a duplicate of each other.</p>
268
  <p>This would mean that for two documents with a similarity ($$s$$)
 
278
  allows for a steeper, more well defined cut off (documents with real similarity near the threshold are more likely to be correctly identified), we believe the compute and storage savings are a reasonable
279
  trade off.</p>
280
  <p>It should also be noted that intra-document deduplication is already handled by our repetition filter, which removes documents with many repeated lines and paragraphs.</p>
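<p>For intuition on how these parameters translate into a probability of being flagged, here is a small worked example (a sketch of the standard MinHash bucketing formula under the parameters above, not code from our pipeline): with 14 buckets of 8 hashes, two documents with a true similarity of $$s$$ are flagged as duplicates with probability $$1-(1-s^8)^{14}$$, roughly 77% at the 75% target similarity.</p>
<pre><code class="language-python">
# Sketch: probability that MinHash flags two documents as duplicates,
# given 14 buckets of 8 hashes each (the parameters used above).
def match_probability(s: float, buckets: int = 14, hashes_per_bucket: int = 8) -> float:
    # a single bucket matches only if all of its 8 hashes agree (probability s**8);
    # the pair is flagged if at least one of the 14 buckets matches
    return 1 - (1 - s**hashes_per_bucket) ** buckets

for s in (0.5, 0.7, 0.75, 0.8, 0.9):
    print(f"similarity {s:.2f} -> flagged with probability {match_probability(s):.2f}")
</code></pre>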
281
+
282
  <h4>More deduplication is always better, right?</h4>
283
+ <p>We started the project with the assumption that <em>more deduplication is always better</em>, so our initial approach was to take the entire dataset (all
284
  90+ dumps) and deduplicate them together as one big dataset using MinHash.</p>
285
  <p>We did this in an iterative manner: starting with the most
286
+ recent dump (which at the time was 2023-50) and proceeding chronologically until we reached the oldest crawl. We deduplicated each dump
287
+ not only within itself, but also removing any documents matching documents in the previously processed
288
  dumps. </p>
289
  <p>For instance, for the second most recent dump (2023-40 at
290
+ the time), we deduplicated it against the most recent one in addition to within itself. As a result, the older a dump was, the more dumps it was deduplicated against and the more data we removed from it (indeed, in the oldest dumps we removed more than 90% of the data in the deduplication step).</p>
 
 
291
  <p>Deduplicating the dataset in this manner resulted in 4
292
+ trillion tokens of data, but, quite surprisingly to us, when training on a randomly sampled 350 billion
293
+ token subset, our ablation models showed no improvement over a model trained on the non-deduplicated data, scoring far below its predecessor RefinedWeb on our aggregate of tasks (see graph below).</p>
 
294
  <div class="main-plot-container">
295
  <figure><img src="assets/images/dedup_all_dumps_bad.png"/></figure>
296
  <div id="plot-all_dumps_bad"></div>
297
  </div>
298
+ <p>This challenged our assumption that more deduplication was always better, so we decided to take a closer look at one of the oldest dumps, dump 2013-48:</p>
 
 
299
  <ul>
300
  <li>pre deduplication, this dump had ~490 billion tokens</li>
301
  </ul>
 
322
  <figure><img src="assets/images/removed_data_cross_dedup.png"/></figure>
323
  <div id="plot-removed_data_dedup"></div>
324
  </div>
325
+ <p>These results show that, for this older dump taken in isolation, the data that was kept (10% of the original data) was actually <em>worse</em> than the 90% of data we
326
+ removed<d-footnote>Note that these ablation models are trained only on data from this dump, so the dump is considered independently of all the other dumps.</d-footnote>. This is also confirmed by visual inspection: <em>originally kept
 
327
  data</em> contains far more ads, lists of keywords and generally badly formatted text than <em>originally removed data</em>.</p>
328
  <h4>Taking a step back: individual dump dedup</h4>
329
+ <p>We decided to experiment with an alternative approach: we deduplicated
330
+ each dump with MinHash individually (independently of the other dumps). This resulted in 20 trillion
331
  tokens of data.</p>
332
  <p>When training on a random sample from this dataset we see
333
+ that it now matches RefinedWeb’s performance (see curves below):</p>
334
  <div class="main-plot-container">
335
  <figure><img src="assets/images/cross_ind_unfiltered_comparison.png"/></figure>
336
  <div id="plot-ind_dedup_better"></div>
337
  </div>
338
  <p>We hypothesize that the main improvement gained from
339
  deduplication is the removal of very large clusters that are present in every single dump (you will find
340
+ some examples of these clusters in the RefinedWeb paper, each containing <em>hundreds of thousands</em> of
341
  documents) and that further deduplication for clusters with a low number of duplicates (less than ~100 i.e. the number
342
  of dumps) actually harms performance: data that does not find a duplicate match in any other dump might
343
  actually be worse quality/more out of distribution (as evidenced by the results on the 2013-48 data). </p>
 
348
  improves, this effect may not be as prevalent, since the filtering might be able to remove some of this
349
  lower quality data. We also experimented with applying different, and often “lighter”, deduplication
350
  approaches on top of the individually deduplicated dumps. You can read about them further below.</p>
351
+
352
+ <h4>A note on measuring the effect of deduplication</h4>
353
  <p>Given the nature of deduplication, its effect is not
354
  always very visible in a smaller slice of the dataset (such as 28B tokens, the size we used for our
355
  filtering ablations). Furthermore, one must consider the fact that there are specific effects at play when
 
362
  </ul>
363
  <ul>
364
  <li>each dump has been perfectly individually deduplicated (every single
365
+ document in it is unique in this dump)
366
  </li>
367
  </ul>
368
  <ul>
 
397
  documents duplicated up to 8 times. This simulation illustrates the inherent difficulties associated with
398
  measuring deduplication impact on the training of LLMs, once the biggest duplicate clusters have been
399
  removed.</p>
400
+
401
  <h4>Other (failed) global approaches</h4>
402
+ <p>To build on top of our newly found method (independently deduplicating each dump), we attempted to further improve performance by deduplicating the
403
+ independently MinHash-deduplicated 20 trillion tokens of data globally (over all dumps). We explored the following methods:</p>
404
  <ul>
405
  <li>URL deduplication, where we only kept one document per normalized
406
  (lowercased) URL (71.5% of tokens removed, 5.6T left) — <em>FineWeb URL dedup</em></li>
 
430
  <figure><img src="assets/images/dedup_attempts.png"/></figure>
431
  <div id="plot-dedup_attempts"></div>
432
  </div>
433
+
434
  <h3>Additional filtering</h3>
435
+ <p>By this point we had reached the same performance as the previous work we set out to reproduce and extend:
436
+ RefinedWeb, using our base filtering and independent MinHash. However, on our aggregate of tasks, another heavily filtered dataset, the C4 dataset<d-cite bibtex-key="raffel2023exploring"></d-cite>, still showed stronger performance on some benchmarks of our evaluation suite.</p>
 
437
  <p>We therefore set out to find new filtering steps that
438
  would, at first, allow us to match the performance of C4 and, at a second stage, surpass it. A natural starting point
439
  was to look into the processing of C4 itself.</p>
 
454
  <ul>
455
  <li>applying “All filters” (drop lines not ending on punctuation marks,
456
  mentioning javascript and cookie notices + drop documents outside length thresholds, containing “lorem
457
+ ipsum” or a curly bracket, <code>{</code>) allows us to match C4’s HellaSwag performance (“All filters” versus “C4” curves).
 
458
  </li>
459
  </ul>
460
  <ul>
 
482
  the next section.</p>
483
  <h4>A statistical approach to develop heuristic filters</h4>
484
  <p>To develop new heuristic filters and select their thresholds we devised a systematic process:</p>
485
+ <ol><li>we started by collecting a very large list of high-level statistics of our datasets (over <strong>fifty</strong> different metrics) ranging from common document-level
486
+ metrics (e.g. number of lines, avg. line/word length, etc.) to inter-document repetition metrics (inspired by MassiveText), on both a high quality and a lower quality web dataset (see the sketch after this list);</li>
 
487
  <li>we selected the metrics for which the Wasserstein distance between the two distributions (of the metric computed on each dataset) was larger;</li>
488
  <li>we inspected the histograms of the two distributions and empirically chose a threshold that would make the lower quality dataset more closely resemble the higher quality one on this metric;</li>
489
  <li>we validated the resulting filter (metric-threshold pair) by using it on a reference dataset and running small ablations.</li>
490
  </ol>
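<p>As an illustration of steps 1–3 (a sketch with simulated metric values standing in for real per-document statistics, not our actual tooling), a candidate metric can be compared across the two datasets with <code>scipy.stats.wasserstein_distance</code>, and a threshold read off the two distributions:</p>
<pre><code class="language-python">
# Sketch of the statistical filter-development loop described above,
# using simulated values for one hypothetical metric (e.g. average word length).
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
# stand-ins for the metric computed on every document of each dataset
high_quality = rng.normal(loc=4.8, scale=0.6, size=10_000)
low_quality = np.concatenate([
    rng.normal(loc=4.8, scale=0.6, size=8_000),  # mostly similar documents...
    rng.normal(loc=7.5, scale=1.0, size=2_000),  # ...plus a tail of keyword-stuffed pages
])

# step 2: rank candidate metrics by how much their distributions differ
print("Wasserstein distance:", wasserstein_distance(high_quality, low_quality))

# step 3: choose a threshold that makes the lower quality distribution
# resemble the higher quality one (here an upper bound on the metric)
threshold = np.quantile(high_quality, 0.99)
dropped = low_quality[low_quality > threshold]
print(f"threshold={threshold:.2f}, would drop {dropped.size}/{low_quality.size} documents")
</code></pre>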
491
+ <p>Due to our (new) assumption that global MinHash greatly upsamples lower quality data in the oldest dumps, we computed metrics on both the independently
492
  MinHashed and the (worse quality) global MinHashed versions of the 2013-48 and 2015-22 crawls (two older crawls). We then compared the
493
  statistics at a macro level, by looking at the distribution of these metrics for each one.</p>
494
  <p>Perhaps not too surprisingly given our findings for deduplication, we found significant
 
606
  <img src="https://cdn-uploads.huggingface.co/production/uploads/61c141342aac764ce1654e43/fjZQ4izIj1rx1xQnBTKKr.png" alt="Prompt for LLM annotation" style="width: 90%; max-width: 800px; height: auto;">
607
  <figcaption style="font-style: italic; margin-top: 10px;">Prompt used for Llama3 annotations of the educational score, also available <a href="https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier/blob/main/utils/prompt.txt">here</a>.</figcaption>
608
  </div>
609
+ <p>We also experimented with <a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1">Mixtral-8x7B-Instruct</a> and <a href="https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1">Mixtral-8x22B-Instruct</a> and a jury of all three models<d-cite bibtex-key="verga2024replacing"></d-cite> but found that Llama3 alone gave the most reliable results.</p>
610
  <h3>Classifier Training</h3>
611
  <p>We added a classification head with a single regression output to <a href="https://huggingface.co/Snowflake/snowflake-arctic-embed-m">Snowflake-arctic-embed</a> and trained it on 450,000 Llama 3 annotations for 20 epochs with a learning rate of 3e-4, freezing the embedding and encoder layers. We saved the checkpoint with the highest F1 score on our held-out validation set of 45k samples, treating Llama 3 annotations as ground-truth. After training, we rounded the scores to integers from 0 to 5.</p>
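<p>As a rough sketch of this setup (using the generic <code>transformers</code> sequence-classification head as a stand-in for our exact training script), one can load <a href="https://huggingface.co/Snowflake/snowflake-arctic-embed-m">Snowflake-arctic-embed</a> with a single regression output and freeze everything except the new head:</p>
<pre><code class="language-python">
# Sketch: single-output regression head on top of a frozen embedding model.
# Illustration with the generic transformers API, not the exact training code.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "Snowflake/snowflake-arctic-embed-m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=1 gives a single-logit head, treated as a regression target
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)

# freeze the embedding and encoder layers; only the classification head is trained
for param in model.base_model.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
</code></pre>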
612
  <p>We then converted the problem to a binary classification task by using a fixed threshold to determine if a file is educational. With a threshold of 3, the model achieved an F1 score of 82% on the validation set, indicating strong performance in distinguishing high-quality educational content.</p>
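<p>For reference, the released <a href="https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier">fineweb-edu-classifier</a> checkpoint can be used to reproduce this score-then-threshold filtering on individual documents; a minimal sketch with toy example texts and the threshold of 3 discussed above:</p>
<pre><code class="language-python">
# Sketch: score documents with the FineWeb-Edu classifier and keep those
# whose rounded educational score is at least 3, as described above.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "HuggingFaceFW/fineweb-edu-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

docs = [
    "Photosynthesis is the process by which plants convert light energy into chemical energy...",
    "BUY CHEAP WATCHES!!! best deals, click here now",
]

with torch.no_grad():
    inputs = tokenizer(docs, return_tensors="pt", padding=True, truncation=True)
    scores = model(**inputs).logits.squeeze(-1)  # one regression output per document

kept = [doc for doc, score in zip(docs, scores.tolist()) if round(score) >= 3]
print([round(s, 2) for s in scores.tolist()], kept)
</code></pre>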
 
619
  </figure>
620
  <div id="plot-edu-8k"></div>
621
  </div>
622
+ <p><strong>Note:</strong> this ablation was conducted on 8B tokens from the 2024-10 dump for both the FineWeb and FineWeb-Edu subsets, which might not be representative of the entire dataset. The next ablation shows that the findings for threshold 3 hold on a longer run of 350B tokens from all FineWeb dumps, except for HellaSwag, where we noticed a slight performance degradation.</p>
623
+ <p>We built 📚 FineWeb-Edu by filtering out samples with scores lower than 3. This removed 92% of the dataset, leaving us with 1.3 trillion educational tokens. To evaluate the effectiveness of this filtering at a larger scale, we conducted an ablation using a 1.82B model trained on 350 billion tokens, similar to the FineWeb filtering ablation mentioned above:</p>
624
  <div class="main-plot-container">
625
  <figure>
626
  <img src="assets/images/edu-100k.png">
src/index.html CHANGED
@@ -108,7 +108,7 @@
108
  releases a new crawl containing 200 to 400 TiB of textual content obtained via automatic web crawling usually
109
  every 1 or 2 months. </p>
110
  <p>As an example, the latest CC crawl (April 2024) contains 2.7
111
- billion web pages, totaling 386 TiB of uncompressed HTML text content<d-footnote>Note that the size changes from crawl to crawl</d-footnote>.
112
  Ninety-six crawls have been released since 2013 and 3 crawls from 2008 to 2012, which are in a different (older) format.
113
  <d-footnote>We have not processed these 3 older crawls.</d-footnote> </p>
114
 
@@ -150,7 +150,7 @@
150
  scores.</p>
151
  <p>Our ablation models were trained using <a
152
  href="https://github.com/huggingface/nanotron"><code>nanotron</code></a> with this config [<strong>TODO:
153
- INSERT SIMPLIFIED NANOTRON CONFIG HERE</strong>]. Our ablation models have 1.82B parameters (including embeddings), used the Llama
154
  architecture with a 2048 sequence length, a global batch size of ~2 million tokens, and the GPT2 tokenizer. For most
155
  ablations we trained on ~28B tokens (roughly the Chinchilla<d-cite bibtex-key="hoffmann2022training"></d-cite> optimal training size for this
156
  model size). To confirm relative performance improvements after each step of filtering we conducted longer training runs on 350 billion tokens as mentioned further below.</p>
@@ -161,7 +161,7 @@
161
  <ul>
162
  <li>small variance between runs trained on different samplings of the same
163
  dataset: we want our runs on a subset of the data to be representative of the whole dataset, and the
164
- resulting scores to have as little sensitivity to exact data choice as possible (apart from larger ablations that we are concerned with)
165
  </li>
166
  </ul>
167
  <ul>
@@ -204,10 +204,10 @@
204
  full page HTML and request metadata. <strong>WET</strong> (WARC Encapsulated Text) files provide a text only
205
  version of those websites.</p>
206
  <p>A large number of datasets take the WET files as their
207
- starting point. In our experience the default text extraction (extracting the main text of a webpage from
208
- its HTML) used to create these WET files is suboptimal and there are a variety of open-source libraries that
209
- provide better text extraction (by, namely, keeping less boilerplate content/navigation menus). We extracted
210
- the text content from the WARC files using the trafilatura library<d-cite bibtex-key="barbaresi-2021-trafilatura"></d-cite>, which from visual inspection of the results provided good quality extraction when compared to other libraries.</p><aside>You can also find a benchmark on text extraction libraries <a href="https://github.com/scrapinghub/article-extraction-benchmark/blob/master/README.rst">here</a>.</aside>
211
  <p>To validate this decision, we processed the 2019-18 dump
212
  directly using the WET files and with text extracted from WARC files using trafilatura<d-footnote>We used trafilatura default options with <code>favour_precision=True</code>.</d-footnote>. We applied the same
213
  processing to each one (our base filtering+minhash, detailed below) and trained two models. While the
@@ -221,10 +221,11 @@
221
  <figure><img src="assets/images/wet_comparison.png"/></figure>
222
  <div id="plot-wet_comparison"></div>
223
  </div>
 
224
  <h3>Base filtering</h3>
225
- <p>Filtering is an important part of the curation process. It
226
- removes part of the data (be it words, lines, or full documents) that would harm performance and is thus
227
- deemed to be “lower quality”.</p>
228
  <p>As a basis for our filtering we used part of the setup
229
  from RefinedWeb<d-cite bibtex-key="penedo2023refinedweb"></d-cite>. Namely, we:</p>
230
  <ul>
@@ -243,26 +244,25 @@
243
  </li>
244
  </ul>
245
  <p>After applying this filtering to each of the text
246
- extracted dumps (there are currently 96 dumps) we obtained roughly 36 trillion tokens of data (when
247
- tokenized with the <code>gpt2</code> tokenizer).</p>
248
  <h3>Deduplication</h3>
249
  <p>Deduplication is one of the most important steps when creating large web datasets for LLM pretraining. Methods to deduplicate datasets attempt to identify and remove redundant/repeated data from the dataset. </p>
250
  <h4>Why deduplicate?</h4>
251
  <p>The web has many aggregators, mirrors, templated pages or
252
- just otherwise repeated content spread over different domains and webpages. Often, these duplicated pages
253
- can be introduced by the crawler itself, when different links point to the same page. </p>
254
- <p>Removing these duplicates (deduplicating) has been linked to an improvement in model performance<d-cite bibtex-key="lee2022deduplicating"></d-cite> and a reduction in memorization of pretraining data<d-cite bibtex-key="carlini2023quantifying"></d-cite>, which might
255
- allow for better generalization. Additionally, the performance uplift obtained through deduplication can also be tied to increased training
256
- efficiency: by removing duplicated content, for the same number of training tokens, a model will have seen
257
- more diverse data.<d-cite bibtex-key="muennighoff2023scaling"></d-cite><d-cite bibtex-key="hernandez2022scaling"></d-cite></p>
258
  <p>There are different ways to identify and even define
259
  duplicated data. Common approaches rely on hashing techniques to speed up the process, or on building
260
  efficient data structures to index the data (like suffix arrays). Methods can also be “fuzzy”, by using some
261
  similarity metric to mark documents as duplicates, or “exact” by checking for exact matches between two
262
- documents (or lines, paragraphs, or whatever other granularity level being used).</p>
 
263
  <h4>Our deduplication parameters</h4>
264
- <p>Similarly to RefinedWeb<d-cite bibtex-key="penedo2023refinedweb"></d-cite>, we decided to apply MinHash, a
265
- fuzzy hash based deduplication technique that scales well and allows us to tune similarity thresholds (by changing the number and size of buckets) and the granularity of the matches (by changing the n-gram size). We chose to compute minhashes on each document’s 5-grams, using
266
  112 hash functions in total, split into 14 buckets of 8 hashes each — targeting documents that are at least
267
  75% similar. Documents with the same 8 minhashes in any bucket are considered a duplicate of each other.</p>
268
  <p>This would mean that for two documents with a similarity ($$s$$)
@@ -278,28 +278,24 @@
278
  allows for a steeper, more well defined cut off (documents with real similarity near the threshold are more likely to be correctly identified), we believe the compute and storage savings are a reasonable
279
  trade off.</p>
280
  <p>It should also be noted that intra-document deduplication is already handled by our repetition filter, which removes documents with many repeated lines and paragraphs.</p>
 
281
  <h4>More deduplication is always better, right?</h4>
282
- <p>Our initial approach was to take the entire dataset (all
283
  90+ dumps) and deduplicate them together as one big dataset using MinHash.</p>
284
  <p>We did this in an iterative manner: starting with the most
285
- recent dump (which at the time was 2023-50) and proceeding chronologically until the oldest one, we would deduplicate each dump
286
- not only within itself, but we would also remove any matches with documents from the previously processed (more recent)
287
  dumps. </p>
288
  <p>For instance, for the second most recent dump (2023-40 at
289
- the time), we deduplicated it against the most recent one in addition to within itself. In particular, the oldest
290
- dump was deduplicated against all other dumps. As a result, more data was removed from the oldest dumps (last
291
- to be deduplicated) than from the most recent ones.</p>
292
  <p>Deduplicating the dataset in this manner resulted in 4
293
- trillion tokens of data, but, quite surprisingly for us, when training on a randomly sampled 350 billion
294
- tokens subset, the model showed no improvement over one trained on the non deduplicated data (see orange and
295
- green curves below), scoring far below its predecessor RefinedWeb on our aggregate of tasks.</p>
296
  <div class="main-plot-container">
297
  <figure><img src="assets/images/dedup_all_dumps_bad.png"/></figure>
298
  <div id="plot-all_dumps_bad"></div>
299
  </div>
300
- <p>This was quite puzzling as our intuition regarding web
301
- data was that more deduplication would always result in improved performance. We decided to take a closer
302
- look at one of the oldest dumps, dump 2013-48:</p>
303
  <ul>
304
  <li>pre deduplication, this dump had ~490 billion tokens</li>
305
  </ul>
@@ -326,23 +322,22 @@
326
  <figure><img src="assets/images/removed_data_cross_dedup.png"/></figure>
327
  <div id="plot-removed_data_dedup"></div>
328
  </div>
329
- <p>These results show that, for this older dump from which we had
330
- removed over 90% of the original data, the data that was kept was actually <em>worse</em> than the data
331
- removed (considered independently of all the other dumps). This is also confirmed by visual inspection: <em>originally kept
332
  data</em> contains far more ads, lists of keywords and generally badly formatted text than <em>originally removed data</em>.</p>
333
  <h4>Taking a step back: individual dump dedup</h4>
334
- <p>We then tried an alternative approach: we deduplicated
335
- each dump with MinHash individually (without considering the other dumps). This resulted in 20 trillion
336
  tokens of data.</p>
337
  <p>When training on a random sample from this dataset we see
338
- that it now matches RefinedWeb’s performance (blue and red curves below):</p>
339
  <div class="main-plot-container">
340
  <figure><img src="assets/images/cross_ind_unfiltered_comparison.png"/></figure>
341
  <div id="plot-ind_dedup_better"></div>
342
  </div>
343
  <p>We hypothesize that the main improvement gained from
344
  deduplication is the removal of very large clusters that are present in every single dump (you will find
345
- some examples of these clusters on the RefinedWeb paper, each containing <em>hundreds of thousands</em> of
346
  documents) and that further deduplication for clusters with a low number of duplicates (less than ~100 i.e. the number
347
  of dumps) actually harms performance: data that does not find a duplicate match in any other dump might
348
  actually be worse quality/more out of distribution (as evidenced by the results on the 2013-48 data). </p>
@@ -353,7 +348,8 @@
353
  improves, this effect may not be as prevalent, since the filtering might be able to remove some of this
354
  lower quality data. We also experimented with applying different, and often “lighter”, deduplication
355
  approaches on top of the individually deduplicated dumps. You can read about them further below.</p>
356
- <h4>A note on measuring the effect of deduplication</h4>
 
357
  <p>Given the nature of deduplication, its effect is not
358
  always very visible in a smaller slice of the dataset (such as 28B tokens, the size we used for our
359
  filtering ablations). Furthermore, one must consider the fact that there are specific effects at play when
@@ -366,7 +362,7 @@
366
  </ul>
367
  <ul>
368
  <li>each dump has been perfectly individually deduplicated (every single
369
- document in it is unique)
370
  </li>
371
  </ul>
372
  <ul>
@@ -401,9 +397,10 @@
401
  documents duplicated up to 8 times. This simulation illustrates the inherent difficulties associated with
402
  measuring deduplication impact on the training of LLMs, once the biggest duplicate clusters have been
403
  removed.</p>
 
404
  <h4>Other (failed) global approaches</h4>
405
- <p>We attempted to improve the performance of the
406
- independently minhash deduped 20 trillion tokens of data by further deduplicating it (globally, over all dumps) with the following methods:</p>
407
  <ul>
408
  <li>URL deduplication, where we only kept one document per normalized
409
  (lowercased) URL (71.5% of tokens removed, 5.6T left) — <em>FineWeb URL dedup</em></li>
@@ -433,10 +430,10 @@
433
  <figure><img src="assets/images/dedup_attempts.png"/></figure>
434
  <div id="plot-dedup_attempts"></div>
435
  </div>
 
436
  <h3>Additional filtering</h3>
437
- <p>By this point we had reached the same performance as
438
- RefinedWeb with base filtering + independent MinHash, but on our aggregate of tasks, another heavily filtered dataset, the C4 dataset<d-cite bibtex-key="raffel2023exploring"></d-cite>, still showed stronger performance (with
439
- the caveat that it is a relatively small dataset for current web-scale standards).</p>
440
  <p>We therefore set out to find new filtering steps that
441
  would, at first, allow us to match the performance of C4 and, at a second stage, surpass it. A natural starting point
442
  was to look into the processing of C4 itself.</p>
@@ -457,8 +454,7 @@
457
  <ul>
458
  <li>applying “All filters” (drop lines not ending on punctuation marks,
459
  mentioning javascript and cookie notices + drop documents outside length thresholds, containing “lorem
460
- ipsum” or a curly bracket, <code>{</code>) allows us to match C4’s HellaSwag performance (purple versus
461
- pink curves).
462
  </li>
463
  </ul>
464
  <ul>
@@ -486,14 +482,13 @@
486
  the next section.</p>
487
  <h4>A statistical approach to develop heuristic filters</h4>
488
  <p>To develop new heuristic filters and select their thresholds we devised a systematic process:</p>
489
- <ol><li>we started by collecting a very large list of high level statistics (over <strong>50</strong>) ranging from common document-level
490
- metrics (e.g. number of lines, avg. line/word length, etc) to inter-document repetition metrics (MassiveText
491
- inspired), on both a high quality and a lower quality web dataset;</li>
492
  <li>we selected the metrics for which the Wasserstein distance between the two distributions (of the metric computed on each dataset) was larger;</li>
493
  <li>we inspected the histograms of the two distributions and empirically chose a threshold that would make the lower quality dataset more closely resemble the higher quality one on this metric;</li>
494
  <li>we validated the resulting filter (metric-threshold pair) by using it on a reference dataset and running small ablations.</li>
495
  </ol>
496
- <p>Due to our assumption that global MinHash greatly upsamples lower quality data in the oldest dumps, we computed metrics on both the independently
497
  MinHashed and the (worse quality) global MinHashed versions of the 2013-48 and 2015-22 crawls (two older crawls). We then compared the
498
  statistics at a macro level, by looking at the distribution of these metrics for each one.</p>
499
  <p>Perhaps not too surprisingly given our findings for deduplication, we found significant
 
108
  releases a new crawl containing 200 to 400 TiB of textual content obtained via automatic web crawling usually
109
  every 1 or 2 months. </p>
110
  <p>As an example, the latest CC crawl (April 2024) contains 2.7
111
+ billion web pages, totaling 386 TiB of uncompressed HTML text content<d-footnote>Note that the size changes from crawl to crawl. Note also that we use "dump" and "crawl" interchangeably in this report.</d-footnote>.
112
  Ninety-six crawls have been released since 2013 and 3 crawls from 2008 to 2012, which are in a different (older) format.
113
  <d-footnote>We have not processed these 3 older crawls.</d-footnote> </p>
114
 
 
150
  scores.</p>
151
  <p>Our ablation models were trained using <a
152
  href="https://github.com/huggingface/nanotron"><code>nanotron</code></a> with this config [<strong>TODO:
153
+ INSERT SIMPLIFIED NANOTRON CONFIG HERE</strong>]. Our "ablation models" have 1.82B parameters (including embeddings), used the Llama
154
  architecture with a 2048 sequence length, a global batch size of ~2 million tokens, and the GPT2 tokenizer. For most
155
  ablations we trained on ~28B tokens (roughly the Chinchilla<d-cite bibtex-key="hoffmann2022training"></d-cite> optimal training size for this
156
  model size). To confirm relative performance improvements after each step of filtering we conducted longer training runs on 350 billion tokens as mentioned further below.</p>
 
161
  <ul>
162
  <li>small variance between runs trained on different samplings of the same
163
  dataset: we want our runs on a subset of the data to be representative of the whole dataset, and the
164
+ resulting scores to be, as much as possible, less sensitive to the exact choice of data points than to the filtering ablations we are studying.
165
  </li>
166
  </ul>
167
  <ul>
 
204
  full page HTML and request metadata. <strong>WET</strong> (WARC Encapsulated Text) files provide a text only
205
  version of those websites.</p>
206
  <p>A large number of datasets take the WET files as their
207
+ starting point. In our experience the default text extraction used by Common Crawl to create these WET files is suboptimal for the goals of LLM pretraining<d-footnote>In particular we suspect that it keeps too much boilerplate content and navigation menus.</d-footnote> and there are a variety of open-source libraries that
208
+ provide better text extraction. We extracted
209
+ the text content from the WARC files using the trafilatura library<d-cite bibtex-key="barbaresi-2021-trafilatura"></d-cite>, which from visual inspection of the results provided good quality extraction when compared to other libraries.</p>
210
+ <aside>You can find a benchmark comparing several text extraction libraries <a href="https://github.com/scrapinghub/article-extraction-benchmark/blob/master/README.rst">here</a>.</aside>
211
  <p>To validate this decision, we processed the 2019-18 dump
212
  directly using the WET files and with text extracted from WARC files using trafilatura<d-footnote>We used trafilatura default options with <code>favour_precision=True</code>.</d-footnote>. We applied the same
213
  processing to each one (our base filtering+minhash, detailed below) and trained two models. While the
 
221
  <figure><img src="assets/images/wet_comparison.png"/></figure>
222
  <div id="plot-wet_comparison"></div>
223
  </div>
224
+
225
  <h3>Base filtering</h3>
226
+ <p>Filtering is an important part of the curation process. It consists of
227
+ removing parts of the data (words, lines, or even full documents) that lower the performance of the model and are thus
228
+ deemed “lower quality” in our eval-driven process of dataset crafting.</p>
229
  <p>As a basis for our filtering we used part of the setup
230
  from RefinedWeb<d-cite bibtex-key="penedo2023refinedweb"></d-cite>. Namely, we:</p>
231
  <ul>
 
244
  </li>
245
  </ul>
246
  <p>After applying this filtering to each of the text
247
+ extracted dumps (there are currently 96 dumps) we obtained roughly 36 trillion tokens of data<d-footnote>As everywhere in this report, this is the number of tokens obtained with the <code>gpt2</code> tokenizer.</d-footnote>.</p>
 
248
  <h3>Deduplication</h3>
249
  <p>Deduplication is one of the most important steps when creating large web datasets for LLM pretraining. Methods to deduplicate datasets attempt to identify and remove redundant/repeated data from the dataset. </p>
250
  <h4>Why deduplicate?</h4>
251
  <p>The web has many aggregators, mirrors, templated pages or
252
+ just otherwise repeated content spread over different domains and webpages. Sometimes, these duplicated pages
253
+ can even be introduced by the crawler itself, when different links point to the same page. </p>
254
+ <p>Removing these duplicates (deduplicating) has been correlated with improvements in model performance<d-cite bibtex-key="lee2022deduplicating"></d-cite> and a reduction in memorization of pretraining data<d-cite bibtex-key="carlini2023quantifying"></d-cite>, which might
255
+ allow for better generalization. Additionally, the performance uplift obtained through deduplication can be equated to increased training
256
+ efficiency: by removing duplicated content, a model can reach the same performance level with fewer training iterations – or equivalently, for a given number of training tokens, a model will have seen more diverse data.<d-cite bibtex-key="muennighoff2023scaling"></d-cite><d-cite bibtex-key="hernandez2022scaling"></d-cite></p>
 
257
  <p>There are different ways to identify and even define
258
  duplicated data. Common approaches rely on hashing techniques to speed up the process, or on building
259
  efficient data structures to index the data (like suffix arrays). Methods can also be “fuzzy”, by using some
260
  similarity metric to mark documents as duplicates, or “exact” by checking for exact matches between two
261
+ documents (or lines, paragraphs, or whatever other granularity level is being used)<d-footnote>Note that here, even when we discuss "fuzzy" deduplication, we are only employing methods that operate on character/word matches, i.e. surface-level text. A more complex notion is "semantic" deduplication: comparing/removing texts that cover the same concepts but use, for instance, synonyms or paraphrasing. We don't discuss these topics here, but note that they can be important in the field of large-scale synthetic data generation, for instance (see our <a href="https://huggingface.co/blog/cosmopedia">Cosmopedia release</a> on this topic)</d-footnote>.</p>
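  <p>As a simple illustration of the “exact” end of this spectrum, the toy sketch below (plain Python, not part of our actual pipeline) drops documents whose full text is byte-for-byte identical to one already seen:</p>
  <pre><code class="language-python">import hashlib

def exact_document_dedup(documents):
    """Keep only the first occurrence of each exactly identical document."""
    seen_digests = set()
    kept = []
    for doc in documents:
        digest = hashlib.sha1(doc.encode("utf-8")).hexdigest()
        if digest not in seen_digests:
            seen_digests.add(digest)
            kept.append(doc)
    return kept

docs = ["same boilerplate page", "same boilerplate page", "a unique page"]
print(exact_document_dedup(docs))  # keeps only one copy of the duplicated page
</code></pre>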
262
+
263
  <h4>Our deduplication parameters</h4>
264
+ <p>Following RefinedWeb<d-cite bibtex-key="penedo2023refinedweb"></d-cite>, we decided to apply MinHash, a
265
+ fuzzy, hash-based deduplication technique that scales efficiently to many CPU nodes and allows us to tune similarity thresholds (by controlling the number and size of buckets) as well as the length of the sequences considered (by controlling the n-gram size). We chose to work with 5-grams<d-footnote>Our units are "words", computed in the <a href="https://github.com/huggingface/datatrove/blob/e9963f69f1fbab1a61339bd1b497f6e138b9f47f/src/datatrove/pipeline/dedup/minhash.py#L196">MinHash processing function</a> with a <a href="https://github.com/huggingface/datatrove/blob/e9963f69f1fbab1a61339bd1b497f6e138b9f47f/src/datatrove/utils/word_tokenizers.py#L323">language-specific word tokenizer</a>.</d-footnote> and compute minhashes using
266
  112 hash functions in total, split into 14 buckets of 8 hashes each — targeting documents that are at least
267
  75% similar. Documents with the same 8 minhashes in any bucket are considered a duplicate of each other.</p>
268
  <p>This would mean that for two documents with a similarity ($$s$$)
 
278
  allows for a steeper, better defined cut-off (documents with real similarity near the threshold are more likely to be correctly identified), we believe the compute and storage savings are a reasonable
279
  trade-off.</p>
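  <p>To make the effect of these bucketing parameters concrete, the short numerical sketch below computes the standard MinHash-LSH probability that two documents with a given similarity share all 8 minhashes in at least one of the 14 buckets (a plain illustration of the scheme described above, not code from our pipeline):</p>
  <pre><code class="language-python">HASHES_PER_BUCKET = 8
NUM_BUCKETS = 14

def detection_probability(similarity: float) -> float:
    # probability that all 8 hashes match in at least one of the 14 buckets
    return 1 - (1 - similarity**HASHES_PER_BUCKET) ** NUM_BUCKETS

for s in (0.70, 0.75, 0.80, 0.85):
    # roughly 56%, 77%, 92% and 99% respectively
    print(f"similarity {s:.2f}: flagged as duplicates with probability {detection_probability(s):.1%}")
</code></pre>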
280
  <p>It should also be noted that intra-document deduplication is already handled by our repetition filter, which removes documents with many repeated lines and paragraphs.</p>
281
+
282
  <h4>More deduplication is always better, right?</h4>
283
+ <p>We started the project with the assumption that <em>more deduplication is always better</em>, so our initial approach was to take the entire dataset (all
284
  90+ dumps) and deduplicate them together as one big dataset using MinHash.</p>
285
  <p>We did this in an iterative manner: starting with the most
286
+ recent dump (which at the time was 2023-50) and proceeding backwards in time until we reached the oldest crawl. We deduplicated each dump
287
+ not only within itself, but also removed any document matching a document in any of the previously processed
288
  dumps. </p>
289
  <p>For instance, for the second most recent dump (2023-40 at
290
+ the time), we deduplicated it against the most recent one in addition to deduplicating it within itself. As a result, the older a dump was, the higher the number of dumps it was deduplicated against and the more data we removed from it (indeed, in the oldest dumps we removed more than 90% of the data in this deduplication step).</p>
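  <p>For clarity, the pseudocode-style sketch below illustrates the control flow of this iterative, cross-dump deduplication (the real implementation operates on distributed MinHash bucket signatures rather than on in-memory Python sets):</p>
  <pre><code class="language-python">def iterative_cross_dump_dedup(dumps_newest_first, bucket_signatures):
    """dumps_newest_first: iterable of (dump_name, documents), most recent dump first.
    bucket_signatures(doc): returns the set of MinHash bucket signatures of a document."""
    seen = set()  # signatures from this dump and all more recent (already processed) dumps
    deduped = {}
    for dump_name, documents in dumps_newest_first:
        kept = []
        for doc in documents:
            signatures = bucket_signatures(doc)
            # a document is dropped if any of its bucket signatures was already seen,
            # whether within the current dump or in a previously processed dump
            if seen.isdisjoint(signatures):
                kept.append(doc)
            seen.update(signatures)
        deduped[dump_name] = kept
    return deduped
</code></pre>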
 
 
291
  <p>Deduplicating the dataset in this manner resulted in 4
292
+ trillion tokens of data, but, quite surprisingly to us, when training on a randomly sampled 350 billion
293
+ token subset, our ablation models showed no improvement over a model trained on the non-deduplicated data, and scored far below RefinedWeb on our aggregate of tasks (see graph below).</p>
 
294
  <div class="main-plot-container">
295
  <figure><img src="assets/images/dedup_all_dumps_bad.png"/></figure>
296
  <div id="plot-all_dumps_bad"></div>
297
  </div>
298
+ <p>This challenged our assumption that more deduplication was always better, so we decided to take a closer look at one of the oldest dumps, dump 2013-48:</p>
 
 
299
  <ul>
300
  <li>pre deduplication, this dump had ~490 billion tokens</li>
301
  </ul>
 
322
  <figure><img src="assets/images/removed_data_cross_dedup.png"/></figure>
323
  <div id="plot-removed_data_dedup"></div>
324
  </div>
325
+ <p>These results show that, for this older dump taken in isolation, the data that was kept (10% of the original data) was actually <em>worse</em> than the 90% of data we
326
+ removed<d-footnote>Note that these ablation models are trained only on data from this dump, which is thus considered independently of all the other dumps.</d-footnote>. This is also confirmed by visual inspection: <em>originally kept
 
327
  data</em> contains far more ads, lists of keywords and generally badly formatted text than <em>originally removed data</em>.</p>
328
  <h4>Taking a step back: individual dump dedup</h4>
329
+ <p>We decided to experiment with alternative approaches: we deduplicated
330
+ each dump with MinHash individually (independently of the other dumps). This resulted in 20 trillion
331
  tokens of data.</p>
332
  <p>When training on a random sample from this dataset we see
333
+ that it now matches RefinedWeb’s performance (see curves below):</p>
334
  <div class="main-plot-container">
335
  <figure><img src="assets/images/cross_ind_unfiltered_comparison.png"/></figure>
336
  <div id="plot-ind_dedup_better"></div>
337
  </div>
338
  <p>We hypothesize that the main improvement gained from
339
  deduplication is the removal of very large clusters that are present in every single dump (you will find
340
+ some examples of these clusters in the RefinedWeb paper, each containing <em>hundreds of thousands</em> of
341
  documents) and that further deduplication for clusters with a low number of duplicates (fewer than ~100, i.e. the number
342
  of dumps) actually harms performance: data that does not find a duplicate match in any other dump might
343
  actually be worse quality/more out of distribution (as evidenced by the results on the 2013-48 data). </p>
 
348
  improves, this effect may not be as prevalent, since the filtering might be able to remove some of this
349
  lower quality data. We also experimented with applying different, and often “lighter”, deduplication
350
  approaches on top of the individually deduplicated dumps. You can read about them further below.</p>
351
+
352
+ <h4>A note on measuring the effect of deduplication</h4>
353
  <p>Given the nature of deduplication, its effect is not
354
  always very visible in a smaller slice of the dataset (such as 28B tokens, the size we used for our
355
  filtering ablations). Furthermore, one must consider the fact that there are specific effects at play when
 
362
  </ul>
363
  <ul>
364
  <li>each dump has been perfectly individually deduplicated (every single
365
  document in a given dump is unique in this dump)
366
  </li>
367
  </ul>
368
  <ul>
 
397
  documents duplicated up to 8 times. This simulation illustrates the inherent difficulties associated with
398
  measuring deduplication impact on the training of LLMs, once the biggest duplicate clusters have been
399
  removed.</p>
400
+
401
  <h4>Other (failed) global approaches</h4>
402
+ <p>To build on top of our newly found method (independently deduplicating each dump), we attempted to improve the performance further by deduplicating the
403
+ independently MinHash-deduplicated 20 trillion tokens of data once more, globally (over all dumps). We explored the following methods:</p>
404
  <ul>
405
  <li>URL deduplication, where we only kept one document per normalized
406
  (lowercased) URL (71.5% of tokens removed, 5.6T left) — <em>FineWeb URL dedup</em></li>
 
430
  <figure><img src="assets/images/dedup_attempts.png"/></figure>
431
  <div id="plot-dedup_attempts"></div>
432
  </div>
433
+
434
  <h3>Additional filtering</h3>
435
+ <p>By this point we had reached the same performance as the previous work we attempted to reproduce and extend:
436
+ RefinedWeb, using our base filtering and independent MinHash. Still, on our aggregate of tasks, another heavily filtered dataset, the C4 dataset<d-cite bibtex-key="raffel2023exploring"></d-cite>, showed stronger performance on some benchmarks of our evaluation suite.</p>
 
437
  <p>We therefore set out to find new filtering steps that
438
  would, at first, allow us to match the performance of C4 and, at a second stage, surpass it. A natural starting point
439
  was to look into the processing of C4 itself.</p>
 
454
  <ul>
455
  <li>applying “All filters” (drop lines not ending in punctuation marks or
456
  mentioning javascript and cookie notices, and drop documents outside length thresholds or containing “lorem
457
+ ipsum” or a curly bracket, <code>{</code>) allows us to match C4’s HellaSwag performance ("All filter" versus "C4" curves); a minimal sketch of these heuristics is shown right after this list.
 
458
  </li>
459
  </ul>
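  <p>The sketch below re-implements the spirit of these heuristics in a few lines of Python. The word-count thresholds are placeholders chosen for illustration, not the values used in C4 or in our pipeline:</p>
  <pre><code class="language-python">TERMINAL_PUNCTUATION = (".", "!", "?", '"')

def c4_like_filter(text, min_words=50, max_words=100_000):
    """Illustrative line- and document-level heuristics in the style of C4."""
    kept_lines = []
    for line in text.split("\n"):
        stripped = line.strip()
        # drop lines not ending in a terminal punctuation mark
        if not stripped.endswith(TERMINAL_PUNCTUATION):
            continue
        # drop lines mentioning javascript or cookie notices
        lowered = stripped.lower()
        if "javascript" in lowered or "cookie" in lowered:
            continue
        kept_lines.append(stripped)
    filtered = "\n".join(kept_lines)
    # drop documents outside the length thresholds (placeholder values)
    if len(filtered.split()) not in range(min_words, max_words + 1):
        return None
    # drop documents containing "lorem ipsum" or a curly bracket
    if "lorem ipsum" in filtered.lower() or "{" in filtered:
        return None
    return filtered
</code></pre>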
460
  <ul>
 
482
  the next section.</p>
483
  <h4>A statistical approach to develop heuristic filters</h4>
484
  <p>To develop new heuristic filters and select their thresholds we devised a systematic process:</p>
485
+ <ol><li>we started by collecting a very large list of high-level statistics of our datasets (over <strong>fifty</strong> different metrics) ranging from common document-level
486
+ metrics (e.g. number of lines, avg. line/word length, etc.) to inter-document repetition metrics (inspired by MassiveText), on both a high-quality and a lower-quality web dataset;</li>
 
487
  <li>we selected the metrics for which the Wasserstein distance between the two distributions (of the metric computed on each dataset) was largest (a minimal sketch of this step is shown after this list);</li>
488
  <li>we inspected the histograms of the two distributions and empirically chose a threshold that would make the lower quality dataset more closely resemble the higher quality one on this metric;</li>
489
  <li>we validated the resulting filter (metric-threshold pair) by using it on a reference dataset and running small ablations.</li>
490
  </ol>
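  <p>As an illustration of the first two steps, the sketch below ranks candidate metrics by the Wasserstein distance between their distributions on the two datasets. The metric names and the synthetic values are made up for the example; in practice the distributions come from the statistics collected in step 1:</p>
  <pre><code class="language-python">import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# synthetic per-document metric values for a higher and a lower quality dataset
metrics_high_quality = {
    "avg_line_length": rng.normal(60, 10, 10_000),
    "ratio_of_lines_ending_in_punctuation": rng.beta(8, 2, 10_000),
}
metrics_lower_quality = {
    "avg_line_length": rng.normal(35, 15, 10_000),
    "ratio_of_lines_ending_in_punctuation": rng.beta(3, 4, 10_000),
}

# rank metrics by how differently they are distributed across the two datasets:
# the larger the distance, the more promising the metric is as a filtering signal
distances = {
    name: wasserstein_distance(metrics_high_quality[name], metrics_lower_quality[name])
    for name in metrics_high_quality
}
for name, distance in sorted(distances.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {distance:.3f}")
</code></pre>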
491
+ <p>Given our new insight that global MinHash greatly upsamples lower-quality data in the oldest dumps, we computed metrics on both the independently
492
  MinHashed and the (worse quality) global MinHashed versions of the 2013-48 and 2015-22 crawls (two older crawls). We then compared the
493
  statistics at a macro level, by looking at the distribution of these metrics for each one.</p>
494
  <p>Perhaps not too surprisingly given our findings for deduplication, we found significant