update

- dist/index.html +51 -55
- src/index.html +48 -53

dist/index.html
CHANGED
@@ -108,7 +108,7 @@
|
|
108 |
releases a new crawl containing 200 to 400 TiB of textual content obtained via automatic web crawling usually
|
109 |
every 1 or 2 months. </p>
|
110 |
<p>As an example, the latest CC crawl (April 2024) contains 2.7
|
111 |
-
billion web pages, totaling 386 TiB of uncompressed HTML text content<d-footnote>Note that the size changes from crawl to crawl
|
112 |
Ninety-six crawls have been released since 2013 and 3 crawls from 2008 to 2012, which are in a different (older) format.
|
113 |
<d-footnote>We have not processed these 3 older crawls.</d-footnote> </p>
|
114 |
|
@@ -150,7 +150,7 @@
|
|
150 |
scores.</p>
|
151 |
<p>Our ablation models were trained using <a
|
152 |
href="https://github.com/huggingface/nanotron"><code>nanotron</code></a> with this config [<strong>TODO:
|
153 |
-
INSERT SIMPLIFIED NANOTRON CONFIG HERE</strong>]. Our ablation models have 1.82B parameters (including embeddings), used the Llama
|
154 |
architecture with a 2048 sequence length, a global batch size of ~2 million tokens, and the GPT2 tokenizer. For most
|
155 |
ablations we trained on ~28B tokens (roughly the Chinchilla<d-cite bibtex-key="hoffmann2022training"></d-cite> optimal training size for this
|
156 |
model size). To confirm relative performance improvements after each step of filtering we conducted longer training runs on 350 billion tokens as mentioned further below.</p>
|
@@ -161,7 +161,7 @@
|
|
161 |
<ul>
|
162 |
<li>small variance between runs trained on different samplings of the same
|
163 |
dataset: we want our runs on a subset of the data to be representative of the whole dataset, and the
|
164 |
-
resulting scores to
|
165 |
</li>
|
166 |
</ul>
|
167 |
<ul>
|
@@ -204,10 +204,10 @@
|
|
204 |
full page HTML and request metadata. <strong>WET</strong> (WARC Encapsulated Text) files provide a text only
|
205 |
version of those websites.</p>
|
206 |
<p>A large number of datasets take the WET files as their
|
207 |
-
starting point. In our experience the default text extraction
|
208 |
-
|
209 |
-
|
210 |
-
|
211 |
<p>To validate this decision, we processed the 2019-18 dump
|
212 |
directly using the WET files and with text extracted from WARC files using trafilatura<d-footnote>We used trafilatura default options with <code>favour_precision=True</code>.</d-footnote>. We applied the same
|
213 |
processing to each one (our base filtering+minhash, detailed below) and trained two models. While the
|
@@ -221,10 +221,11 @@
|
|
221 |
<figure><img src="assets/images/wet_comparison.png"/></figure>
|
222 |
<div id="plot-wet_comparison"></div>
|
223 |
</div>
|
|
|
224 |
<h3>Base filtering</h3>
|
225 |
-
<p>Filtering is an important part of the curation process. It
|
226 |
-
|
227 |
-
deemed to be “lower quality
|
228 |
<p>As a basis for our filtering we used part of the setup
|
229 |
from RefinedWeb<d-cite bibtex-key="penedo2023refinedweb"></d-cite>. Namely, we:</p>
|
230 |
<ul>
|
@@ -243,26 +244,25 @@
|
|
243 |
</li>
|
244 |
</ul>
|
245 |
<p>After applying this filtering to each of the text
|
246 |
-
extracted dumps (there are currently 96 dumps) we obtained roughly 36 trillion tokens of data
|
247 |
-
tokenized with the <code>gpt2</code> tokenizer).</p>
|
248 |
<h3>Deduplication</h3>
|
249 |
<p>Deduplication is one of the most important steps when creating large web datasets for LLM pretraining. Methods to deduplicate datasets attempt to identify and remove redundant/repeated data from the dataset. </p>
|
250 |
<h4>Why deduplicate?</h4>
|
251 |
<p>The web has many aggregators, mirrors, templated pages or
|
252 |
-
just otherwise repeated content spread over different domains and webpages.
|
253 |
-
can be introduced by the crawler itself, when different links point to the same page. </p>
|
254 |
-
<p>Removing these duplicates (deduplicating) has been
|
255 |
-
allow for better generalization. Additionally, the performance uplift obtained through deduplication can
|
256 |
-
efficiency: by removing duplicated content,
|
257 |
-
more diverse data.<d-cite bibtex-key="muennighoff2023scaling"></d-cite><d-cite bibtex-key="hernandez2022scaling"></d-cite></p>
|
258 |
<p>There are different ways to identify and even define
|
259 |
duplicated data. Common approaches rely on hashing techniques to speed up the process, or on building
|
260 |
efficient data structures to index the data (like suffix arrays). Methods can also be “fuzzy”, by using some
|
261 |
similarity metric to mark documents as duplicates, or “exact” by checking for exact matches between two
|
262 |
-
documents (or lines, paragraphs, or whatever other granularity level being used)
|
|
|
263 |
<h4>Our deduplication parameters</h4>
|
264 |
-
<p>
|
265 |
-
fuzzy hash based deduplication technique that scales
|
266 |
112 hash functions in total, split into 14 buckets of 8 hashes each — targeting documents that are at least
|
267 |
75% similar. Documents with the same 8 minhashes in any bucket are considered a duplicate of each other.</p>
|
268 |
<p>This would mean that for two documents with a similarity ($$s$$)
|
@@ -278,28 +278,24 @@
|
|
278 |
allows for a steeper, more well defined cut off (documents with real similarity near the threshold are more likely to be correctly identified), we believe the compute and storage savings are a reasonable
|
279 |
trade off.</p>
|
280 |
<p>It should also be noted that intra-document deduplication is already handled by our repetition filter, which removes documents with many repeated lines and paragraphs.</p>
|
|
|
281 |
<h4>More deduplication is always better, right?</h4>
|
282 |
-
<p>
|
283 |
90+ dumps) and deduplicate them together as one big dataset using MinHash.</p>
|
284 |
<p>We did this in an iterative manner: starting with the most
|
285 |
-
recent dump (which at the time was 2023-50) and proceeding chronologically until the oldest
|
286 |
-
not only within itself, but
|
287 |
dumps. </p>
|
288 |
<p>For instance, for the second most recent dump (2023-40 at
|
289 |
-
the time), we deduplicated it against the most recent one in addition to within itself.
|
290 |
-
dump was deduplicated against all other dumps. As a result, more data was removed from the oldest dumps (last
|
291 |
-
to be deduplicated) than from the most recent ones.</p>
|
292 |
<p>Deduplicating the dataset in this manner resulted in 4
|
293 |
-
trillion tokens of data, but, quite surprisingly
|
294 |
-
tokens subset,
|
295 |
-
green curves below), scoring far below its predecessor RefinedWeb on our aggregate of tasks.</p>
|
296 |
<div class="main-plot-container">
|
297 |
<figure><img src="assets/images/dedup_all_dumps_bad.png"/></figure>
|
298 |
<div id="plot-all_dumps_bad"></div>
|
299 |
</div>
|
300 |
-
<p>This was
|
301 |
-
data was that more deduplication would always result in improved performance. We decided to take a closer
|
302 |
-
look at one of the oldest dumps, dump 2013-48:</p>
|
303 |
<ul>
|
304 |
<li>pre deduplication, this dump had ~490 billion tokens</li>
|
305 |
</ul>
|
@@ -326,23 +322,22 @@
|
|
326 |
<figure><img src="assets/images/removed_data_cross_dedup.png"/></figure>
|
327 |
<div id="plot-removed_data_dedup"></div>
|
328 |
</div>
|
329 |
-
<p>These results show that, for this older dump
|
330 |
-
removed
|
331 |
-
removed (considered independently of all the other dumps). This is also confirmed by visual inspection: <em>originally kept
|
332 |
data</em> contains far more ads, lists of keywords and generally badly formatted text than <em>originally removed data</em>.</p>
|
333 |
<h4>Taking a step back: individual dump dedup</h4>
|
334 |
-
<p>We
|
335 |
-
each dump with MinHash individually (
|
336 |
tokens of data.</p>
|
337 |
<p>When training on a random sample from this dataset we see
|
338 |
-
that it now matches RefinedWeb’s performance (
|
339 |
<div class="main-plot-container">
|
340 |
<figure><img src="assets/images/cross_ind_unfiltered_comparison.png"/></figure>
|
341 |
<div id="plot-ind_dedup_better"></div>
|
342 |
</div>
|
343 |
<p>We hypothesize that the main improvement gained from
|
344 |
deduplication is the removal of very large clusters that are present in every single dump (you will find
|
345 |
-
some examples of these clusters
|
346 |
documents) and that further deduplication for clusters with a low number of duplicates (less than ~100 i.e. the number
|
347 |
of dumps) actually harms performance: data that does not find a duplicate match in any other dump might
|
348 |
actually be worse quality/more out of distribution (as evidenced by the results on the 2013-48 data). </p>
|
@@ -353,7 +348,8 @@
|
|
353 |
improves, this effect may not be as prevalent, since the filtering might be able to remove some of this
|
354 |
lower quality data. We also experimented with applying different, and often “lighter”, deduplication
|
355 |
approaches on top of the individually deduplicated dumps. You can read about them further below.</p>
|
356 |
-
|
|
|
357 |
<p>Given the nature of deduplication, its effect is not
|
358 |
always very visible in a smaller slice of the dataset (such as 28B tokens, the size we used for our
|
359 |
filtering ablations). Furthermore, one must consider the fact that there are specific effects at play when
|
@@ -366,7 +362,7 @@
|
|
366 |
</ul>
|
367 |
<ul>
|
368 |
<li>each dump has been perfectly individually deduplicated (every single
|
369 |
-
document in
|
370 |
</li>
|
371 |
</ul>
|
372 |
<ul>
|
@@ -401,9 +397,10 @@
|
|
401 |
documents duplicated up to 8 times. This simulation illustrates the inherent difficulties associated with
|
402 |
measuring deduplication impact on the training of LLMs, once the biggest duplicate clusters have been
|
403 |
removed.</p>
|
|
|
404 |
<h4>Other (failed) global approaches</h4>
|
405 |
-
<p>We attempted to improve the performance
|
406 |
-
independently minhash deduped 20 trillion tokens of data
|
407 |
<ul>
|
408 |
<li>URL deduplication, where we only kept one document per normalized
|
409 |
(lowercased) URL (71.5% of tokens removed, 5.6T left) — <em>FineWeb URL dedup</em></li>
|
@@ -433,10 +430,10 @@
|
|
433 |
<figure><img src="assets/images/dedup_attempts.png"/></figure>
|
434 |
<div id="plot-dedup_attempts"></div>
|
435 |
</div>
|
|
|
436 |
<h3>Additional filtering</h3>
|
437 |
-
<p>By this point we had reached the same performance
|
438 |
-
RefinedWeb
|
439 |
-
the caveat that it is a relatively small dataset for current web-scale standards).</p>
|
440 |
<p>We therefore set out to find new filtering steps that
|
441 |
would, at first, allow us to match the performance of C4 and, at a second stage, surpass it. A natural starting point
|
442 |
was to look into the processing of C4 itself.</p>
|
@@ -457,8 +454,7 @@
|
|
457 |
<ul>
|
458 |
<li>applying “All filters” (drop lines not ending on punctuation marks,
|
459 |
mentioning javascript and cookie notices + drop documents outside length thresholds, containing “lorem
|
460 |
-
ipsum” or a curly bracket, <code>{</code>) allows us to match C4’s HellaSwag performance (
|
461 |
-
pink curves).
|
462 |
</li>
|
463 |
</ul>
|
464 |
<ul>
|
@@ -486,14 +482,13 @@
|
|
486 |
the next section.</p>
|
487 |
<h4>A statistical approach to develop heuristic filters</h4>
|
488 |
<p>To develop new heuristic filters and select their thresholds we devised a systematic process:</p>
|
489 |
-
<ol><li>we started by collecting a very large list of high level statistics (over <strong>
|
490 |
-
metrics (e.g. number of lines, avg. line/word length, etc) to inter-document repetition metrics (MassiveText
|
491 |
-
inspired), on both a high quality and a lower quality web dataset;</li>
|
492 |
<li>we selected the metrics for which the Wasserstein distance between the two distributions (of the metric computed on each dataset) was larger;</li>
|
493 |
<li>we inspected the histograms of the two distributions and empirically chose a threshold that would make the lower quality dataset more closely resemble the higher quality one on this metric;</li>
|
494 |
<li>we validated the resulting filter (metric-threshold pair) by using it on a reference dataset and running small ablations.</li>
|
495 |
</ol>
|
496 |
-
<p>Due to our assumption that global MinHash greatly upsamples lower quality data in the oldest dumps, we computed metrics on both the independently
|
497 |
MinHashed and the (worse quality) global MinHashed versions of the 2013-48 and 2015-22 crawls (two older crawls). We then compared the
|
498 |
statistics at a macro level, by looking at the distribution of these metrics for each one.</p>
|
499 |
<p>Perhaps not too surprisingly given our findings for deduplication, we found significant
|
@@ -611,7 +606,7 @@
|
|
611 |
<img src="https://cdn-uploads.huggingface.co/production/uploads/61c141342aac764ce1654e43/fjZQ4izIj1rx1xQnBTKKr.png" alt="Prompt for LLM annotation" style="width: 90%; max-width: 800px; height: auto;">
|
612 |
<figcaption style="font-style: italic; margin-top: 10px;">Prompt used for Llama3 annotations of the educational score, also available <a href="https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier/blob/main/utils/prompt.txt">here</a>.</figcaption>
|
613 |
</div>
|
614 |
-
<p>We also experimented with <a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1">Mixtral-
|
615 |
<h3>Classifier Training</h3>
|
616 |
<p>We added a classification head with a single regression output to <a href="https://huggingface.co/Snowflake/snowflake-arctic-embed-m">Snowflake-arctic-embed</a> and trained it on 450,000 Llama 3 annotations for 20 epochs with a learning rate of 3e-4, freezing the embedding and encoder layers. We saved the checkpoint with the highest F1 score on our held-out validation set of 45k samples, treating Llama 3 annotations as ground-truth. After training, we rounded the scores to integers from 0 to 5.</p>
|
617 |
<p>We then converted the problem to a binary classification task by using a fixed threshold to determine if a file is educational. With a threshold of 3, the model achieved an F1 score of 82% on the validation set, indicating strong performance in distinguishing high-quality educational content.</p>
|
@@ -624,7 +619,8 @@
|
|
624 |
</figure>
|
625 |
<div id="plot-edu-8k"></div>
|
626 |
</div>
|
627 |
-
<p>
|
|
|
628 |
<div class="main-plot-container">
|
629 |
<figure>
|
630 |
<img src="assets/images/edu-100k.png">
|
|
|
108 |
releases a new crawl containing 200 to 400 TiB of textual content obtained via automatic web crawling usually
|
109 |
every 1 or 2 months. </p>
|
110 |
<p>As an example, the latest CC crawl (April 2024) contains 2.7
|
111 |
+
billion web pages, totaling 386 TiB of uncompressed HTML text content<d-footnote>Note that the size changes from crawl to crawl. Note also that we use "dump" or "crawl" interchangeably in this report.</d-footnote>.
|
112 |
Ninety-six crawls have been released since 2013 and 3 crawls from 2008 to 2012, which are in a different (older) format.
|
113 |
<d-footnote>We have not processed these 3 older crawls.</d-footnote> </p>
|
114 |
|
|
|
150 |
scores.</p>
|
151 |
<p>Our ablation models were trained using <a
|
152 |
href="https://github.com/huggingface/nanotron"><code>nanotron</code></a> with this config [<strong>TODO:
|
153 |
+
INSERT SIMPLIFIED NANOTRON CONFIG HERE</strong>]. Our "ablation models" have 1.82B parameters (including embeddings), used the Llama
|
154 |
architecture with a 2048 sequence length, a global batch size of ~2 million tokens, and the GPT2 tokenizer. For most
|
155 |
ablations we trained on ~28B tokens (roughly the Chinchilla<d-cite bibtex-key="hoffmann2022training"></d-cite> optimal training size for this
|
156 |
model size). To confirm relative performance improvements after each step of filtering we conducted longer training runs on 350 billion tokens as mentioned further below.</p>
|
|
|
161 |
<ul>
|
162 |
<li>small variance between runs trained on different samplings of the same
|
163 |
dataset: we want our runs on a subset of the data to be representative of the whole dataset, and the
|
164 |
+
resulting scores to be, as far as possible, less sensitive to the exact choice of data points than to the effect of our filtering ablations.
|
165 |
</li>
|
166 |
</ul>
|
167 |
<ul>
|
|
|
204 |
full page HTML and request metadata. <strong>WET</strong> (WARC Encapsulated Text) files provide a text only
|
205 |
version of those websites.</p>
|
206 |
<p>A large number of datasets take the WET files as their
|
207 |
+
starting point. In our experience the default text extraction used by Common Crawl to create these WET files is suboptimal for the goals of LLM pretraining<d-footnote>In particular we suspect that it keeps too much boilerplate content and navigation menus.</d-footnote> and there are a variety of open-source libraries that
|
208 |
+
provide better text extraction. We extracted
|
209 |
+
the text content from the WARC files using the trafilatura library<d-cite bibtex-key="barbaresi-2021-trafilatura"></d-cite>, which from visual inspection of the results provided good quality extraction when compared to other libraries.</p>
|
210 |
+
<aside>You can find a benchmark comparing several text extraction libraries <a href="https://github.com/scrapinghub/article-extraction-benchmark/blob/master/README.rst">here</a>.</aside>
|
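<p>For illustration, here is a minimal sketch of this kind of WARC-to-text extraction, assuming the <code>warcio</code> and <code>trafilatura</code> Python packages. It only illustrates the idea and is not the actual <code>datatrove</code> pipeline used to build the dataset.</p>
<pre><code class="language-python">
# Minimal sketch: pull HTML responses out of a Common Crawl WARC file and run
# trafilatura on them. Illustrative only; assumes `warcio` and `trafilatura`.
import trafilatura
from warcio.archiveiterator import ArchiveIterator


def extract_warc_texts(warc_path: str):
    """Yield (url, extracted_text) pairs for HTML response records."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response" or record.http_headers is None:
                continue
            content_type = record.http_headers.get_header("Content-Type") or ""
            if "text/html" not in content_type:
                continue
            html = record.content_stream().read().decode("utf-8", errors="replace")
            # favor_precision trades some recall for cleaner text (less boilerplate);
            # the footnote below refers to the same option (spelled favour_precision).
            text = trafilatura.extract(html, favor_precision=True)
            if text:
                yield record.rec_headers.get_header("WARC-Target-URI"), text


# Hypothetical local file name, for illustration only.
for url, text in extract_warc_texts("sample.warc.gz"):
    print(url, len(text))
</code></pre>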
211 |
<p>To validate this decision, we processed the 2019-18 dump
|
212 |
directly using the WET files and with text extracted from WARC files using trafilatura<d-footnote>We used trafilatura default options with <code>favour_precision=True</code>.</d-footnote>. We applied the same
|
213 |
processing to each one (our base filtering+minhash, detailed below) and trained two models. While the
|
|
|
221 |
<figure><img src="assets/images/wet_comparison.png"/></figure>
|
222 |
<div id="plot-wet_comparison"></div>
|
223 |
</div>
|
224 |
+
|
225 |
<h3>Base filtering</h3>
|
226 |
+
<p>Filtering is an important part of the curation process. It consists of
|
227 |
+
removing part of the data (which can mean removing words, lines, or even full documents) that lowers the performance of the model and is thus
|
228 |
+
deemed to be “lower quality” in our eval-driven process of dataset crafting.</p>
|
229 |
<p>As a basis for our filtering we used part of the setup
|
230 |
from RefinedWeb<d-cite bibtex-key="penedo2023refinedweb"></d-cite>. Namely, we:</p>
|
231 |
<ul>
|
|
|
244 |
</li>
|
245 |
</ul>
|
246 |
<p>After applying this filtering to each of the text
|
247 |
+
extracted dumps (there are currently 96 dumps) we obtained roughly 36 trillion tokens of data<d-footnote>As everywhere in this report: this is the number of tokens when tokenized with the <code>gpt2</code> tokenizer</d-footnote>.</p>
|
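<p>For reference, a tiny sketch of how such token counts can be reproduced with the <code>gpt2</code> tokenizer from the <code>transformers</code> library (the documents below are placeholders):</p>
<pre><code class="language-python">
# Count gpt2 tokens for a handful of documents; just one way to reproduce the
# token-count convention used throughout this report.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

documents = [
    "Common Crawl releases a new crawl roughly every 1 or 2 months.",
    "Deduplication is one of the most important steps when creating web datasets.",
]

total_tokens = sum(len(tokenizer(doc)["input_ids"]) for doc in documents)
print(f"{total_tokens} gpt2 tokens across {len(documents)} documents")
</code></pre>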
|
|
248 |
<h3>Deduplication</h3>
|
249 |
<p>Deduplication is one of the most important steps when creating large web datasets for LLM pretraining. Methods to deduplicate datasets attempt to identify and remove redundant/repeated data from the dataset. </p>
|
250 |
<h4>Why deduplicate?</h4>
|
251 |
<p>The web has many aggregators, mirrors, templated pages or
|
252 |
+
just otherwise repeated content spread over different domains and webpages. Sometimes, these duplicated pages
|
253 |
+
can even be introduced by the crawler itself, when different links point to the same page. </p>
|
254 |
+
<p>Removing these duplicates (deduplicating) has been correlated with improvements in model performance<d-cite bibtex-key="lee2022deduplicating"></d-cite> and a reduction in memorization of pretraining data<d-cite bibtex-key="carlini2023quantifying"></d-cite>, which might
|
255 |
+
allow for better generalization. Additionally, the performance uplift obtained through deduplication can be equated to an increased training
|
256 |
+
efficiency: by removing duplicated content, a model can reach the same performance level with fewer training iterations, or equivalently, for a given number of training tokens, a model will have seen more diverse data.<d-cite bibtex-key="muennighoff2023scaling"></d-cite><d-cite bibtex-key="hernandez2022scaling"></d-cite></p>
|
|
|
257 |
<p>There are different ways to identify and even define
|
258 |
duplicated data. Common approaches rely on hashing techniques to speed up the process, or on building
|
259 |
efficient data structures to index the data (like suffix arrays). Methods can also be “fuzzy”, by using some
|
260 |
similarity metric to mark documents as duplicates, or “exact” by checking for exact matches between two
|
261 |
+
documents (or lines, paragraphs, or whatever other granularity level is being used)<d-footnote>Note that here, even when we discuss "fuzzy" deduplication, we only employ methods that operate on character/word matches, i.e. surface-level text. A more complex concept is "semantic" deduplication: comparing/removing texts that cover the same concepts but use, for instance, synonyms or paraphrasing. We don't discuss these topics here, but note that they can be important, for instance, in the field of large-scale synthetic data generation (see our <a href="https://huggingface.co/blog/cosmopedia">Cosmopedia release</a> on this topic).</d-footnote>.</p>
|
262 |
+
|
263 |
<h4>Our deduplication parameters</h4>
|
264 |
+
<p>Following RefinedWeb<d-cite bibtex-key="penedo2023refinedweb"></d-cite>, we decided to apply MinHash, a
|
265 |
+
fuzzy hash based deduplication technique that scales efficiently to many CPU-nodes and allows us to tune similarity thresholds (by controlling the number and size of buckets) as well as the length of the sequences considered (by controlling the n-gram size). We chose to work on 5-grams<d-footnote>Our units are "words", computed in the <a href="https://github.com/huggingface/datatrove/blob/e9963f69f1fbab1a61339bd1b497f6e138b9f47f/src/datatrove/pipeline/dedup/minhash.py#L196">MinHash processing function</a> with a <a href="https://github.com/huggingface/datatrove/blob/e9963f69f1fbab1a61339bd1b497f6e138b9f47f/src/datatrove/utils/word_tokenizers.py#L323">language-specific word tokenizer</a>.</d-footnote> and compute minhashes using
|
266 |
112 hash functions in total, split into 14 buckets of 8 hashes each — targeting documents that are at least
|
267 |
75% similar. Documents with the same 8 minhashes in any bucket are considered a duplicate of each other.</p>
|
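<p>As a quick illustration of what these parameters imply, the standard MinHash-LSH formula below gives the probability that two documents with true similarity $$s$$ are flagged as duplicates: they must agree on all 8 hashes in at least one of the 14 buckets.</p>
<pre><code class="language-python">
# Probability that two documents with MinHash (Jaccard) similarity s are flagged
# as duplicates with `buckets` buckets of `hashes_per_bucket` hashes each:
# they match if all hashes agree in at least one bucket.
def match_probability(s: float, buckets: int = 14, hashes_per_bucket: int = 8) -> float:
    return 1.0 - (1.0 - s**hashes_per_bucket) ** buckets


for s in (0.5, 0.6, 0.7, 0.75, 0.8, 0.9):
    print(f"similarity {s:.2f} -> flagged with probability {match_probability(s):.3f}")
</code></pre>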
268 |
<p>This would mean that for two documents with a similarity ($$s$$)
|
|
|
278 |
allows for a steeper, more well defined cut off (documents with real similarity near the threshold are more likely to be correctly identified), we believe the compute and storage savings are a reasonable
|
279 |
trade off.</p>
|
280 |
<p>It should also be noted that intra-document deduplication is already handled by our repetition filter, which removes documents with many repeated lines and paragraphs.</p>
|
281 |
+
|
282 |
<h4>More deduplication is always better, right?</h4>
|
283 |
+
<p>We started the project with the assumption that <em>more deduplication is always better</em>, so our initial approach was to take the entire dataset (all
|
284 |
90+ dumps) and deduplicate them together as one big dataset using MinHash.</p>
|
285 |
<p>We did this in an iterative manner: starting with the most
|
286 |
+
recent dump (which at the time was 2023-50) and proceeding chronologically until we reached the oldest crawl. We deduplicated each dump
|
287 |
+
not only within itself, but also removing any document matching a document in any of the previously processed
|
288 |
dumps. </p>
|
289 |
<p>For instance, for the second most recent dump (2023-40 at
|
290 |
+
the time), we deduplicated it against the most recent one in addition to within itself. As a result, the older a dump was, the more dumps it was deduplicated against and the more data we removed from it (indeed, in the oldest dumps we removed more than 90% of the data in this deduplication step).</p>
|
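<p>Conceptually, this iterative cross-dump deduplication can be sketched as follows (with a toy exact-hash signature standing in for the real MinHash buckets):</p>
<pre><code class="language-python">
# Conceptual sketch of iterative cross-dump deduplication: process dumps from
# newest to oldest and drop any document whose signature was already seen in the
# current dump or in any previously processed dump. The signature function here
# is a toy exact hash, NOT the real 14-buckets-of-8-hashes MinHash scheme.
import hashlib


def signature(text: str) -> str:
    return hashlib.sha1(text.lower().encode()).hexdigest()


def cross_dedup(dumps_newest_first: list[list[str]]) -> list[list[str]]:
    seen: set[str] = set()
    kept_per_dump = []
    for dump in dumps_newest_first:
        kept = []
        for doc in dump:
            sig = signature(doc)
            if sig in seen:  # duplicate within this dump or of a more recent dump
                continue
            seen.add(sig)
            kept.append(doc)
        kept_per_dump.append(kept)
    return kept_per_dump


dumps = [["a b c", "x y z"], ["a b c", "something new"]]  # newest dump first
print(cross_dedup(dumps))  # the older dump loses its copy of "a b c"
</code></pre>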
|
|
|
|
291 |
<p>Deduplicating the dataset in this manner resulted in 4
|
292 |
+
trillion tokens of data, but, quite surprisingly to us, when training on a randomly sampled 350 billion
|
293 |
+
tokens subset, our ablation models showed no improvement over a model trained on the non deduplicated data, scoring far below its predecessor RefinedWeb on our aggregate of tasks (see graph below).</p>
|
|
|
294 |
<div class="main-plot-container">
|
295 |
<figure><img src="assets/images/dedup_all_dumps_bad.png"/></figure>
|
296 |
<div id="plot-all_dumps_bad"></div>
|
297 |
</div>
|
298 |
+
<p>This challenged our assumption that more deduplication was always better, so we decided to take a closer look at one of the oldest dumps, dump 2013-48:</p>
|
|
|
|
|
299 |
<ul>
|
300 |
<li>pre deduplication, this dump had ~490 billion tokens</li>
|
301 |
</ul>
|
|
|
322 |
<figure><img src="assets/images/removed_data_cross_dedup.png"/></figure>
|
323 |
<div id="plot-removed_data_dedup"></div>
|
324 |
</div>
|
325 |
+
<p>These results show that, for this older dump taken in isolation, the data that was kept (10% of the original data) was actually <em>worse</em> than the 90% of data we
|
326 |
+
removed<d-footnote>Note that these ablation models are trained only on data from this dump so it's considered independently of all the other dumps.</d-footnote>. This is also confirmed by visual inspection: <em>originally kept
|
|
|
327 |
data</em> contains far more ads, lists of keywords and generally badly formatted text than <em>originally removed data</em>.</p>
|
328 |
<h4>Taking a step back: individual dump dedup</h4>
|
329 |
+
<p>We decided to experiment with alternative approaches: we deduplicated
|
330 |
+
each dump with MinHash individually (independently of the other dumps). This resulted in 20 trillion
|
331 |
tokens of data.</p>
|
332 |
<p>When training on a random sample from this dataset we see
|
333 |
+
that it now matches RefinedWeb’s performance (see curves below):</p>
|
334 |
<div class="main-plot-container">
|
335 |
<figure><img src="assets/images/cross_ind_unfiltered_comparison.png"/></figure>
|
336 |
<div id="plot-ind_dedup_better"></div>
|
337 |
</div>
|
338 |
<p>We hypothesize that the main improvement gained from
|
339 |
deduplication is the removal of very large clusters that are present in every single dump (you will find
|
340 |
+
some examples of these clusters in the RefinedWeb paper, each containing <em>hundreds of thousands</em> of
|
341 |
documents) and that further deduplication for clusters with a low number of duplicates (less than ~100 i.e. the number
|
342 |
of dumps) actually harms performance: data that does not find a duplicate match in any other dump might
|
343 |
actually be worse quality/more out of distribution (as evidenced by the results on the 2013-48 data). </p>
|
|
|
348 |
improves, this effect may not be as prevalent, since the filtering might be able to remove some of this
|
349 |
lower quality data. We also experimented with applying different, and often “lighter”, deduplication
|
350 |
approaches on top of the individually deduplicated dumps. You can read about them further below.</p>
|
351 |
+
|
352 |
+
<h4>A note on measuring the effect of deduplication</h4>
|
353 |
<p>Given the nature of deduplication, its effect is not
|
354 |
always very visible in a smaller slice of the dataset (such as 28B tokens, the size we used for our
|
355 |
filtering ablations). Furthermore, one must consider the fact that there are specific effects at play when
|
|
|
362 |
</ul>
|
363 |
<ul>
|
364 |
<li>each dump has been perfectly individually deduplicated (every single
|
365 |
+
document in a given dump is unique in that dump)
|
366 |
</li>
|
367 |
</ul>
|
368 |
<ul>
|
|
|
397 |
documents duplicated up to 8 times. This simulation illustrates the inherent difficulties associated with
|
398 |
measuring deduplication impact on the training of LLMs, once the biggest duplicate clusters have been
|
399 |
removed.</p>
|
400 |
+
|
401 |
<h4>Other (failed) global approaches</h4>
|
402 |
+
<p>To build on top of our newly found method (independently deduplicating each dump), we attempted to further improve the performance by deduplicating the
|
403 |
+
independently MinHash deduped 20 trillion tokens of data (globally, over all dumps). We explored the following methods:</p>
|
404 |
<ul>
|
405 |
<li>URL deduplication, where we only kept one document per normalized
|
406 |
(lowercased) URL (71.5% of tokens removed, 5.6T left) — <em>FineWeb URL dedup</em></li>
|
|
|
430 |
<figure><img src="assets/images/dedup_attempts.png"/></figure>
|
431 |
<div id="plot-dedup_attempts"></div>
|
432 |
</div>
|
433 |
+
|
434 |
<h3>Additional filtering</h3>
|
435 |
+
<p>By this point we had reached the same performance as the previous work we attempted to reproduce and extend:
|
436 |
+
RefinedWeb, using our base filtering and independent MinHash. Still, on our aggregate of tasks, another heavily filtered dataset, the C4 dataset<d-cite bibtex-key="raffel2023exploring"></d-cite>, showed stronger performance on some benchmarks of our evaluation suite.</p>
|
|
|
437 |
<p>We therefore set out to find new filtering steps that
|
438 |
would, at first, allow us to match the performance of C4 and, at a second stage, surpass it. A natural starting point
|
439 |
was to look into the processing of C4 itself.</p>
|
|
|
454 |
<ul>
|
455 |
<li>applying “All filters” (drop lines not ending on punctuation marks,
|
456 |
mentioning javascript and cookie notices + drop documents outside length thresholds, containing “lorem
|
457 |
+
ipsum” or a curly bracket, <code>{</code>) allows us to match C4’s HellaSwag performance ("All filters" versus "C4" curves).
|
|
|
458 |
</li>
|
459 |
</ul>
|
460 |
<ul>
|
|
|
482 |
the next section.</p>
|
483 |
<h4>A statistical approach to develop heuristic filters</h4>
|
484 |
<p>To develop new heuristic filters and select their thresholds we devised a systematic process:</p>
|
485 |
+
<ol><li>we started by collecting a very large list of high level statistics of our datasets (over <strong>fifty</strong> different metrics) ranging from common document-level
|
486 |
+
metrics (e.g. number of lines, avg. line/word length, etc) to inter-document repetition metrics (inspired by MassiveText), on both a high quality and a lower quality web dataset;</li>
|
|
|
487 |
<li>we selected the metrics for which the Wasserstein distance between the two distributions (of the metric computed on each dataset) was larger;</li>
|
488 |
<li>we inspected the histograms of the two distributions and empirically chose a threshold that would make the lower quality dataset more closely resemble the higher quality one on this metric;</li>
|
489 |
<li>we validated the resulting filter (metric-threshold pair) by using it on a reference dataset and running small ablations.</li>
|
490 |
</ol>
|
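<p>A small sketch of steps 1 and 2 of this process, using the Wasserstein distance from <code>scipy</code> on hypothetical, randomly generated metric values (the metric names and distributions below are purely illustrative):</p>
<pre><code class="language-python">
# Rank candidate document-level metrics by how much their distributions differ
# between a higher-quality and a lower-quality dataset. The values here are
# synthetic stand-ins for metrics computed on real documents.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

high_quality = {
    "avg_word_length": rng.normal(4.8, 0.6, 10_000),
    "frac_lines_ending_in_punct": rng.beta(8, 2, 10_000),
}
lower_quality = {
    "avg_word_length": rng.normal(5.4, 1.1, 10_000),
    "frac_lines_ending_in_punct": rng.beta(3, 3, 10_000),
}

distances = {
    name: wasserstein_distance(high_quality[name], lower_quality[name])
    for name in high_quality
}
# Metrics with the largest distance are the most promising filter candidates.
for name, dist in sorted(distances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {dist:.3f}")
</code></pre>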
491 |
+
<p>Due to our (new) assumption that global MinHash greatly upsamples lower quality data in the oldest dumps, we computed metrics on both the independently
|
492 |
MinHashed and the (worse quality) global MinHashed versions of the 2013-48 and 2015-22 crawls (two older crawls). We then compared the
|
493 |
statistics at a macro level, by looking at the distribution of these metrics for each one.</p>
|
494 |
<p>Perhaps not too surprisingly given our findings for deduplication, we found significant
|
|
|
606 |
<img src="https://cdn-uploads.huggingface.co/production/uploads/61c141342aac764ce1654e43/fjZQ4izIj1rx1xQnBTKKr.png" alt="Prompt for LLM annotation" style="width: 90%; max-width: 800px; height: auto;">
|
607 |
<figcaption style="font-style: italic; margin-top: 10px;">Prompt used for Llama3 annotations of the educational score, also available <a href="https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier/blob/main/utils/prompt.txt">here</a>.</figcaption>
|
608 |
</div>
|
609 |
+
<p>We also experimented with <a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1">Mixtral-8x7B-Instruct</a> and <a href="https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1">Mixtral-8x22B-Instruct</a> and a jury of all three models<d-cite bibtex-key="verga2024replacing"></d-cite> but found that Llama3 alone gave the most reliable results.</p>
|
610 |
<h3>Classifier Training</h3>
|
611 |
<p>We added a classification head with a single regression output to <a href="https://huggingface.co/Snowflake/snowflake-arctic-embed-m">Snowflake-arctic-embed</a> and trained it on 450,000 Llama 3 annotations for 20 epochs with a learning rate of 3e-4, freezing the embedding and encoder layers. We saved the checkpoint with the highest F1 score on our held-out validation set of 45k samples, treating Llama 3 annotations as ground-truth. After training, we rounded the scores to integers from 0 to 5.</p>
|
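<p>A minimal sketch of this setup with the <code>transformers</code> library is shown below. The numbers above (450k annotations, 20 epochs, learning rate 3e-4) come from the text; everything else in the snippet is an illustrative assumption rather than the released training script.</p>
<pre><code class="language-python">
# Sketch: a single regression output on top of Snowflake-arctic-embed, with the
# embedding and encoder layers frozen. Illustrative only.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "Snowflake/snowflake-arctic-embed-m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=1,                # single regression output (the 0-5 score)
    problem_type="regression",
)

# Freeze the backbone; only the newly added classification head is trained.
for param in model.base_model.parameters():
    param.requires_grad = False

texts = ["Photosynthesis converts light energy into chemical energy stored in glucose."]
batch = tokenizer(texts, return_tensors="pt", truncation=True, padding=True)
labels = torch.tensor([[4.0]])   # Llama 3 educational score used as the target

outputs = model(**batch, labels=labels)  # MSE loss for problem_type="regression"
outputs.loss.backward()                  # gradients for one illustrative step (no optimizer shown)
print(float(outputs.loss))
</code></pre>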
612 |
<p>We then converted the problem to a binary classification task by using a fixed threshold to determine if a file is educational. With a threshold of 3, the model achieved an F1 score of 82% on the validation set, indicating strong performance in distinguishing high-quality educational content.</p>
|
|
|
619 |
</figure>
|
620 |
<div id="plot-edu-8k"></div>
|
621 |
</div>
|
622 |
+
<p><strong>Note:</strong> this ablation was conducted on 8B tokens from the 2024-10 dump for both the FineWeb and FineWeb-Edu subsets, which might not be representative of the entire dataset. The next ablation shows that the findings for threshold 3 hold on a longer run of 350B tokens from all FineWeb dumps, except for HellaSwag, where we noticed a slight performance degradation.</p>
|
623 |
+
<p>We built 📚 FineWeb-Edu by filtering out samples with scores lower than 3. This removed 92% of the dataset, leaving us with 1.3 trillion educational tokens. To evaluate the effectiveness of this filtering at a larger scale, we conducted an ablation using a 1.82B model trained on 350 billion tokens, similar to the FineWeb filtering ablation mentioned above:</p>
|
624 |
<div class="main-plot-container">
|
625 |
<figure>
|
626 |
<img src="assets/images/edu-100k.png">
|
src/index.html
CHANGED
@@ -108,7 +108,7 @@
|
|
108 |
releases a new crawl containing 200 to 400 TiB of textual content obtained via automatic web crawling usually
|
109 |
every 1 or 2 months. </p>
|
110 |
<p>As an example, the latest CC crawl (April 2024) contains 2.7
|
111 |
-
billion web pages, totaling 386 TiB of uncompressed HTML text content<d-footnote>Note that the size changes from crawl to crawl
|
112 |
Ninety-six crawls have been released since 2013 and 3 crawls from 2008 to 2012, which are in a different (older) format.
|
113 |
<d-footnote>We have not processed these 3 older crawls.</d-footnote> </p>
|
114 |
|
@@ -150,7 +150,7 @@
|
|
150 |
scores.</p>
|
151 |
<p>Our ablation models were trained using <a
|
152 |
href="https://github.com/huggingface/nanotron"><code>nanotron</code></a> with this config [<strong>TODO:
|
153 |
-
INSERT SIMPLIFIED NANOTRON CONFIG HERE</strong>]. Our ablation models have 1.82B parameters (including embeddings), used the Llama
|
154 |
architecture with a 2048 sequence length, a global batch size of ~2 million tokens, and the GPT2 tokenizer. For most
|
155 |
ablations we trained on ~28B tokens (roughly the Chinchilla<d-cite bibtex-key="hoffmann2022training"></d-cite> optimal training size for this
|
156 |
model size). To confirm relative performance improvements after each step of filtering we conducted longer training runs on 350 billion tokens as mentioned further below.</p>
|
@@ -161,7 +161,7 @@
|
|
161 |
<ul>
|
162 |
<li>small variance between runs trained on different samplings of the same
|
163 |
dataset: we want our runs on a subset of the data to be representative of the whole dataset, and the
|
164 |
-
resulting scores to
|
165 |
</li>
|
166 |
</ul>
|
167 |
<ul>
|
@@ -204,10 +204,10 @@
|
|
204 |
full page HTML and request metadata. <strong>WET</strong> (WARC Encapsulated Text) files provide a text only
|
205 |
version of those websites.</p>
|
206 |
<p>A large number of datasets take the WET files as their
|
207 |
-
starting point. In our experience the default text extraction
|
208 |
-
|
209 |
-
|
210 |
-
|
211 |
<p>To validate this decision, we processed the 2019-18 dump
|
212 |
directly using the WET files and with text extracted from WARC files using trafilatura<d-footnote>We used trafilatura default options with <code>favour_precision=True</code>.</d-footnote>. We applied the same
|
213 |
processing to each one (our base filtering+minhash, detailed below) and trained two models. While the
|
@@ -221,10 +221,11 @@
|
|
221 |
<figure><img src="assets/images/wet_comparison.png"/></figure>
|
222 |
<div id="plot-wet_comparison"></div>
|
223 |
</div>
|
|
|
224 |
<h3>Base filtering</h3>
|
225 |
-
<p>Filtering is an important part of the curation process. It
|
226 |
-
|
227 |
-
deemed to be “lower quality
|
228 |
<p>As a basis for our filtering we used part of the setup
|
229 |
from RefinedWeb<d-cite bibtex-key="penedo2023refinedweb"></d-cite>. Namely, we:</p>
|
230 |
<ul>
|
@@ -243,26 +244,25 @@
|
|
243 |
</li>
|
244 |
</ul>
|
245 |
<p>After applying this filtering to each of the text
|
246 |
-
extracted dumps (there are currently 96 dumps) we obtained roughly 36 trillion tokens of data
|
247 |
-
tokenized with the <code>gpt2</code> tokenizer).</p>
|
248 |
<h3>Deduplication</h3>
|
249 |
<p>Deduplication is one of the most important steps when creating large web datasets for LLM pretraining. Methods to deduplicate datasets attempt to identify and remove redundant/repeated data from the dataset. </p>
|
250 |
<h4>Why deduplicate?</h4>
|
251 |
<p>The web has many aggregators, mirrors, templated pages or
|
252 |
-
just otherwise repeated content spread over different domains and webpages.
|
253 |
-
can be introduced by the crawler itself, when different links point to the same page. </p>
|
254 |
-
<p>Removing these duplicates (deduplicating) has been
|
255 |
-
allow for better generalization. Additionally, the performance uplift obtained through deduplication can
|
256 |
-
efficiency: by removing duplicated content,
|
257 |
-
more diverse data.<d-cite bibtex-key="muennighoff2023scaling"></d-cite><d-cite bibtex-key="hernandez2022scaling"></d-cite></p>
|
258 |
<p>There are different ways to identify and even define
|
259 |
duplicated data. Common approaches rely on hashing techniques to speed up the process, or on building
|
260 |
efficient data structures to index the data (like suffix arrays). Methods can also be “fuzzy”, by using some
|
261 |
similarity metric to mark documents as duplicates, or “exact” by checking for exact matches between two
|
262 |
-
documents (or lines, paragraphs, or whatever other granularity level being used)
|
|
|
263 |
<h4>Our deduplication parameters</h4>
|
264 |
-
<p>
|
265 |
-
fuzzy hash based deduplication technique that scales
|
266 |
112 hash functions in total, split into 14 buckets of 8 hashes each — targeting documents that are at least
|
267 |
75% similar. Documents with the same 8 minhashes in any bucket are considered a duplicate of each other.</p>
|
268 |
<p>This would mean that for two documents with a similarity ($$s$$)
|
@@ -278,28 +278,24 @@
|
|
278 |
allows for a steeper, more well defined cut off (documents with real similarity near the threshold are more likely to be correctly identified), we believe the compute and storage savings are a reasonable
|
279 |
trade off.</p>
|
280 |
<p>It should also be noted that intra-document deduplication is already handled by our repetition filter, which removes documents with many repeated lines and paragraphs.</p>
|
|
|
281 |
<h4>More deduplication is always better, right?</h4>
|
282 |
-
<p>
|
283 |
90+ dumps) and deduplicate them together as one big dataset using MinHash.</p>
|
284 |
<p>We did this in an iterative manner: starting with the most
|
285 |
-
recent dump (which at the time was 2023-50) and proceeding chronologically until the oldest
|
286 |
-
not only within itself, but
|
287 |
dumps. </p>
|
288 |
<p>For instance, for the second most recent dump (2023-40 at
|
289 |
-
the time), we deduplicated it against the most recent one in addition to within itself.
|
290 |
-
dump was deduplicated against all other dumps. As a result, more data was removed from the oldest dumps (last
|
291 |
-
to be deduplicated) than from the most recent ones.</p>
|
292 |
<p>Deduplicating the dataset in this manner resulted in 4
|
293 |
-
trillion tokens of data, but, quite surprisingly
|
294 |
-
tokens subset,
|
295 |
-
green curves below), scoring far below its predecessor RefinedWeb on our aggregate of tasks.</p>
|
296 |
<div class="main-plot-container">
|
297 |
<figure><img src="assets/images/dedup_all_dumps_bad.png"/></figure>
|
298 |
<div id="plot-all_dumps_bad"></div>
|
299 |
</div>
|
300 |
-
<p>This was
|
301 |
-
data was that more deduplication would always result in improved performance. We decided to take a closer
|
302 |
-
look at one of the oldest dumps, dump 2013-48:</p>
|
303 |
<ul>
|
304 |
<li>pre deduplication, this dump had ~490 billion tokens</li>
|
305 |
</ul>
|
@@ -326,23 +322,22 @@
|
|
326 |
<figure><img src="assets/images/removed_data_cross_dedup.png"/></figure>
|
327 |
<div id="plot-removed_data_dedup"></div>
|
328 |
</div>
|
329 |
-
<p>These results show that, for this older dump
|
330 |
-
removed
|
331 |
-
removed (considered independently of all the other dumps). This is also confirmed by visual inspection: <em>originally kept
|
332 |
data</em> contains far more ads, lists of keywords and generally badly formatted text than <em>originally removed data</em>.</p>
|
333 |
<h4>Taking a step back: individual dump dedup</h4>
|
334 |
-
<p>We
|
335 |
-
each dump with MinHash individually (
|
336 |
tokens of data.</p>
|
337 |
<p>When training on a random sample from this dataset we see
|
338 |
-
that it now matches RefinedWeb’s performance (
|
339 |
<div class="main-plot-container">
|
340 |
<figure><img src="assets/images/cross_ind_unfiltered_comparison.png"/></figure>
|
341 |
<div id="plot-ind_dedup_better"></div>
|
342 |
</div>
|
343 |
<p>We hypothesize that the main improvement gained from
|
344 |
deduplication is the removal of very large clusters that are present in every single dump (you will find
|
345 |
-
some examples of these clusters
|
346 |
documents) and that further deduplication for clusters with a low number of duplicates (less than ~100 i.e. the number
|
347 |
of dumps) actually harms performance: data that does not find a duplicate match in any other dump might
|
348 |
actually be worse quality/more out of distribution (as evidenced by the results on the 2013-48 data). </p>
|
@@ -353,7 +348,8 @@
|
|
353 |
improves, this effect may not be as prevalent, since the filtering might be able to remove some of this
|
354 |
lower quality data. We also experimented with applying different, and often “lighter”, deduplication
|
355 |
approaches on top of the individually deduplicated dumps. You can read about them further below.</p>
|
356 |
-
|
|
|
357 |
<p>Given the nature of deduplication, its effect is not
|
358 |
always very visible in a smaller slice of the dataset (such as 28B tokens, the size we used for our
|
359 |
filtering ablations). Furthermore, one must consider the fact that there are specific effects at play when
|
@@ -366,7 +362,7 @@
|
|
366 |
</ul>
|
367 |
<ul>
|
368 |
<li>each dump has been perfectly individually deduplicated (every single
|
369 |
-
document in
|
370 |
</li>
|
371 |
</ul>
|
372 |
<ul>
|
@@ -401,9 +397,10 @@
|
|
401 |
documents duplicated up to 8 times. This simulation illustrates the inherent difficulties associated with
|
402 |
measuring deduplication impact on the training of LLMs, once the biggest duplicate clusters have been
|
403 |
removed.</p>
|
|
|
404 |
<h4>Other (failed) global approaches</h4>
|
405 |
-
<p>We attempted to improve the performance
|
406 |
-
independently minhash deduped 20 trillion tokens of data
|
407 |
<ul>
|
408 |
<li>URL deduplication, where we only kept one document per normalized
|
409 |
(lowercased) URL (71.5% of tokens removed, 5.6T left) — <em>FineWeb URL dedup</em></li>
|
@@ -433,10 +430,10 @@
|
|
433 |
<figure><img src="assets/images/dedup_attempts.png"/></figure>
|
434 |
<div id="plot-dedup_attempts"></div>
|
435 |
</div>
|
|
|
436 |
<h3>Additional filtering</h3>
|
437 |
-
<p>By this point we had reached the same performance
|
438 |
-
RefinedWeb
|
439 |
-
the caveat that it is a relatively small dataset for current web-scale standards).</p>
|
440 |
<p>We therefore set out to find new filtering steps that
|
441 |
would, at first, allow us to match the performance of C4 and, at a second stage, surpass it. A natural starting point
|
442 |
was to look into the processing of C4 itself.</p>
|
@@ -457,8 +454,7 @@
|
|
457 |
<ul>
|
458 |
<li>applying “All filters” (drop lines not ending on punctuation marks,
|
459 |
mentioning javascript and cookie notices + drop documents outside length thresholds, containing “lorem
|
460 |
-
ipsum” or a curly bracket, <code>{</code>) allows us to match C4’s HellaSwag performance (
|
461 |
-
pink curves).
|
462 |
</li>
|
463 |
</ul>
|
464 |
<ul>
|
@@ -486,14 +482,13 @@
|
|
486 |
the next section.</p>
|
487 |
<h4>A statistical approach to develop heuristic filters</h4>
|
488 |
<p>To develop new heuristic filters and select their thresholds we devised a systematic process:</p>
|
489 |
-
<ol><li>we started by collecting a very large list of high level statistics (over <strong>
|
490 |
-
metrics (e.g. number of lines, avg. line/word length, etc) to inter-document repetition metrics (MassiveText
|
491 |
-
inspired), on both a high quality and a lower quality web dataset;</li>
|
492 |
<li>we selected the metrics for which the Wasserstein distance between the two distributions (of the metric computed on each dataset) was larger;</li>
|
493 |
<li>we inspected the histograms of the two distributions and empirically chose a threshold that would make the lower quality dataset more closely resemble the higher quality one on this metric;</li>
|
494 |
<li>we validated the resulting filter (metric-threshold pair) by using it on a reference dataset and running small ablations.</li>
|
495 |
</ol>
|
496 |
-
<p>Due to our assumption that global MinHash greatly upsamples lower quality data in the oldest dumps, we computed metrics on both the independently
|
497 |
MinHashed and the (worse quality) global MinHashed versions of the 2013-48 and 2015-22 crawls (two older crawls). We then compared the
|
498 |
statistics at a macro level, by looking at the distribution of these metrics for each one.</p>
|
499 |
<p>Perhaps not too surprisingly given our findings for deduplication, we found significant
|
|
|
108 |
releases a new crawl containing 200 to 400 TiB of textual content obtained via automatic web crawling usually
|
109 |
every 1 or 2 months. </p>
|
110 |
<p>As an example, the latest CC crawl (April 2024) contains 2.7
|
111 |
+
billion web pages, totaling 386 TiB of uncompressed HTML text content<d-footnote>Note that the size changes from crawl to crawl. Note also that we use "dump" or "crawl" interchangeably in this report.</d-footnote>.
|
112 |
Ninety-six crawls have been released since 2013 and 3 crawls from 2008 to 2012, which are in a different (older) format.
|
113 |
<d-footnote>We have not processed these 3 older crawls.</d-footnote> </p>
|
114 |
|
|
|
150 |
scores.</p>
|
151 |
<p>Our ablation models were trained using <a
|
152 |
href="https://github.com/huggingface/nanotron"><code>nanotron</code></a> with this config [<strong>TODO:
|
153 |
+
INSERT SIMPLIFIED NANOTRON CONFIG HERE</strong>]. Our "ablation models" have 1.82B parameters (including embeddings), used the Llama
|
154 |
architecture with a 2048 sequence length, a global batch size of ~2 million tokens, and the GPT2 tokenizer. For most
|
155 |
ablations we trained on ~28B tokens (roughly the Chinchilla<d-cite bibtex-key="hoffmann2022training"></d-cite> optimal training size for this
|
156 |
model size). To confirm relative performance improvements after each step of filtering we conducted longer training runs on 350 billion tokens as mentioned further below.</p>
|
|
|
161 |
<ul>
|
162 |
<li>small variance between runs trained on different samplings of the same
|
163 |
dataset: we want our runs on a subset of the data to be representative of the whole dataset, and the
|
164 |
+
resulting scores to be, as far as possible, less sensitive to the exact choice of data points than to the effect of our filtering ablations.
|
165 |
</li>
|
166 |
</ul>
|
167 |
<ul>
|
|
|
204 |
full page HTML and request metadata. <strong>WET</strong> (WARC Encapsulated Text) files provide a text only
|
205 |
version of those websites.</p>
|
206 |
<p>A large number of datasets take the WET files as their
|
207 |
+
starting point. In our experience the default text extraction used by Common Crawl to create these WET files is suboptimal for the goals of LLM pretraining<d-footnote>In particular we suspect that it keeps too much boilerplate content and navigation menus.</d-footnote> and there are a variety of open-source libraries that
|
208 |
+
provide better text extraction. We extracted
|
209 |
+
the text content from the WARC files using the trafilatura library<d-cite bibtex-key="barbaresi-2021-trafilatura"></d-cite>, which from visual inspection of the results provided good quality extraction when compared to other libraries.</p>
|
210 |
+
<aside>You can find a benchmark comparing several text extraction libraries <a href="https://github.com/scrapinghub/article-extraction-benchmark/blob/master/README.rst">here</a>.</aside>
|
211 |
<p>To validate this decision, we processed the 2019-18 dump
|
212 |
directly using the WET files and with text extracted from WARC files using trafilatura<d-footnote>We used trafilatura default options with <code>favour_precision=True</code>.</d-footnote>. We applied the same
|
213 |
processing to each one (our base filtering+minhash, detailed below) and trained two models. While the
|
|
|
221 |
<figure><img src="assets/images/wet_comparison.png"/></figure>
|
222 |
<div id="plot-wet_comparison"></div>
|
223 |
</div>
|
224 |
+
|
225 |
<h3>Base filtering</h3>
|
226 |
+
<p>Filtering is an important part of the curation process. It consists of
|
227 |
+
removing part of the data (which can mean removing words, lines, or even full documents) that lowers the performance of the model and is thus
|
228 |
+
deemed to be “lower quality” in our eval-driven process of dataset crafting.</p>
|
229 |
<p>As a basis for our filtering we used part of the setup
|
230 |
from RefinedWeb<d-cite bibtex-key="penedo2023refinedweb"></d-cite>. Namely, we:</p>
|
231 |
<ul>
|
|
|
244 |
</li>
|
245 |
</ul>
|
246 |
<p>After applying this filtering to each of the text
|
247 |
+
extracted dumps (there are currently 96 dumps) we obtained roughly 36 trillion tokens of data<d-footnote>As everywhere in this report: this is the number of tokens when tokenized with the <code>gpt2</code> tokenizer</d-footnote>.</p>
|
|
|
248 |
<h3>Deduplication</h3>
|
249 |
<p>Deduplication is one of the most important steps when creating large web datasets for LLM pretraining. Methods to deduplicate datasets attempt to identify and remove redundant/repeated data from the dataset. </p>
|
250 |
<h4>Why deduplicate?</h4>
|
251 |
<p>The web has many aggregators, mirrors, templated pages or
|
252 |
+
just otherwise repeated content spread over different domains and webpages. Sometimes, these duplicated pages
|
253 |
+
can even be introduced by the crawler itself, when different links point to the same page. </p>
|
254 |
+
<p>Removing these duplicates (deduplicating) has been correlated with improvements in model performance<d-cite bibtex-key="lee2022deduplicating"></d-cite> and a reduction in memorization of pretraining data<d-cite bibtex-key="carlini2023quantifying"></d-cite>, which might
|
255 |
+
allow for better generalization. Additionally, the performance uplift obtained through deduplication can be equated to an increased training
|
256 |
+
efficiency: by removing duplicated content, a model can reach the same performance level with fewer training iterations, or equivalently, for a given number of training tokens, a model will have seen more diverse data.<d-cite bibtex-key="muennighoff2023scaling"></d-cite><d-cite bibtex-key="hernandez2022scaling"></d-cite></p>
|
|
|
257 |
<p>There are different ways to identify and even define
|
258 |
duplicated data. Common approaches rely on hashing techniques to speed up the process, or on building
|
259 |
efficient data structures to index the data (like suffix arrays). Methods can also be “fuzzy”, by using some
|
260 |
similarity metric to mark documents as duplicates, or “exact” by checking for exact matches between two
|
261 |
+
documents (or lines, paragraphs, or whatever other granularity level is being used)<d-footnote>Note that here, even when we discuss "fuzzy" deduplication, we only employ methods that operate on character/word matches, i.e. surface-level text. A more complex concept is "semantic" deduplication: comparing/removing texts that cover the same concepts but use, for instance, synonyms or paraphrasing. We don't discuss these topics here, but note that they can be important, for instance, in the field of large-scale synthetic data generation (see our <a href="https://huggingface.co/blog/cosmopedia">Cosmopedia release</a> on this topic).</d-footnote>.</p>
|
262 |
+
|
263 |
<h4>Our deduplication parameters</h4>
|
264 |
+
<p>Following RefinedWeb<d-cite bibtex-key="penedo2023refinedweb"></d-cite>, we decided to apply MinHash, a
|
265 |
+
fuzzy hash based deduplication technique that scales efficiently to many CPU-nodes and allows us to tune similarity thresholds (by controlling the number and size of buckets) as well as the length of the sequences considered (by controlling the n-gram size). We chose to work on 5-grams<d-footnote>Our units are "words", computed in the <a href="https://github.com/huggingface/datatrove/blob/e9963f69f1fbab1a61339bd1b497f6e138b9f47f/src/datatrove/pipeline/dedup/minhash.py#L196">MinHash processing function</a> with a <a href="https://github.com/huggingface/datatrove/blob/e9963f69f1fbab1a61339bd1b497f6e138b9f47f/src/datatrove/utils/word_tokenizers.py#L323">language-specific word tokenizer</a>.</d-footnote> and compute minhashes using
|
266 |
112 hash functions in total, split into 14 buckets of 8 hashes each — targeting documents that are at least
|
267 |
75% similar. Documents with the same 8 minhashes in any bucket are considered a duplicate of each other.</p>
|
268 |
<p>This would mean that for two documents with a similarity ($$s$$)
|
|
|
278 |
allows for a steeper, more well defined cut off (documents with real similarity near the threshold are more likely to be correctly identified), we believe the compute and storage savings are a reasonable
|
279 |
trade off.</p>
|
280 |
<p>It should also be noted that intra-document deduplication is already handled by our repetition filter, which removes documents with many repeated lines and paragraphs.</p>
|
281 |
+
|
282 |
<h4>More deduplication is always better, right?</h4>
|
283 |
+
<p>We started the project with the assumption that <em>more deduplication is always better</em>, so our initial approach was to take the entire dataset (all
|
284 |
90+ dumps) and deduplicate them together as one big dataset using MinHash.</p>
|
285 |
<p>We did this in an iterative manner: starting with the most
|
286 |
+
recent dump (which at the time was 2023-50) and proceeding chronologically until we reached the oldest crawl. We deduplicated each dump
|
287 |
+
not only within itself, but also removing any document matching a document in any of the previously processed
|
288 |
dumps. </p>
|
289 |
<p>For instance, for the second most recent dump (2023-40 at
|
290 |
+
the time), we deduplicated it against the most recent one in addition to within itself. As a result, the older a dump was, the more dumps it was deduplicated against and the more data we removed from it (indeed, in the oldest dumps we removed more than 90% of the data in this deduplication step).</p>
|
|
|
|
|
291 |
<p>Deduplicating the dataset in this manner resulted in 4
|
292 |
+
trillion tokens of data, but, quite surprisingly to us, when training on a randomly sampled 350 billion
|
293 |
+
tokens subset, our ablation models showed no improvement over a model trained on the non deduplicated data, scoring far below its predecessor RefinedWeb on our aggregate of tasks (see graph below).</p>
|
|
|
294 |
<div class="main-plot-container">
|
295 |
<figure><img src="assets/images/dedup_all_dumps_bad.png"/></figure>
|
296 |
<div id="plot-all_dumps_bad"></div>
|
297 |
</div>
|
298 |
+
<p>This challenged our assumption that more deduplication was always better, so we decided to take a closer look at one of the oldest dumps, dump 2013-48:</p>
|
|
|
|
|
299 |
<ul>
|
300 |
<li>pre deduplication, this dump had ~490 billion tokens</li>
|
301 |
</ul>
|
|
|
322 |
<figure><img src="assets/images/removed_data_cross_dedup.png"/></figure>
|
323 |
<div id="plot-removed_data_dedup"></div>
|
324 |
</div>
|
325 |
+
<p>These results show that, for this older dump taken in isolation, the data that was kept (10% of the original data) was actually <em>worse</em> than the 90% of data we
|
326 |
+
removed<d-footnote>Note that these ablation models are trained only on data from this dump so it's considered independently of all the other dumps.</d-footnote>. This is also confirmed by visual inspection: <em>originally kept
|
|
|
327 |
data</em> contains far more ads, lists of keywords and generally badly formatted text than <em>originally removed data</em>.</p>
|
328 |
<h4>Taking a step back: individual dump dedup</h4>
|
329 |
+
<p>We decided to experiment with alternative approaches: we deduplicated
|
330 |
+
each dump with MinHash individually (independently of the other dumps). This resulted in 20 trillion
|
331 |
tokens of data.</p>
|
332 |
<p>When training on a random sample from this dataset we see
|
333 |
+
that it now matches RefinedWeb’s performance (see curves below):</p>
|
334 |
<div class="main-plot-container">
|
335 |
<figure><img src="assets/images/cross_ind_unfiltered_comparison.png"/></figure>
|
336 |
<div id="plot-ind_dedup_better"></div>
|
337 |
</div>
|
338 |
<p>We hypothesize that the main improvement gained from
|
339 |
deduplication is the removal of very large clusters that are present in every single dump (you will find
|
340 |
+
some examples of these clusters in the RefinedWeb paper, each containing <em>hundreds of thousands</em> of
|
341 |
documents) and that further deduplication for clusters with a low number of duplicates (less than ~100 i.e. the number
|
342 |
of dumps) actually harms performance: data that does not find a duplicate match in any other dump might
|
343 |
actually be worse quality/more out of distribution (as evidenced by the results on the 2013-48 data). </p>
|
|
|
348 |
improves, this effect may not be as prevalent, since the filtering might be able to remove some of this
|
349 |
lower quality data. We also experimented with applying different, and often “lighter”, deduplication
|
350 |
approaches on top of the individually deduplicated dumps. You can read about them further below.</p>
|
351 |
+
|
352 |
+
<h4>A note on measuring the effect of deduplication</h4>
|
353 |
<p>Given the nature of deduplication, its effect is not
|
354 |
always very visible in a smaller slice of the dataset (such as 28B tokens, the size we used for our
|
355 |
filtering ablations). Furthermore, one must consider the fact that there are specific effects at play when
|
|
|
362 |
</ul>
|
363 |
<ul>
|
364 |
<li>each dump has been perfectly individually deduplicated (every single
|
365 |
+
document in a given dump is unique in that dump)
|
366 |
</li>
|
367 |
</ul>
|
368 |
<ul>
|
|
|
397 |
documents duplicated up to 8 times. This simulation illustrates the inherent difficulties associated with
|
398 |
measuring deduplication impact on the training of LLMs, once the biggest duplicate clusters have been
|
399 |
removed.</p>
|
400 |
+
|
401 |
<h4>Other (failed) global approaches</h4>
|
402 |
+
<p>To build on top of our newly found method (independently deduplicating each dump), we attempted to further improve the performance by deduplicating the
|
403 |
+
independently MinHash deduped 20 trillion tokens of data (globally, over all dumps). We explored the following methods:</p>
|
404 |
<ul>
|
405 |
<li>URL deduplication, where we only kept one document per normalized
|
406 |
(lowercased) URL (71.5% of tokens removed, 5.6T left) — <em>FineWeb URL dedup</em></li>
|
|
|
430 |
<figure><img src="assets/images/dedup_attempts.png"/></figure>
|
431 |
<div id="plot-dedup_attempts"></div>
|
432 |
</div>
|
433 |
+
|
434 |
<h3>Additional filtering</h3>
|
435 |
+
<p>By this point we had reached the same performance as the previous work we attempted to reproduce and extend:
|
436 |
+
RefinedWeb, using our base filtering and independent MinHash. Still, on our aggregate of tasks, another heavily filtered dataset, the C4 dataset<d-cite bibtex-key="raffel2023exploring"></d-cite>, showed stronger performance on some benchmarks of our evaluation suite.</p>
|
|
|
437 |
<p>We therefore set out to find new filtering steps that
|
438 |
would, at first, allow us to match the performance of C4 and, at a second stage, surpass it. A natural starting point
|
439 |
was to look into the processing of C4 itself.</p>
|
|
|
454 |
<ul>
|
455 |
<li>applying “All filters” (drop lines not ending on punctuation marks,
|
456 |
mentioning javascript and cookie notices + drop documents outside length thresholds, containing “lorem
|
457 |
+
ipsum” or a curly bracket, <code>{</code>) allows us to match C4’s HellaSwag performance ("All filters" versus "C4" curves).
|
|
|
458 |
</li>
|
459 |
</ul>
|
460 |
<ul>
|
|
|
482 |
the next section.</p>
|
483 |
<h4>A statistical approach to develop heuristic filters</h4>
|
484 |
<p>To develop new heuristic filters and select their thresholds we devised a systematic process:</p>
|
485 |
+
<ol><li>we started by collecting a very large list of high level statistics of our datasets (over <strong>fifty</strong> different metrics) ranging from common document-level
|
486 |
+
metrics (e.g. number of lines, avg. line/word length, etc) to inter-document repetition metrics (inspired by MassiveText), on both a high quality and a lower quality web dataset;</li>
|
|
|
487 |
<li>we selected the metrics for which the Wasserstein distance between the two distributions (of the metric computed on each dataset) was larger;</li>
|
488 |
<li>we inspected the histograms of the two distributions and empirically chose a threshold that would make the lower quality dataset more closely resemble the higher quality one on this metric;</li>
|
489 |
<li>we validated the resulting filter (metric-threshold pair) by using it on a reference dataset and running small ablations.</li>
|
490 |
</ol>
|
491 |
+
<p>Due to our (new) assumption that global MinHash greatly upsamples lower quality data in the oldest dumps, we computed metrics on both the independently
|
492 |
MinHashed and the (worse quality) global MinHashed versions of the 2013-48 and 2015-22 crawls (two older crawls). We then compared the
|
493 |
statistics at a macro level, by looking at the distribution of these metrics for each one.</p>
|
494 |
<p>Perhaps not too surprisingly given our findings for deduplication, we found significant
|