|
<!doctype html> |
|
|
|
<head> |
|
<script src="https://distill.pub/template.v2.js"></script> |
|
<meta name="viewport" content="width=device-width, initial-scale=1"> |
|
    <meta charset="utf-8">
|
<title>FineWeb: 15T tokens of high quality web data</title> |
|
</head> |
|
|
|
<body> |
|
<d-front-matter> |
|
<script id='distill-front-matter' type="text/json">{ |
|
"title": "FineWeb: 15T tokens of high quality web data", |
|
"description": "This blog covers the FineWeb recipe, why more deduplication is not always better and some interesting findings on the difference in quality of CommonCrawl dumps.", |
|
"published": "May 28, 2024", |
|
"authors": [ |
|
{ |
|
"author":"Guilherme Penedo", |
|
"authorURL":"https://huggingface.co/guipenedo", |
|
"affiliations": [{"name": "HuggingFace"}] |
|
}, |
|
{ |
|
"author":"Hynek Kydlíček", |
|
"authorURL":"https://huggingface.co/hynky" |
|
}, |
|
{ |
|
"author":"Leandro Werra", |
|
"authorURL":"https://huggingface.co/lvwerra" |
|
}, |
|
{ |
|
"author":"Thomas Wolf", |
|
"authorURL":"https://huggingface.co/thomwolf" |
|
} |
|
], |
|
"katex": { |
|
"delimiters": [ |
|
{"left": "$$", "right": "$$", "display": false} |
|
] |
|
} |
|
} |
|
</script> |
|
</d-front-matter> |
|
<d-title> |
|
<figure style="grid-column: page; mix-blend-mode: multiply;"> |
|
<img src="banner.png" alt="FineWeb"> |
|
</figure> |
|
|
|
|
|
|
|
</d-title> |
|
<d-byline></d-byline> |
|
<d-article> |
|
    <p>We have recently released 🍷FineWeb, our new large-scale dataset (15T tokens, 44TB of disk space) of clean text sourced from the web for LLM pretraining. You can download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">here</a>.</p>
|
<p>As 🍷FineWeb has gathered a lot of interest from the |
|
community, we decided to further explain the steps involved in creating it, our processing decisions and |
|
some lessons learned along the way. Read on for all the juicy details on large text dataset creation!</p> |
|
<p><strong>TLDR:</strong> This blog covers the FineWeb |
|
recipe, why more deduplication is not always better and some interesting findings on the difference in |
|
quality of CommonCrawl dumps.</p> |
|
<hr/> |
|
<h1>Preamble</h1> |
|
<h2>Sourcing the data</h2> |
|
<p>A common question we see asked regarding web datasets used |
|
to train LLMs is “where do they even get all that data?” There are generally two options:</p> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc">you either crawl it yourself, like <a |
|
href="https://platform.openai.com/docs/gptbot">OpenAI</a> or <a |
|
href="https://darkvisitors.com/agents/claudebot">Anthropic</a> seem to do |
|
</li> |
|
</ul> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc">you use a public repository of crawled webpages, like the one maintained by |
|
the non-profit <a href="https://commoncrawl.org/">CommonCrawl</a></li> |
|
</ul> |
|
<p>For FineWeb, similarly to what was done for a large number |
|
of other public datasets, we used <a href="https://commoncrawl.org/">CommonCrawl</a> as a starting point. |
|
They have been crawling the web since 2007 (long before LLMs were a thing) and release a new dump usually |
|
every 1 or 2 months, which can be freely downloaded. </p> |
|
<p>As an example, their latest crawl (2024-10) contains 3.16 |
|
billion web pages, totaling 424.7 TiB of uncompressed content (the size changes from dump to dump). There |
|
are 95 dumps since 2013 and 3 dumps from 2008 to 2012, which are in a different (older) format. </p> |
|
<h2>Processing at scale</h2> |
|
<p>Given the sheer size of the data involved, one of the main |
|
challenges we had to overcome was having a modular, scalable codebase that would allow us to quickly iterate |
|
on our processing decisions and easily try out new ideas, while appropriately parallelizing our workloads |
|
and providing clear insights into the data. </p> |
|
<p>For this purpose, we developed <a |
|
href="https://github.com/huggingface/datatrove"><code>datatrove</code></a>, an open-source data |
|
processing library that allowed us to seamlessly scale our filtering and deduplication setup to thousands of |
|
CPU cores. All of the data processing steps involved in the creation of FineWeb used this <a |
|
href="https://github.com/huggingface/datatrove">library</a>.</p> |
|
<h2>What is clean, good data?</h2> |
|
<p>This is probably the main question to keep in mind when |
|
creating a dataset. A good first lesson is that data that would intuitively be considered high quality by a |
|
human may not be necessarily the best data (or at least not all that you need) to train a good model on.</p> |
|
    <p>It is still common to train a model on a given corpus (Wikipedia, or some other web dataset considered clean) and use it to check the perplexity on the dataset we are trying to curate. Unfortunately this does not always correlate with performance on downstream tasks, so another often-used approach is to train small models (small because training models is expensive and time consuming, and we want to be able to iterate quickly) on our dataset and evaluate them on a set of evaluation tasks. As we are curating a dataset for pretraining a generalist LLM, it is important to choose a diverse set of tasks and try not to overfit to any one individual benchmark.</p>
|
<p>Another way to evaluate different datasets would be to |
|
train a model on each one and have humans rate and compare the outputs of each one (like on the <a |
|
href="https://chat.lmsys.org/">LMSYS Chatbot Arena</a>). This would arguably provide the most |
|
reliable results in terms of representing real model usage, but getting ablation results this way is too |
|
expensive and slow.</p> |
|
<p>The approach we ultimately went with was to train small |
|
models and evaluate them on a set of benchmark tasks. We believe this is a reasonable proxy for the quality |
|
of the data used to train these models.</p> |
|
<h3>Ablations and evaluation setup</h3> |
|
    <p>To compare the impact of a given processing step, we would train 2 models: one where the data included the extra step and another where this step was ablated (cut/removed). These 2 models would have the same number of parameters and the same architecture, and would be trained on an equal number of tokens with the same hyperparameters; the only difference would be the training data. We would then evaluate each model on the same set of tasks and compare the average scores.</p>
|
    <p>Our ablation models were trained using <a href="https://github.com/huggingface/nanotron"><code>nanotron</code></a> with this config [<strong>TODO: INSERT SIMPLIFIED NANOTRON CONFIG HERE</strong>]. The models had 1.82B parameters and used the Llama architecture with a 2048 sequence length and a global batch size of ~2 million tokens. For the filtering ablations we mostly trained on ~28B tokens (roughly the Chinchilla-optimal training size for this model size).</p>
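    <p>In the meantime, here is a rough summary of that setup as a Python snippet (the field names below are illustrative shorthand, not nanotron's actual config schema):</p>
    <pre><code class="language-python">
# Illustrative summary of the ablation training setup described above.
# Field names are shorthand for this post, not nanotron's real config keys.
ablation_setup = {
    "model": {
        "architecture": "llama",        # Llama architecture
        "n_parameters": 1_820_000_000,  # 1.82B parameters
        "sequence_length": 2048,
    },
    "training": {
        "global_batch_size_tokens": 2_000_000,   # ~2M tokens per step
        "total_tokens": 28_000_000_000,          # ~28B tokens for filtering ablations
    },
}
    </code></pre>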
|
<p>We evaluated the models using <a |
|
href="https://github.com/huggingface/lighteval/"><code>lighteval</code></a>. We tried selecting |
|
benchmarks that would provide good signal at a relatively small scale (small models trained on only a few |
|
billion tokens). Furthermore, we also used the following criteria when selecting benchmarks:</p> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc">small variance between runs trained on different samplings of the same |
|
dataset: we want our runs on a subset of the data to be representative of the whole dataset, and the |
|
resulting scores to have as little noise as possible |
|
</li> |
|
</ul> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc">performance increasing monotonically (or close) over a training run: |
|
ideally, as the number of seen tokens increases, the performance on this benchmark should not decrease |
|
(should not be too noisy) |
|
</li> |
|
</ul> |
|
    <p>You can find the full list of tasks and prompts we used <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/lighteval_tasks.py">here</a>. To get results quickly, we capped longer benchmarks at 1000 samples (wall-clock evaluation takes less than 5 minutes on a single node of 8 GPUs, run in parallel to the training).</p>
|
<hr /> |
|
<h1>The FineWeb recipe</h1> |
|
<p>In the next subsections we will explain each of the steps |
|
taken to produce the FineWeb dataset. You can find a full reproducible <code>datatrove</code> config <a |
|
href="https://github.com/huggingface/datatrove/blob/main/examples/fineweb.py">here</a>.</p> |
|
<style> |
|
.neighborhood-figure-container {grid-column: screen; width: 100%; margin: auto; margin-top: 30px; margin-bottom: 30px; padding-top: 20px; padding-bottom: 10px; border-bottom: 1px solid #EEE; border-top: 1px solid #EEE;} |
|
</style> |
|
<div class="neighborhood-figure-container"> |
|
<figure class="image"> |
|
<img style="width:708px" src="plots/fineweb-recipe.png"/> |
|
</figure> |
|
</div> |
|
<h2>Starting point: text extraction</h2> |
|
<p>CommonCrawl data is available in two main formats: WARC |
|
and WET. <strong>WARC </strong>(Web ARChive format) files contain the raw data from the crawl, including the |
|
full page HTML and request metadata. <strong>WET</strong> (WARC Encapsulated Text) files provide a text only |
|
version of those websites.</p> |
|
    <p>A large number of datasets take the WET files as their starting point. In our experience, the default text extraction (extracting the main text of a webpage from its HTML) used to create these WET files is suboptimal, and a variety of open-source libraries provide better text extraction (namely, by keeping less boilerplate content and fewer navigation menus). We extracted the text content from the WARC files using the <a href="https://trafilatura.readthedocs.io/en/latest/">trafilatura</a> library. It is important to note, however, that text extraction is one of the most costly steps of our processing, so we believe that using the readily available WET data could be a reasonable trade-off for lower-budget teams.</p>
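    <p>As an illustration, here is a minimal sketch of this extraction step using trafilatura directly (our actual runs go through datatrove over WARC records; the file name and the <code>favor_precision</code> choice below are illustrative, not the exact settings we used):</p>
    <pre><code class="language-python">
# Minimal sketch: extract the main text of a page with trafilatura.
import trafilatura

# `html` would be the raw page content of a single WARC record (hypothetical file).
html = open("example_page.html").read()

# favor_precision drops more boilerplate at the risk of losing some main text;
# the exact settings we used are in the datatrove config linked above.
text = trafilatura.extract(html, favor_precision=True)
print(text)  # main article text, or None if extraction failed
    </code></pre>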
|
    <p>To validate this decision, we processed the 2019-18 dump in two ways: using the WET files directly, and using text extracted from the WARC files with trafilatura. We applied the same processing to each one (our base filtering + MinHash, detailed below) and trained two models. While the resulting dataset is considerably larger for the WET data (around 254BT), it proves to be of much worse quality than the one extracted from the WARC files with trafilatura (around 200BT). Many of these additional tokens in the WET files are unnecessary page boilerplate.</p>
|
<figure class="image"><a href="plots/wet_comparison.png"><img |
|
style="width:640px" src="plots/wet_comparison.png"/></a></figure> |
|
|
|
<h2>Base filtering</h2> |
|
<p>Filtering is an important part of the curation process. It |
|
removes part of the data (be it words, lines, or full documents) that would harm performance and is thus |
|
deemed to be “lower quality”.</p> |
|
<p>As a basis for our filtering we used part of the setup |
|
from <a href="https://arxiv.org/abs/2306.01116">RefinedWeb</a>. Namely, we:</p> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc">Applied URL filtering using a <a |
|
href="https://dsi.ut-capitole.fr/blacklists/">blocklist</a> to remove adult content |
|
</li> |
|
</ul> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc">Applied a <a |
|
href="https://fasttext.cc/docs/en/language-identification.html">fastText language classifier</a> to |
|
keep only English text with a score ≥ 0.65 |
|
</li> |
|
</ul> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc">Applied quality and repetition filters from the <a |
|
href="https://arxiv.org/abs/2112.11446">Gopher</a> paper (using the default thresholds) |
|
</li> |
|
</ul> |
|
<p>After applying this filtering to each of the text |
|
extracted dumps (there are currently 95 dumps) we obtained roughly 36 trillion tokens of data (when |
|
tokenized with the <code>gpt2</code> tokenizer).</p> |
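    <p>To illustrate the language identification part of this base filtering, here is a minimal sketch using fastText's public <code>lid.176.bin</code> model with the threshold above (the URL blocklist and Gopher filters are omitted):</p>
    <pre><code class="language-python">
# Minimal sketch of the language filtering step: keep documents classified
# as English with a score of at least 0.65 by fastText's language identifier.
import fasttext

# lid.176.bin is fastText's publicly available language identification model.
model = fasttext.load_model("lid.176.bin")

def keep_document(text: str, threshold: float = 0.65) -> bool:
    # fastText expects a single line of text, so newlines are stripped first
    labels, scores = model.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__en" and scores[0] >= threshold
    </code></pre>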
|
<h2>Deduplication</h2> |
|
<p>Deduplication is another important step, specially for web |
|
datasets. Methods to deduplicate datasets attempt to remove redundant/repeated data. Deduplication is one of |
|
the most important steps when creating large web datasets for LLMs.</p> |
|
<h3>Why deduplicate?</h3> |
|
<p>The web has many aggregators, mirrors, templated pages or |
|
just otherwise repeated content spread over different domains and webpages. Often, these duplicated pages |
|
can be introduced by the crawler itself, when different links point to the same page. </p> |
|
<p>Removing these duplicates (deduplicating) has been <a |
|
href="https://arxiv.org/abs/2107.06499">linked to an improvement in model performance</a> and a <a |
|
href="https://arxiv.org/abs/2202.07646">reduction in memorization of pretraining data</a>, which might |
|
allow for better generalization. Additionally, the performance uplift can also be tied to increased training |
|
efficiency: by removing duplicated content, for the same number of training tokens, a model will have seen |
|
more diverse data.</p> |
|
    <p>There are different ways to identify and even define duplicated data. Common approaches rely on hashing techniques to speed up the process, or on building efficient data structures to index the data (like suffix arrays). Methods can also be “fuzzy”, by using some similarity metric to mark documents as duplicates, or “exact”, by checking for exact matches between two documents (or lines, paragraphs, or whatever other granularity level is being used).</p>
|
<h3>Our deduplication parameters</h3> |
|
    <p>Similarly to RefinedWeb, we decided to apply MinHash, a fuzzy hash-based deduplication technique. We chose to compute minhashes on each document’s 5-grams, using 112 hash functions in total, split into 14 buckets of 8 hashes each, targeting documents that are at least 75% similar. Documents with the same 8 minhashes in any bucket are considered duplicates of each other.</p>
|
<p>This would mean that for two documents with a similarity (<code>s</code>) |
|
of 0.7, 0.75, 0.8 and 0.85, the probability that they would be identified as duplicates would be 56%, 77%, |
|
92% and 98.8% respectively (<code>1-(1-s^8)^14</code>). See the plot below for a match probability |
|
comparison between our setup with 112 hashes and the one from RefinedWeb, with 9000 hashes, divided into 450 |
|
buckets of 20 hashes (that requires a substantially larger amount of compute resources):</p> |
|
<figure class="image"><a |
|
href="plots/minhash_parameters_comparison.png"><img style="width:567px" |
|
src="plots/minhash_parameters_comparison.png"/></a> |
|
</figure> |
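    <p>The match probabilities above can be checked directly from the formula with a small standalone snippet:</p>
    <pre><code class="language-python">
# Probability that two documents with 5-gram similarity s share all hashes in
# at least one bucket, for b buckets of r hashes each: 1 - (1 - s^r)^b.
def match_probability(s: float, r: int, b: int) -> float:
    return 1 - (1 - s**r) ** b

for s in (0.7, 0.75, 0.8, 0.85):
    ours = match_probability(s, r=8, b=14)          # our setup: 112 hashes
    refinedweb = match_probability(s, r=20, b=450)  # RefinedWeb: 9000 hashes
    print(f"s={s:.2f}  fineweb={ours:.3f}  refinedweb={refinedweb:.3f}")
    </code></pre>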
|
    <p>While the high number of hash functions in RefinedWeb allows for a steeper, better-defined cutoff, we believe the compute and storage savings are a reasonable trade-off.</p>
|
<h3>More deduplication is always better, right?</h3> |
|
<p>Our initial approach was to take the entire dataset (all |
|
95 dumps) and deduplicate them as one big dataset using MinHash.</p> |
|
<p>We did this in an iterative manner: starting with the most |
|
recent dump (which at the time was 2023-50) and taking the oldest one last, we would deduplicate each dump |
|
not only against itself but also by removing any matches with duplicates from the previously processed |
|
dumps. </p> |
|
<p>For instance, for the second most recent dump (2023-40 at |
|
the time), we deduplicated it against the most recent one in addition to itself. In particular, the oldest |
|
dump was deduplicated against all other dumps. As a result, more data was removed in the oldest dumps (last |
|
to be deduplicated) than in the most recent ones.</p> |
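    <p>Conceptually, the iterative procedure looks like the following sketch, which uses plain document hashes purely for illustration (the actual runs rely on datatrove's MinHash-based pipeline, not exact hashing):</p>
    <pre><code class="language-python">
# Conceptual sketch of the iterative cross-dump deduplication order: each dump
# is deduplicated against itself and against everything already processed,
# starting from the most recent dump and ending with the oldest one.
def iterative_dedup(dumps_newest_first: list[list[str]]) -> list[list[str]]:
    seen: set[int] = set()
    kept_per_dump = []
    for dump in dumps_newest_first:
        kept = []
        for doc in dump:
            h = hash(doc)  # stand-in for a MinHash signature/bucket match
            if h not in seen:
                seen.add(h)
                kept.append(doc)
        kept_per_dump.append(kept)
    return kept_per_dump
    </code></pre>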
|
    <p>Deduplicating the dataset in this manner resulted in 4 trillion tokens of data but, quite surprisingly to us, when training on a randomly sampled 350 billion token subset, the model showed no improvement over one trained on the non-deduplicated data (see the orange and green curves below), scoring far below RefinedWeb on our aggregate of tasks.</p>
|
<figure class="image"><a href="plots/dedup_all_dumps_bad.png"><img |
|
style="width:576px" src="plots/dedup_all_dumps_bad.png"/></a></figure> |
|
<p>This was quite puzzling as our intuition regarding web |
|
data was that more deduplication would always result in improved performance. We decided to take a closer |
|
look at one of the oldest dumps, dump 2013-48:</p> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc">pre deduplication, this dump had ~490 billion tokens</li> |
|
</ul> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc">after our iterative MinHash, ~31 billion tokens remained (94% of data |
|
removed) |
|
</li> |
|
</ul> |
|
<p>As an experiment, we tried training two models on 28BT |
|
sampled from the following data from 2013-48:</p> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc">the fully deduplicated remaining ~31 billion tokens (<em>originally kept |
|
data</em>) |
|
</li> |
|
</ul> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc">171 billion tokens obtained by individually deduplicating (without |
|
considering the other dumps) the ~460 billion tokens that had been removed from this dump in the |
|
iterative dedup process (<em>originally removed data</em>) |
|
</li> |
|
</ul> |
|
<figure class="image"><a |
|
href="plots/removed_data_cross_dedup.png"><img style="width:576px" |
|
src="plots/removed_data_cross_dedup.png"/></a></figure> |
|
<p>These results show that, for this older dump where we were |
|
removing over 90% of the original data, the data that was kept was actually <em>worse</em> than the data |
|
removed (considered independently from all the other dumps).</p> |
|
<h3>Taking a step back: individual dump dedup</h3> |
|
<p>We then tried an alternative approach: we deduplicated |
|
each dump with MinHash individually (without considering the other dumps). This resulted in 20 trillion |
|
tokens of data.</p> |
|
<p>When training on a random sample from this dataset we see |
|
that it now matches RefinedWeb’s performance (blue and red curves below):</p> |
|
<figure class="image"><a |
|
href="plots/cross_ind_unfiltered_comparison.png"><img style="width:576px" |
|
src="plots/cross_ind_unfiltered_comparison.png"/></a> |
|
</figure> |
|
    <p>We hypothesize that the main improvement gained from deduplication is the removal of very large clusters that are present in every single dump (you will find some examples of these clusters in the RefinedWeb paper, each containing <em>hundreds of thousands</em> of documents) and that further deduplication of documents with a low number of duplicates (fewer than ~100, i.e. the number of dumps) actually harms performance: data that does not find a duplicate match in any other dump might actually be of worse quality/more out of distribution (as evidenced by the results on the 2013-48 data). </p>
|
    <p>While you might see some performance improvement when deduplicating a few dumps together, at the scale of the entire set of dumps this side effect of upsampling lower quality data seems to have a greater impact.</p>
|
<p>One possibility to consider is that as filtering quality |
|
improves, this effect may not be as prevalent, since the filtering might be able to remove some of this |
|
lower quality data. We also experimented with applying different, and often “lighter”, deduplication |
|
approaches on top of the individually deduplicated dumps. You can read about them further below.</p> |
|
<h3>A note on measuring the effect of deduplication</h3> |
|
<p>Given the nature of deduplication, its effect is not |
|
always very visible in a smaller slice of the dataset (such as 28B tokens, the size we used for our |
|
filtering ablations). Furthermore, one must consider the fact that there are specific effects at play when |
|
deduplicating across all CommonCrawl dumps, as some URLs/pages are recrawled from one dump to the next.</p> |
|
<p>To visualize the effect of scaling the number of training |
|
tokens on measuring deduplication impact, we considered the following (very extreme and unrealistic |
|
regarding the degree of duplication observed) theoretical scenario:</p> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc">there are 100 CommonCrawl dumps (actually roughly true)</li> |
|
</ul> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc">each dump has been perfectly individually deduplicated (every single |
|
document in it is unique) |
|
</li> |
|
</ul> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc">each dump is a perfect copy of each other (maximum possible duplication |
|
across dumps, effectively the worst case scenario) |
|
</li> |
|
</ul> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc">each dump has 200 billion tokens (for a total of 20 trillion, the resulting |
|
size of our individual dedup above) |
|
</li> |
|
</ul> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc">each dump is made up of documents of 1k tokens (200M documents per dump) |
|
</li> |
|
</ul> |
|
<p>We then simulated uniformly sampling documents from this |
|
entire dataset of 20 trillion tokens, to obtain subsets of 1B, 10B, 100B, 350B and 1T tokens. In the image |
|
below you can see how often each document would be repeated.</p> |
|
<figure class="image"><a href="plots/dedup_impact_simulation.png"><img |
|
style="width:708px" src="plots/dedup_impact_simulation.png"/></a></figure> |
|
    <p>For 1B almost all documents would be unique (#duplicates=1), despite the fact that in the entire dataset each document is repeated 100 times (once per dump). We start seeing some changes at the 100B scale (0.5% of the total dataset), with a large number of documents being repeated twice, and a few even 4-8 times. At the larger scale of 1T (5% of the total dataset), the majority of the documents are repeated up to 8 times, with some being repeated up to 16 times. </p>
|
<p>We ran our performance evaluations for the deduplicated |
|
data at the 350B scale, which would, under this theoretical scenario, be made up of a significant portion of |
|
documents duplicated up to 8 times. This simulation illustrates the inherent difficulties associated with |
|
measuring deduplication impact on the training of LLMs, once the biggest document clusters have been |
|
removed.</p> |
|
<h3>Other (failed) approaches</h3> |
|
    <p>We attempted to improve the performance of the independently MinHash-deduplicated 20 trillion tokens of data by further deduplicating it with the following methods:</p>
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc">URL deduplication, where we only kept one document per normalized |
|
(lowercased) URL (71.5% of tokens removed, 5.6T left) — <em>FineWeb URL dedup</em></li> |
|
</ul> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc">Line deduplication: |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:circle">remove all but 1 occurrence of each duplicated line (77.8% of |
|
tokens dropped, 4.4T left) — <em>FineWeb line dedup</em></li> |
|
</ul> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:circle">same as above, but only removing duplicate lines with at least 10 |
|
words and dropping documents with fewer than 3 sentences after deduplication (85% of tokens |
|
dropped, 2.9T left) — <em>FineWeb line dedup w/ min words</em></li> |
|
</ul> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:circle">remove all but 1 occurrence of each span of 3 duplicated lines |
|
with all numbers replaced by 0 (80.9% of tokens removed, 3.7T left) — <em>FineWeb 3-line |
|
dedup</em></li> |
|
</ul> |
|
</li> |
|
</ul> |
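    <p>For concreteness, here is a rough sketch of the line deduplication variants (exact line-level dedup with a minimum word count; the sentence counting below is a heavy simplification of what the real pipeline does):</p>
    <pre><code class="language-python">
# Rough sketch of corpus-wide line deduplication: keep only the first occurrence
# of each duplicated line with at least `min_words` words, and drop documents
# left with fewer than `min_sentences` sentences afterwards.
def line_dedup(documents: list[str], min_words: int = 10, min_sentences: int = 3) -> list[str]:
    seen_lines: set[str] = set()
    kept_docs = []
    for doc in documents:
        kept_lines = []
        for line in doc.splitlines():
            if len(line.split()) >= min_words:
                if line in seen_lines:
                    continue  # duplicate long line: drop it
                seen_lines.add(line)
            kept_lines.append(line)
        new_doc = "\n".join(kept_lines)
        # crude sentence count as a proxy; the real pipeline uses a proper splitter
        if new_doc.count(".") + new_doc.count("!") + new_doc.count("?") >= min_sentences:
            kept_docs.append(new_doc)
    return kept_docs
    </code></pre>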
|
<p>The performance of the models trained on each of these was |
|
consistently worse (even if to different degrees) than that of the original independently deduplicated |
|
data:</p> |
|
<figure class="image"><a href="plots/Untitled.png"><img |
|
style="width:708px" src="plots/Untitled.png"/></a></figure> |
|
<h2>Additional filtering</h2> |
|
    <p>By this point we had reached the same performance as RefinedWeb, but another heavily filtered dataset, <a href="https://arxiv.org/abs/1910.10683">the C4 dataset</a>, still showed stronger performance on our aggregate of tasks (with the caveat that it is a relatively small dataset by current web-scale standards).</p>
|
<p>We therefore set out to find new filtering steps that |
|
would, at first, allow us to match the performance of C4 and eventually surpass it. A natural starting point |
|
was to look into the processing of C4 itself.</p> |
|
<h3>C4: A dataset that has stood the test of time</h3> |
|
    <p>The <a href="https://huggingface.co/datasets/c4">C4 dataset</a> was first released in 2019. It was obtained from the <code>2019-18</code> CommonCrawl dump by removing non-English data, applying heuristic filters at both the line and the document level, deduplicating at the line level, and removing documents containing words from a word blocklist.</p>
|
    <p>Despite its age and limited size (around 175B gpt2 tokens), models trained on this dataset have strong performance, excelling in particular on the HellaSwag benchmark, the benchmark in our “early signal” group with the strongest signal and highest signal-to-noise ratio. As such, it has remained a common subset of typical LLM training data, used for instance in <a href="https://arxiv.org/abs/2302.13971">the relatively recent Llama1 model</a>. We experimented with applying each of the different filters used in C4 to a baseline of the independently deduplicated FineWeb 2019-18 dump (plot smoothed with a 3-checkpoint sliding window):</p>
|
<figure class="image"><a href="plots/c4_filters.png"><img |
|
style="width:708px" src="plots/c4_filters.png"/></a></figure> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc">applying “All filters” (drop lines not ending on punctuation marks, |
|
mentioning javascript and cookie notices + drop documents outside length thresholds, containing “lorem |
|
ipsum” or a curly bracket, <code>{</code>) allows us to match C4’s HellaSwag performance (purple versus |
|
pink curves). |
|
</li> |
|
</ul> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc">The curly bracket filter, and the word lengths filter only give a small |
|
boost, removing 2.8% and 4.3% of tokens, respectively |
|
</li> |
|
</ul> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc">The terminal punctuation filter, by itself, gives the biggest individual |
|
boost, but removes <em>around 30%</em> of all tokens (!) |
|
</li> |
|
</ul> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc">The lorem_ipsum, javascript and policy rules each remove <0.5% of |
|
training tokens, so we did not train on them individually |
|
</li> |
|
</ul> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc">All filters except the very destructive terminal_punct perform better than |
|
terminal_punct by itself, while removing less in total (~7%) |
|
</li> |
|
</ul> |
|
<p>We decided to apply all C4 filters mentioned above except |
|
the terminal punctuation one. We validated these results with a longer run, which you will find in a plot in |
|
the next section.</p> |
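    <p>As a rough illustration, here is a sketch of the C4-style filters we kept (thresholds and phrase lists are simplified approximations for this post, not the exact C4 rules; the terminal punctuation filter is intentionally omitted, as discussed above):</p>
    <pre><code class="language-python">
# Simplified sketch of the C4-style filtering we applied. Returns the cleaned
# text, or None if the whole document should be dropped.
POLICY_PHRASES = ("terms of use", "privacy policy", "cookie policy", "uses cookies")

def c4_like_filter(text: str, min_words: int = 50, max_words: int = 100_000) -> str | None:
    kept_lines = []
    for line in text.splitlines():
        lowered = line.lower()
        if "javascript" in lowered:  # drop lines mentioning javascript
            continue
        if any(p in lowered for p in POLICY_PHRASES):  # drop cookie/policy notices
            continue
        kept_lines.append(line)
    cleaned = "\n".join(kept_lines)
    n_words = len(cleaned.split())
    if not (min_words <= n_words <= max_words):  # document length thresholds
        return None
    if "lorem ipsum" in cleaned.lower() or "{" in cleaned:  # boilerplate / code
        return None
    return cleaned
    </code></pre>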
|
<h3>A statistical approach to develop heuristic filters</h3> |
|
    <p>To come up with new possible filtering rules, we collected a very large number of statistical metrics — over <strong>50</strong> — from different reference datasets (C4, RefinedWeb, etc.) and from a select list of our processed dumps, for both the independently minhashed version and the result of the (worse quality) full dedup. This allowed us to compare the different datasets at a macro level, by looking at the distribution of these metrics for each one.</p>
|
    <p>The collected statistics ranged from common document-level metrics (e.g. number of lines, avg. line/word length, etc.) to inter-document repetition metrics (Gopher-inspired). Perhaps not too surprisingly given our findings on deduplication, we found significant disparities in most of the metrics between the two deduplication methods. For instance, the <code>line-char-duplicates</code> metric (nb. of characters in duplicated lines / nb. of characters) roughly doubled from the independent dedup (0.0053 for 2015-22 and 0.0058 for 2013-48) to the full dedup (0.011 for 2015-22 and 0.01 for 2013-48), indicating that the latter had higher inter-document repetition.</p>
|
<p>Working under the assumption that these differences were |
|
caused by lower quality data on the full dedup version, we inspected histograms and manually defined |
|
thresholds for the metrics where these differences were starker. This process yielded 17 candidate |
|
threshold-filter pairs. In the image below, you can see 3 of these histograms.</p> |
|
<figure class="image"><a href="plots/Untitled%201.png"><img |
|
style="width:790px" src="plots/Untitled%201.png"/></a></figure> |
|
|
|
    <p>To assess the effectiveness of these newly created filters, we conducted <strong>28B token</strong> ablation runs on the <strong>2019-18 crawl</strong>. Out of all those runs, we identified three filters (the ones based on the histograms above) that demonstrated the most significant improvements on the aggregate score:</p>
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc">Remove documents where the fraction of lines ending with punctuation ≤ 0.12 |
|
(10.14% of tokens removed) — vs the 30% from the original C4 terminal punct filter |
|
</li> |
|
</ul> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc">Remove documents where the fraction of characters in duplicated lines ≥ 0.1 |
|
(12.47% of tokens removed) — the original Gopher threshold for this ratio is ≥ 0.2 |
|
</li> |
|
</ul> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc">Remove documents where the fraction of lines shorter than 30 characters ≥ |
|
0.67 (3.73% of tokens removed) |
|
</li> |
|
</ul> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc">When applying the 3 together, ~22% of tokens were removed</li> |
|
</ul> |
|
<figure class="image"><a href="plots/Untitled%202.png"><img |
|
style="width:708px" src="plots/Untitled%202.png"/></a></figure> |
|
<hr /> |
|
<h1>The final dataset</h1> |
|
<p>The final FineWeb dataset comprises 15T tokens and |
|
includes the following previously mentioned steps, in order, each providing a performance boost on our group |
|
of benchmark tasks:</p> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc">base filtering</li> |
|
</ul> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc">independent MinHash deduplication per dump</li> |
|
</ul> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc">a selection of C4 filters</li> |
|
</ul> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc">our custom filters (mentioned in the previous section)</li> |
|
</ul> |
|
<figure class="image"><a href="plots/fineweb_all_filters.png"><img |
|
style="width:708px" src="plots/fineweb_all_filters.png"/></a></figure> |
|
<p>We compared 🍷 FineWeb with the following datasets:</p> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc"><a |
|
href="https://huggingface.co/datasets/tiiuae/falcon-refinedweb">RefinedWeb</a> |
|
</li> |
|
</ul> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc"><a href="https://huggingface.co/datasets/allenai/c4">C4</a></li> |
|
</ul> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc"><a href="https://huggingface.co/datasets/allenai/dolma">Dolma v1.6</a> (the |
|
CommonCrawl part) |
|
</li> |
|
</ul> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc"><a href="https://huggingface.co/datasets/EleutherAI/pile">The Pile</a></li> |
|
</ul> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc"><a |
|
href="https://huggingface.co/datasets/cerebras/SlimPajama-627B">SlimPajama</a> |
|
</li> |
|
</ul> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc"><a |
|
href="https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2">RedPajama2</a> |
|
(deduplicated) |
|
</li> |
|
</ul> |
|
<p>You will find these models on <a |
|
href="https://huggingface.co/collections/HuggingFaceFW/ablation-models-662457b0d213e8c14fe47f32">this |
|
collection</a>. We have uploaded checkpoints at every 1000 training steps. You will also find our full <a |
|
href="https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/eval_results.csv">evaluation |
|
results here</a>.</p> |
|
<figure class="image"><a href="plots/fineweb_ablations.png"><img |
|
style="width:708px" src="plots/fineweb_ablations.png"/></a></figure> |
|
<p>Some histogram comparisons of C4, Dolma, RefinedWeb and |
|
FineWeb:</p> |
|
<figure class="image"><a href="plots/Untitled%203.png"><img |
|
style="width:4587px" src="plots/Untitled%203.png"/></a></figure> |
|
<hr /> |
|
<h1>Just like fine wine, not all crawls are created |
|
equal</h1> |
|
    <p>During our ablation runs, we observed that certain crawls outperformed others by a significant margin. To investigate this phenomenon, we conducted 27B token runs for each dump (we used the version with base filtering + individual dedup), with 2 trainings per dump, each using a different data subset. We trained 190 such models, totaling over 60k H100 GPU-hours. We subsequently took the last 3 checkpoints for both seeds and plotted the average of these 6 data points per dump. </p>
|
<p>The plot below clearly shows that some dumps perform far |
|
worse than others. Each year has a different color, and the number of crawls per year also changes.</p> |
|
<figure class="image"><a href="plots/score_by_dump.png"><img |
|
style="width:708px" src="plots/score_by_dump.png"/></a></figure> |
|
<p>We identified 5 main relevant time intervals:</p> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc">2013 to 2016: relatively stable, average quality</li> |
|
</ul> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc">2017 to 2018: high quality, with a drop by the end of 2018</li> |
|
</ul> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc">2019 to 2021: high quality, steadily increase</li> |
|
</ul> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc">2021-49 and 2022: very large drop in performance, followed by worse quality |
|
dumps |
|
</li> |
|
</ul> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc">2023 and 2024-10: almost exponential improvement. In particular, 2023-50 |
|
and 2024-10 are by far the best dumps |
|
</li> |
|
</ul> |
|
    <p>One possibility to improve performance when training models on fewer than 15T tokens would be to train on FineWeb while excluding the worst-quality CommonCrawl dumps.</p>
|
<p>We conducted further analysis to investigate the factors |
|
causing these differences from dump to dump. In particular, we considered 3 potential causes: </p> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc">large sudden changes in the list of crawled URLs;</li> |
|
</ul> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc">synthetic (LLM generated) data;</li> |
|
</ul> |
|
<ul class="bulleted-list"> |
|
<li style="list-style-type:disc">benchmark contamination;</li> |
|
</ul> |
|
<p>We go over each one in the following sections.</p> |
|
<h3>Changes in the most frequent URLs [HAVE TO RECHECK]</h3> |
|
    <p>For each crawl from 2021-10 onwards, we gathered a list of the 60k most frequent <strong>FQDNs</strong> (fully qualified domain names). We then calculated the <a href="https://en.wikipedia.org/wiki/Jaccard_index">Jaccard similarity</a> between consecutive crawls. A high value means that a crawl/dump has many of the same FQDNs as the dump immediately preceding it, while a small value means that a considerable number of the top 60k FQDNs were downsampled or removed, or, alternatively, that new FQDNs entered the top 60k.</p>
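    <p>The Jaccard similarity between consecutive crawls' top-FQDN sets is straightforward to compute (a sketch; the FQDN sets below are hypothetical examples):</p>
    <pre><code class="language-python">
# Sketch: Jaccard similarity between the top-60k FQDN sets of consecutive crawls.
def jaccard(a: set[str], b: set[str]) -> float:
    return len(a.intersection(b)) / len(a.union(b))

# hypothetical example: top FQDNs per dump, keyed by dump name
top_fqdns = {
    "2021-43": {"example.com", "news.example.org", "blog.example.net"},
    "2021-49": {"example.com", "shop.example.org"},
}
dumps = sorted(top_fqdns)
for prev, curr in zip(dumps, dumps[1:]):
    print(prev, "vs", curr, jaccard(top_fqdns[prev], top_fqdns[curr]))
    </code></pre>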
|
<figure class="image"><a href="plots/Untitled%204.png"><img |
|
style="width:5026px" src="plots/Untitled%204.png"/></a></figure> |
|
<p>The data indicates three significant changes: |
|
2021-43/2021-49, 2022-33/2022-40, and 2023-40/2023-50.</p> |
|
    <p>The explanation for the changes between 2022-33/2022-40 and 2023-40/2023-50 is straightforward: CommonCrawl accidentally did not index several popular suffixes, such as .co.uk, as documented in <a href="https://commoncrawl.org/errata/co-uk-cctld-not-included">this erratum</a>. This particular change does not seem to be particularly correlated with the overall dump quality.
    </p>
|
<p>As to the shift from 2021-43 to 2021-49, which coincides |
|
with a sharp performance drop, roughly half (~30k) of the former’s top 60k FQDNs are not present in the |
|
latter’s list of top 60k FQDNs, and the dump size itself also decreased (19% reduction in WARC size, and a |
|
28% token reduction after deduplication). </p> |
|
    <p>We were unable to find a clear reason for this drastic change, but upon reaching out to CommonCrawl, we were informed that these differences likely stem from a major update to adult content and malicious site blocking. It is therefore possible that this updated adult site filter also removed a large number of high quality domains, resulting in the poor performance of this crawl. <strong>[TODO: change this framing a bit, it seems to suggest adult content is high quality for LLMs]</strong></p>
|
<h3>Synthetic data contamination [HAVE TO RECHECK]</h3> |
|
    <p>Secondly, we wondered if part of the changes in performance on recent dumps could be attributed to the presence of a larger quantity of synthetic data (data generated by LLMs). Such a change would not be surprising given the recent rise in popularity of LLMs, notably ChatGPT.</p>
|
    <p>Since, to the best of our knowledge, there is no foolproof method to detect synthetic data, we opted to use a proxy metric: we measured the frequency of the following phrases: <code>delve, as a large language model, it's important to note, rich tapestry, intertwined, certainly!, dive into</code>, which are commonly used by ChatGPT.</p>
|
    <p>It is important to note that not all samples containing one of these phrases were necessarily generated by ChatGPT (and also that many ChatGPT-generated samples do not contain any of these phrases), but assuming that the amount of synthetic data did not change across dumps, one would expect these frequencies to remain approximately constant over time.</p>
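    <p>The proxy metric itself is just a phrase-frequency count over the documents of each dump (a sketch; the phrase list is the one given above):</p>
    <pre><code class="language-python">
# Sketch of the synthetic-data proxy metric: for each dump, the fraction of
# documents containing at least one phrase frequently used by ChatGPT.
CHATGPT_PHRASES = (
    "delve", "as a large language model", "it's important to note",
    "rich tapestry", "intertwined", "certainly!", "dive into",
)

def proxy_frequency(documents: list[str]) -> float:
    hits = sum(
        any(phrase in doc.lower() for phrase in CHATGPT_PHRASES)
        for doc in documents
    )
    return hits / max(1, len(documents))
    </code></pre>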
|
<p>The results are shown in the following graph:</p> |
|
<figure class="image"><a href="plots/Untitled%205.png"><img |
|
style="width:4156px" src="plots/Untitled%205.png"/></a></figure> |
|
    <p>While the frequency remained approximately constant until 2023-14 (ChatGPT was released at the end of 2022), not only do we find a steep increase of our proxy metric in recent crawls, but the proxy metric also correlates well with the aggregate score, with a Pearson correlation of <strong>0.590</strong>. It is therefore possible that synthetic data has positively impacted performance on our selected tasks for these most recent dumps (with all the limitations of interpreting a single correlation measurement, without randomization or any causal analysis tools being used here). In particular, it could explain why the 2023-50 and 2024-10 dumps have such strong performance. </p>
|
<h3>Benchmarks contamination [HAVE TO RECHECK]</h3> |
|
    <p>Additionally, most of the benchmarks we used were introduced around <strong>2019</strong>. It is thus possible that the performance increase from 2019-XX to 2021-43 might be caused by higher benchmark contamination in those crawls. Similarly, the recent increase in LLM popularity and evaluations might have increased the contamination of recent crawls with benchmark data, explaining the score improvements of the two most recent crawls. <strong>[NOTE: the plot does not seem to support this at all]</strong></p>
|
|
|
<figure class="image"><a href="plots/Untitled%206.png"><img |
|
style="width:708px" src="plots/Untitled%206.png"/></a></figure> |
|
<hr /> |
|
<h1>Next steps</h1> |
|
<p>We want to continue improving FineWeb and will also |
|
release a technical report with more details soon.</p> |
|
<p>Adapting the FineWeb recipe [wip]</p> |
|
</d-article> |
|
|
|
<d-appendix> |
|
|
|
<h3>Contributions</h3> |
|
<p>Some text describing who did what.</p> |
|
<h3>Reviewers</h3> |
|
<p>Some text with links describing who reviewed the article.</p> |
|
|
|
<d-bibliography src="bibliography.bib"></d-bibliography> |
|
</d-appendix> |
|
</body> |
|
|