guipenedo committed
Commit d2c70ff · 1 Parent(s): 54613ab

initial commit
README.md CHANGED
@@ -1,8 +1,8 @@
 ---
-title: Blogpost
-emoji: 👁
+title: "FineWeb: 15T tokens of high quality web data"
+emoji: 🍷
 colorFrom: pink
-colorTo: yellow
+colorTo: red
 sdk: static
 pinned: false
 ---
banner.png ADDED
bibliography.bib ADDED
@@ -0,0 +1,108 @@
+ @article{gregor2015draw,
+ title={DRAW: A recurrent neural network for image generation},
+ author={Gregor, Karol and Danihelka, Ivo and Graves, Alex and Rezende, Danilo Jimenez and Wierstra, Daan},
+ journal={arXiv preprint arXiv:1502.04623},
+ year={2015},
+ url={https://arxiv.org/pdf/1502.04623.pdf}
+ }
+
+ @article{mercier2011humans,
+ title={Why do humans reason? Arguments for an argumentative theory},
+ author={Mercier, Hugo and Sperber, Dan},
+ journal={Behavioral and Brain Sciences},
+ volume={34},
+ number={02},
+ pages={57--74},
+ year={2011},
+ publisher={Cambridge Univ Press},
+ doi={10.1017/S0140525X10000968}
+ }
+
+ @article{dong2014image,
+ title={Image super-resolution using deep convolutional networks},
+ author={Dong, Chao and Loy, Chen Change and He, Kaiming and Tang, Xiaoou},
+ journal={arXiv preprint arXiv:1501.00092},
+ year={2014},
+ url={https://arxiv.org/pdf/1501.00092.pdf}
+ }
+
+ @article{dumoulin2016adversarially,
+ title={Adversarially Learned Inference},
+ author={Dumoulin, Vincent and Belghazi, Ishmael and Poole, Ben and Lamb, Alex and Arjovsky, Martin and Mastropietro, Olivier and Courville, Aaron},
+ journal={arXiv preprint arXiv:1606.00704},
+ year={2016},
+ url={https://arxiv.org/pdf/1606.00704.pdf}
+ }
+
+ @article{dumoulin2016guide,
+ title={A guide to convolution arithmetic for deep learning},
+ author={Dumoulin, Vincent and Visin, Francesco},
+ journal={arXiv preprint arXiv:1603.07285},
+ year={2016},
+ url={https://arxiv.org/pdf/1603.07285.pdf}
+ }
+
+ @article{gauthier2014conditional,
+ title={Conditional generative adversarial nets for convolutional face generation},
+ author={Gauthier, Jon},
+ journal={Class Project for Stanford CS231N: Convolutional Neural Networks for Visual Recognition, Winter semester},
+ volume={2014},
+ year={2014},
+ url={http://www.foldl.me/uploads/papers/tr-cgans.pdf}
+ }
+
+ @article{johnson2016perceptual,
+ title={Perceptual losses for real-time style transfer and super-resolution},
+ author={Johnson, Justin and Alahi, Alexandre and Fei-Fei, Li},
+ journal={arXiv preprint arXiv:1603.08155},
+ year={2016},
+ url={https://arxiv.org/pdf/1603.08155.pdf}
+ }
+
+ @article{mordvintsev2015inceptionism,
+ title={Inceptionism: Going deeper into neural networks},
+ author={Mordvintsev, Alexander and Olah, Christopher and Tyka, Mike},
+ journal={Google Research Blog},
+ year={2015},
+ url={https://research.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html}
+ }
+
+ @misc{mordvintsev2016deepdreaming,
+ title={DeepDreaming with TensorFlow},
+ author={Mordvintsev, Alexander},
+ year={2016},
+ url={https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/deepdream/deepdream.ipynb}
+ }
+
+ @article{radford2015unsupervised,
+ title={Unsupervised representation learning with deep convolutional generative adversarial networks},
+ author={Radford, Alec and Metz, Luke and Chintala, Soumith},
+ journal={arXiv preprint arXiv:1511.06434},
+ year={2015},
+ url={https://arxiv.org/pdf/1511.06434.pdf}
+ }
+
+ @inproceedings{salimans2016improved,
+ title={Improved techniques for training GANs},
+ author={Salimans, Tim and Goodfellow, Ian and Zaremba, Wojciech and Cheung, Vicki and Radford, Alec and Chen, Xi},
+ booktitle={Advances in Neural Information Processing Systems},
+ pages={2226--2234},
+ year={2016},
+ url={https://arxiv.org/pdf/1606.03498.pdf}
+ }
+
+ @article{shi2016deconvolution,
+ title={Is the deconvolution layer the same as a convolutional layer?},
+ author={Shi, Wenzhe and Caballero, Jose and Theis, Lucas and Huszar, Ferenc and Aitken, Andrew and Ledig, Christian and Wang, Zehan},
+ journal={arXiv preprint arXiv:1609.07009},
+ year={2016},
+ url={https://arxiv.org/pdf/1609.07009.pdf}
+ }
+
+ @misc{openai2018charter,
+ author={OpenAI},
+ title={OpenAI Charter},
+ type={Blog},
+ number={April 9},
+ year={2018},
+ url={https://blog.openai.com/charter}
+ }
index.html CHANGED
@@ -1,19 +1,662 @@
  <!doctype html>
- <html>
- <head>
- <meta charset="utf-8" />
- <meta name="viewport" content="width=device-width" />
- <title>My static Space</title>
- <link rel="stylesheet" href="style.css" />
- </head>
- <body>
- <div class="card">
- <h1>Welcome to your static Space!</h1>
- <p>You can modify this app directly by editing <i>index.html</i> in the Files and versions tab.</p>
- <p>
- Also don't forget to check the
- <a href="https://huggingface.co/docs/hub/spaces" target="_blank">Spaces documentation</a>.
- </p>
- </div>
- </body>
- </html>
  <!doctype html>
+
+ <head>
+ <script src="https://distill.pub/template.v2.js"></script>
+ <meta name="viewport" content="width=device-width, initial-scale=1">
+ <meta charset="utf-8">
+ <title>FineWeb: 15T tokens of high quality web data</title>
+ </head>
+
+ <body>
+ <d-front-matter>
+ <script id='distill-front-matter' type="text/json">{
+ "title": "FineWeb: 15T tokens of high quality web data",
+ "description": "This blog covers the FineWeb recipe, why more deduplication is not always better and some interesting findings on the difference in quality of CommonCrawl dumps.",
+ "published": "May 28, 2024",
+ "authors": [
+ {
+ "author":"Guilherme Penedo",
+ "authorURL":"https://huggingface.co/guipenedo",
+ "affiliations": [{"name": "HuggingFace"}]
+ },
+ {
+ "author":"Hynek Kydlíček",
+ "authorURL":"https://huggingface.co/hynky"
+ },
+ {
+ "author":"Leandro von Werra",
+ "authorURL":"https://huggingface.co/lvwerra"
+ },
+ {
+ "author":"Thomas Wolf",
+ "authorURL":"https://huggingface.co/thomwolf"
+ }
+ ],
+ "katex": {
+ "delimiters": [
+ {"left": "$$", "right": "$$", "display": false}
+ ]
+ }
+ }
+ </script>
+ </d-front-matter>
+ <d-title>
+ <figure style="grid-column: page; mix-blend-mode: multiply;">
+ <img src="banner.png" alt="FineWeb">
+ </figure>
+ </d-title>
+ <d-byline></d-byline>
+ <d-article>
+ <p>We have recently released 🍷 FineWeb, our new large-scale
+ (15T tokens, 44TB of disk space) dataset of clean text sourced from the web for LLM pretraining. You can
+ download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">here</a>.</p>
+ <p>As 🍷 FineWeb has gathered a lot of interest from the
+ community, we decided to further explain the steps involved in creating it, our processing decisions and
+ some lessons learned along the way. Read on for all the juicy details on large text dataset creation!</p>
+ <p><strong>TLDR:</strong> This blog covers the FineWeb
+ recipe, why more deduplication is not always better and some interesting findings on the difference in
+ quality of CommonCrawl dumps.</p>
+ <hr/>
+ <h1>Preamble</h1>
+ <h2>Sourcing the data</h2>
+ <p>A common question we see asked regarding web datasets used
+ to train LLMs is “where do they even get all that data?” There are generally two options:</p>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc">you either crawl it yourself, like <a
+ href="https://platform.openai.com/docs/gptbot">OpenAI</a> or <a
+ href="https://darkvisitors.com/agents/claudebot">Anthropic</a> seem to do
+ </li>
+ </ul>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc">you use a public repository of crawled webpages, like the one maintained by
+ the non-profit <a href="https://commoncrawl.org/">CommonCrawl</a></li>
+ </ul>
+ <p>For FineWeb, similarly to what was done for a large number
+ of other public datasets, we used <a href="https://commoncrawl.org/">CommonCrawl</a> as a starting point.
+ They have been crawling the web since 2007 (long before LLMs were a thing) and usually release a new dump
+ every 1 or 2 months, which can be freely downloaded.</p>
+ <p>As an example, their latest crawl (2024-10) contains 3.16
+ billion web pages, totaling 424.7 TiB of uncompressed content (the size changes from dump to dump). There
+ have been 95 dumps since 2013 and 3 dumps from 2008 to 2012, which are in a different (older) format.</p>
+ <h2>Processing at scale</h2>
+ <p>Given the sheer size of the data involved, one of the main
+ challenges we had to overcome was having a modular, scalable codebase that would allow us to quickly iterate
+ on our processing decisions and easily try out new ideas, while appropriately parallelizing our workloads
+ and providing clear insights into the data.</p>
+ <p>For this purpose, we developed <a
+ href="https://github.com/huggingface/datatrove"><code>datatrove</code></a>, an open-source data
+ processing library that allowed us to seamlessly scale our filtering and deduplication setup to thousands of
+ CPU cores. All of the data processing steps involved in the creation of FineWeb used this <a
+ href="https://github.com/huggingface/datatrove">library</a>.</p>
+ <h2>What is clean, good data?</h2>
+ <p>This is probably the main question to keep in mind when
+ creating a dataset. A good first lesson is that data that would intuitively be considered high quality by a
+ human may not necessarily be the best data (or at least not all that you need) to train a good model on.</p>
+ <p>It is still common to train a model on a given corpus
+ (Wikipedia, or some other web dataset considered clean) and use it to check the perplexity of the dataset
+ we are trying to curate. Unfortunately this does not always correlate with performance on downstream
+ tasks, so another often-used approach is to train small models (small because training models is
+ expensive and time consuming, and we want to be able to quickly iterate) on our dataset and evaluate them on
+ a set of evaluation tasks. As we are curating a dataset for pretraining a generalist LLM, it is important to
+ choose a diverse set of tasks and try not to overfit to any one individual benchmark.</p>
+ <p>Another way to evaluate different datasets would be to
+ train a model on each one and have humans rate and compare their outputs (like on the <a
+ href="https://chat.lmsys.org/">LMSYS Chatbot Arena</a>). This would arguably provide the most
+ reliable results in terms of representing real model usage, but getting ablation results this way is too
+ expensive and slow.</p>
+ <p>The approach we ultimately went with was to train small
+ models and evaluate them on a set of benchmark tasks. We believe this is a reasonable proxy for the quality
+ of the data used to train these models.</p>
+ <h3>Ablations and evaluation setup</h3>
+ <p>To be able to compare the impact of a given processing
+ step, we would train 2 models: one where the data included the extra step and another where this step was
+ ablated (cut/removed). These 2 models would have the same number of parameters and architecture, and be trained
+ on an equal number of tokens and with the same hyperparameters — the only difference would be in the
+ training data. We would then evaluate each model on the same set of tasks and compare the average
+ scores.</p>
+ <p>Our ablation models were trained using <a
+ href="https://github.com/huggingface/nanotron"><code>nanotron</code></a> with this config [<strong>TODO:
+ INSERT SIMPLIFIED NANOTRON CONFIG HERE</strong>]. The models had 1.82B parameters, used the Llama
+ architecture with a 2048 sequence length, and a global batch size of ~2 million tokens. For filtering
+ ablations we mostly trained on ~28B tokens (which is roughly the Chinchilla-optimal training size for this
+ model size).</p>
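+ <p>In code form, an illustrative summary of this setup (a sketch only; these
+ field names are our shorthand, not nanotron’s actual config schema):</p>
+ <pre><code class="language-python">
+ # Illustrative summary of the ablation training setup described above.
+ # Field names are our own shorthand, not nanotron's config schema.
+ ablation_setup = {
+     "parameters": 1.82e9,                   # Llama-style architecture
+     "sequence_length": 2048,
+     "global_batch_size_tokens": 2_000_000,  # ~2M tokens per step
+     "training_tokens": 28_000_000_000,      # ~28B, roughly Chinchilla-optimal
+ }
+ steps = ablation_setup["training_tokens"] // ablation_setup["global_batch_size_tokens"]
+ print(steps)  # ~14,000 optimizer steps per ablation run
+ </code></pre>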
+ <p>We evaluated the models using <a
+ href="https://github.com/huggingface/lighteval/"><code>lighteval</code></a>. We tried selecting
+ benchmarks that would provide good signal at a relatively small scale (small models trained on only a few
+ billion tokens). Furthermore, we also used the following criteria when selecting benchmarks:</p>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc">small variance between runs trained on different samplings of the same
+ dataset: we want our runs on a subset of the data to be representative of the whole dataset, and the
+ resulting scores to have as little noise as possible
+ </li>
+ </ul>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc">performance increasing monotonically (or close to it) over a training run:
+ ideally, as the number of seen tokens increases, the performance on this benchmark should not decrease
+ (and should not be too noisy)
+ </li>
+ </ul>
+ <p>You can find the full list of tasks and prompts we used <a
+ href="https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/lighteval_tasks.py">here</a>. To
+ get results quickly, we capped longer benchmarks at 1000 samples (wall-clock evaluation taking less than 5
+ min on a single node of 8 GPUs, done in parallel to the training).</p>
+ <hr />
+ <h1>The FineWeb recipe</h1>
+ <p>In the next subsections we will explain each of the steps
+ taken to produce the FineWeb dataset. You can find a full reproducible <code>datatrove</code> config <a
+ href="https://github.com/huggingface/datatrove/blob/main/examples/fineweb.py">here</a>.</p>
+ <style>
+ .neighborhood-figure-container {grid-column: screen; width: 100%; margin: auto; margin-top: 30px; margin-bottom: 30px; padding-top: 20px; padding-bottom: 10px; border-bottom: 1px solid #EEE; border-top: 1px solid #EEE;}
+ </style>
+ <div class="neighborhood-figure-container">
+ <figure class="image">
+ <img style="width:708px" src="plots/fineweb-recipe.png"/>
+ </figure>
+ </div>
+ <h2>Starting point: text extraction</h2>
+ <p>CommonCrawl data is available in two main formats: WARC
+ and WET. <strong>WARC</strong> (Web ARChive format) files contain the raw data from the crawl, including the
+ full page HTML and request metadata. <strong>WET</strong> (WARC Encapsulated Text) files provide a text-only
+ version of those websites.</p>
+ <p>A large number of datasets take the WET files as their
+ starting point. In our experience the default text extraction (extracting the main text of a webpage from
+ its HTML) used to create these WET files is suboptimal, and there are a variety of open-source libraries that
+ provide better text extraction (namely by keeping less boilerplate content and fewer navigation menus). We extracted
+ the text content from the WARC files using the <a href="https://trafilatura.readthedocs.io/en/latest/">trafilatura</a>
+ library. It is important to note, however, that text extraction is one of the most costly steps of our
+ processing, so we believe that using the readily available WET data could be a reasonable trade-off for
+ lower budget teams.</p>
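+ <p>As a minimal sketch of this extraction step (assuming the <code>warcio</code> and
+ <code>trafilatura</code> packages; the real pipeline adds parallelism and error handling):</p>
+ <pre><code class="language-python">
+ # Minimal sketch: extract main text from HTML pages stored in a WARC file.
+ # Assumes the warcio and trafilatura packages; datatrove's production
+ # pipeline adds parallelism, encoding handling and error recovery.
+ import trafilatura
+ from warcio.archiveiterator import ArchiveIterator
+ 
+ with open("example.warc.gz", "rb") as stream:
+     for record in ArchiveIterator(stream):
+         if record.rec_type != "response":
+             continue  # skip request/metadata records
+         html = record.content_stream().read().decode("utf-8", errors="ignore")
+         text = trafilatura.extract(html)  # returns None when extraction fails
+         if text:
+             print(text[:200])
+ </code></pre>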
+ <p>To validate this decision, we processed the 2019-18 dump
+ directly using the WET files and with text extracted from WARC files using trafilatura. We applied the same
+ processing to each one (our base filtering + minhash, detailed below) and trained two models. While the
+ resulting dataset is considerably larger for the WET data (around 254B tokens), it proves to be of much worse
+ quality than the one that used trafilatura to extract text from WARC files (which is around 200B tokens). Many of
+ these additional tokens in the WET files are unnecessary page boilerplate.</p>
+ <figure class="image"><a href="plots/wet_comparison.png"><img
+ style="width:640px" src="plots/wet_comparison.png"/></a></figure>
+
+ <h2>Base filtering</h2>
+ <p>Filtering is an important part of the curation process. It
+ removes part of the data (be it words, lines, or full documents) that would harm performance and is thus
+ deemed to be “lower quality”.</p>
+ <p>As a basis for our filtering we used part of the setup
+ from <a href="https://arxiv.org/abs/2306.01116">RefinedWeb</a>. Namely, we:</p>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc">Applied URL filtering using a <a
+ href="https://dsi.ut-capitole.fr/blacklists/">blocklist</a> to remove adult content
+ </li>
+ </ul>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc">Applied a <a
+ href="https://fasttext.cc/docs/en/language-identification.html">fastText language classifier</a> to
+ keep only English text with a score ≥ 0.65 (see the sketch below)
+ </li>
+ </ul>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc">Applied quality and repetition filters from the <a
+ href="https://arxiv.org/abs/2112.11446">Gopher</a> paper (using the default thresholds)
+ </li>
+ </ul>
+ <p>After applying this filtering to each of the text
+ extracted dumps (there are currently 95 dumps) we obtained roughly 36 trillion tokens of data (when
+ tokenized with the <code>gpt2</code> tokenizer).</p>
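+ <p>A minimal sketch of the language filtering step (assuming fastText’s
+ public <code>lid.176.bin</code> language identification model; the 0.65 threshold is the one above):</p>
+ <pre><code class="language-python">
+ # Sketch of fastText-based language filtering, as described above.
+ # Assumes fastText's public lid.176.bin language-ID model is downloaded.
+ import fasttext
+ 
+ model = fasttext.load_model("lid.176.bin")
+ 
+ def keep_document(text: str, threshold: float = 0.65) -> bool:
+     # fastText expects a single line of input text
+     labels, scores = model.predict(text.replace("\n", " "))
+     return labels[0] == "__label__en" and scores[0] >= threshold
+ </code></pre>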
+ <h2>Deduplication</h2>
+ <p>Deduplication is another important step, especially for web
+ datasets, and one of the most important when creating large web datasets for LLMs. Deduplication methods
+ attempt to identify and remove redundant/repeated data.</p>
+ <h3>Why deduplicate?</h3>
+ <p>The web has many aggregators, mirrors, templated pages or
+ just otherwise repeated content spread over different domains and webpages. Often, these duplicated pages
+ can be introduced by the crawler itself, when different links point to the same page.</p>
+ <p>Removing these duplicates (deduplicating) has been <a
+ href="https://arxiv.org/abs/2107.06499">linked to an improvement in model performance</a> and a <a
+ href="https://arxiv.org/abs/2202.07646">reduction in memorization of pretraining data</a>, which might
+ allow for better generalization. Additionally, the performance uplift can also be tied to increased training
+ efficiency: by removing duplicated content, for the same number of training tokens, a model will have seen
+ more diverse data.</p>
+ <p>There are different ways to identify and even define
+ duplicated data. Common approaches rely on hashing techniques to speed up the process, or on building
+ efficient data structures to index the data (like suffix arrays). Methods can also be “fuzzy”, using some
+ similarity metric to mark documents as duplicates, or “exact”, checking for exact matches between two
+ documents (or lines, paragraphs, or whatever granularity level is being used).</p>
+ <h3>Our deduplication parameters</h3>
+ <p>Similarly to RefinedWeb, we decided to apply MinHash, a
+ fuzzy hash-based deduplication technique. We chose to compute minhashes on each document’s 5-grams, using
+ 112 hash functions in total, split into 14 buckets of 8 hashes each — targeting documents that are at least
+ 75% similar. Documents with the same 8 minhashes in any bucket are considered duplicates of each other.</p>
+ <p>This means that for two documents with a similarity (<code>s</code>)
+ of 0.7, 0.75, 0.8 and 0.85, the probability that they would be identified as duplicates would be 56%, 77%,
+ 92% and 98.8% respectively (<code>1-(1-s^8)^14</code>). See the plot below for a match probability
+ comparison between our setup with 112 hashes and the one from RefinedWeb, with 9000 hashes divided into 450
+ buckets of 20 hashes (which requires a substantially larger amount of compute resources):</p>
+ <figure class="image"><a
+ href="plots/minhash_parameters_comparison.png"><img style="width:567px"
+ src="plots/minhash_parameters_comparison.png"/></a>
+ </figure>
+ <p>While the high number of hash functions in RefinedWeb
+ allows for a steeper, better-defined cutoff, we believe the compute and storage savings are a reasonable
+ trade-off.</p>
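+ <p>The match probability for a scheme with <code>b</code> buckets of <code>r</code> hashes each
+ is <code>1-(1-s^r)^b</code>; a quick sketch to reproduce the numbers above:</p>
+ <pre><code class="language-python">
+ # Probability that two documents with 5-gram similarity s share all r
+ # minhashes in at least one of b buckets: 1 - (1 - s**r)**b
+ def match_probability(s: float, r: int, b: int) -> float:
+     return 1 - (1 - s**r) ** b
+ 
+ for s in (0.7, 0.75, 0.8, 0.85):
+     ours = match_probability(s, r=8, b=14)          # 112 hashes (FineWeb)
+     refinedweb = match_probability(s, r=20, b=450)  # 9000 hashes
+     print(f"s={s}: ours={ours:.3f}, refinedweb={refinedweb:.3f}")
+ # s=0.7 gives ours≈0.564; s=0.85 gives ours≈0.988, matching the text above.
+ </code></pre>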
+ <h3>More deduplication is always better, right?</h3>
+ <p>Our initial approach was to take the entire dataset (all
+ 95 dumps) and deduplicate it as one big dataset using MinHash.</p>
+ <p>We did this in an iterative manner: starting with the most
+ recent dump (which at the time was 2023-50) and taking the oldest one last, we would deduplicate each dump
+ not only against itself but also by removing any matches with duplicates from the previously processed
+ dumps.</p>
+ <p>For instance, for the second most recent dump (2023-40 at
+ the time), we deduplicated it against the most recent one in addition to itself. In particular, the oldest
+ dump was deduplicated against all other dumps. As a result, more data was removed from the oldest dumps (last
+ to be deduplicated) than from the most recent ones.</p>
+ <p>Deduplicating the dataset in this manner resulted in 4
+ trillion tokens of data but, quite surprisingly to us, when training on a randomly sampled 350 billion
+ token subset, the model showed no improvement over one trained on the non-deduplicated data (see the orange and
+ green curves below), scoring far below its predecessor RefinedWeb on our aggregate of tasks.</p>
+ <figure class="image"><a href="plots/dedup_all_dumps_bad.png"><img
+ style="width:576px" src="plots/dedup_all_dumps_bad.png"/></a></figure>
+ <p>This was quite puzzling, as our intuition regarding web
+ data was that more deduplication would always result in improved performance. We decided to take a closer
+ look at one of the oldest dumps, dump 2013-48:</p>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc">pre-deduplication, this dump had ~490 billion tokens</li>
+ </ul>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc">after our iterative MinHash, ~31 billion tokens remained (94% of data
+ removed)
+ </li>
+ </ul>
+ <p>As an experiment, we tried training two models on 28B tokens
+ sampled from the following data from 2013-48:</p>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc">the fully deduplicated remaining ~31 billion tokens (<em>originally kept
+ data</em>)
+ </li>
+ </ul>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc">171 billion tokens obtained by individually deduplicating (without
+ considering the other dumps) the ~460 billion tokens that had been removed from this dump in the
+ iterative dedup process (<em>originally removed data</em>)
+ </li>
+ </ul>
+ <figure class="image"><a
+ href="plots/removed_data_cross_dedup.png"><img style="width:576px"
+ src="plots/removed_data_cross_dedup.png"/></a></figure>
+ <p>These results show that, for this older dump from which we
+ removed over 90% of the original data, the data that was kept was actually <em>worse</em> than the data
+ removed (considered independently from all the other dumps).</p>
+ <h3>Taking a step back: individual dump dedup</h3>
+ <p>We then tried an alternative approach: we deduplicated
+ each dump with MinHash individually (without considering the other dumps). This resulted in 20 trillion
+ tokens of data.</p>
+ <p>When training on a random sample from this dataset, we see
+ that it now matches RefinedWeb’s performance (blue and red curves below):</p>
+ <figure class="image"><a
+ href="plots/cross_ind_unfiltered_comparison.png"><img style="width:576px"
+ src="plots/cross_ind_unfiltered_comparison.png"/></a>
+ </figure>
+ <p>We hypothesize that the main improvement gained from
+ deduplication is the removal of very large clusters that are present in every single dump (you will find
+ some examples of these clusters in the RefinedWeb paper, each containing <em>hundreds of thousands</em> of
+ documents) and that further deduplication of documents with a low number of duplicates (fewer than ~100,
+ i.e. the number of dumps) actually harms performance: data that does not find a duplicate match in any
+ other dump might actually be of worse quality/more out of distribution (as evidenced by the results on the
+ 2013-48 data).</p>
+ <p>While you might see some performance improvement when
+ deduplicating a few dumps together, at the scale of all the dumps this side effect of upsampling
+ lower-quality data seems to have a greater impact.</p>
+ <p>One possibility to consider is that as filtering quality
+ improves, this effect may not be as prevalent, since the filtering might be able to remove some of this
+ lower quality data. We also experimented with applying different, and often “lighter”, deduplication
+ approaches on top of the individually deduplicated dumps. You can read about them further below.</p>
+ <h3>A note on measuring the effect of deduplication</h3>
+ <p>Given the nature of deduplication, its effect is not
+ always very visible in a smaller slice of the dataset (such as 28B tokens, the size we used for our
+ filtering ablations). Furthermore, one must consider the fact that there are specific effects at play when
+ deduplicating across all CommonCrawl dumps, as some URLs/pages are recrawled from one dump to the next.</p>
+ <p>To visualize the effect of scaling the number of training
+ tokens on measuring deduplication impact, we considered the following (very extreme and unrealistic
+ regarding the degree of duplication observed) theoretical scenario:</p>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc">there are 100 CommonCrawl dumps (actually roughly true)</li>
+ </ul>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc">each dump has been perfectly individually deduplicated (every single
+ document in it is unique)
+ </li>
+ </ul>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc">each dump is a perfect copy of every other (maximum possible duplication
+ across dumps, effectively the worst case scenario)
+ </li>
+ </ul>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc">each dump has 200 billion tokens (for a total of 20 trillion, the resulting
+ size of our individual dedup above)
+ </li>
+ </ul>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc">each dump is made up of documents of 1k tokens (200M documents per dump)
+ </li>
+ </ul>
+ <p>We then simulated uniformly sampling documents from this
+ entire dataset of 20 trillion tokens, to obtain subsets of 1B, 10B, 100B, 350B and 1T tokens. In the image
+ below you can see how often each document would be repeated.</p>
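+ <p>A sketch of this simulation (under the assumptions above, the number of
+ sampled copies of any given unique document follows a binomial distribution):</p>
+ <pre><code class="language-python">
+ # Simulation sketch for the scenario above: 100 identical dumps of 200M
+ # unique 1k-token documents each (20T tokens total). When sampling n docs
+ # uniformly, the copies drawn of one unique document ~ Binomial(n, 100/2e10).
+ import numpy as np
+ 
+ rng = np.random.default_rng(0)
+ n_unique, n_dumps, doc_tokens = 200_000_000, 100, 1000
+ p = n_dumps / (n_unique * n_dumps)  # probability one draw hits a given doc
+ 
+ for tokens in (1e9, 10e9, 100e9, 350e9, 1e12):
+     n_sampled = int(tokens / doc_tokens)
+     # per-document duplicate counts, estimated over a subsample of uniques
+     draws = rng.binomial(n_sampled, p, size=1_000_000)
+     print(f"{tokens:.0e} tokens:", np.bincount(draws)[1:])  # docs seen 1x, 2x, ...
+ </code></pre>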
+ <figure class="image"><a href="plots/dedup_impact_simulation.png"><img
+ style="width:708px" src="plots/dedup_impact_simulation.png"/></a></figure>
+ <p>For 1B almost all documents would be unique
+ (#duplicates=1), despite the fact that in the entire dataset each document is repeated 100 times (once per
+ dump). We start seeing some changes at the 100B scale (0.5% of the total dataset), with a large number of
+ documents being repeated twice, and a few even 4-8 times. At the larger scale of 1T (5% of the total
+ dataset), the majority of the documents are repeated up to 8 times, with some being repeated up to 16
+ times.</p>
+ <p>We ran our performance evaluations for the deduplicated
+ data at the 350B scale, which would, under this theoretical scenario, be made up of a significant portion of
+ documents duplicated up to 8 times. This simulation illustrates the inherent difficulties associated with
+ measuring deduplication impact on the training of LLMs, once the biggest document clusters have been
+ removed.</p>
+ <h3>Other (failed) approaches</h3>
+ <p>We attempted to improve the performance of the
+ independently minhash-deduped 20T tokens of data by further deduplicating them with the following methods:</p>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc">URL deduplication, where we only kept one document per normalized
+ (lowercased) URL (71.5% of tokens removed, 5.6T left) — <em>FineWeb URL dedup</em> (see the sketch below)</li>
+ </ul>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc">Line deduplication:
+ <ul class="bulleted-list">
+ <li style="list-style-type:circle">remove all but 1 occurrence of each duplicated line (77.8% of
+ tokens dropped, 4.4T left) — <em>FineWeb line dedup</em></li>
+ </ul>
+ <ul class="bulleted-list">
+ <li style="list-style-type:circle">same as above, but only removing duplicate lines with at least 10
+ words and dropping documents with fewer than 3 sentences after deduplication (85% of tokens
+ dropped, 2.9T left) — <em>FineWeb line dedup w/ min words</em></li>
+ </ul>
+ <ul class="bulleted-list">
+ <li style="list-style-type:circle">remove all but 1 occurrence of each span of 3 duplicated lines
+ with all numbers replaced by 0 (80.9% of tokens removed, 3.7T left) — <em>FineWeb 3-line
+ dedup</em></li>
+ </ul>
+ </li>
+ </ul>
+ <p>The performance of the models trained on each of these was
+ consistently worse (even if to different degrees) than that of the original independently deduplicated
+ data:</p>
+ <figure class="image"><a href="plots/Untitled.png"><img
+ style="width:708px" src="plots/Untitled.png"/></a></figure>
+ <h2>Additional filtering</h2>
+ <p>By this point we had reached the same performance as
+ RefinedWeb but, on our aggregate of tasks, another heavily filtered dataset, <a
+ href="https://arxiv.org/abs/1910.10683">the C4 dataset</a>, still showed stronger performance (with
+ the caveat that it is a relatively small dataset by current web-scale standards).</p>
+ <p>We therefore set out to find new filtering steps that
+ would, at first, allow us to match the performance of C4 and eventually surpass it. A natural starting point
+ was to look into the processing of C4 itself.</p>
+ <h3>C4: A dataset that has stood the test of time</h3>
+ <p>The <a href="https://huggingface.co/datasets/c4">C4
+ dataset</a> was first released in 2019. It was obtained from the <code>2019-18</code> CommonCrawl dump by
+ removing non-English data, applying some heuristic filters at both the line and document level,
+ deduplicating at the line level and removing documents containing words from a word blocklist.</p>
+ <p>Despite its age and limited size (around 175B gpt2
+ tokens), models trained on this dataset have strong performance, excelling in particular on the HellaSwag
+ benchmark, one of the benchmarks in our “early signal” group with the strongest signal and highest
+ signal-over-noise ratio. As such, it has stayed a common subset of typical LLM training data, for instance in
+ <a href="https://arxiv.org/abs/2302.13971">the relatively recent Llama1 model</a>. We experimented with applying
+ each of the different filters used in C4 to a baseline of the independently deduped FineWeb 2019-18 dump
+ (plot smoothed with a 3-checkpoint sliding window):</p>
+ <figure class="image"><a href="plots/c4_filters.png"><img
+ style="width:708px" src="plots/c4_filters.png"/></a></figure>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc">applying “All filters” (drop lines not ending on punctuation marks,
+ mentioning javascript, or containing cookie notices + drop documents outside length thresholds, containing “lorem
+ ipsum” or a curly bracket, <code>{</code>) allows us to match C4’s HellaSwag performance (purple versus
+ pink curves)
+ </li>
+ </ul>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc">the curly bracket filter and the word lengths filter only give a small
+ boost, removing 2.8% and 4.3% of tokens, respectively
+ </li>
+ </ul>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc">the terminal punctuation filter, by itself, gives the biggest individual
+ boost, but removes <em>around 30%</em> of all tokens (!)
+ </li>
+ </ul>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc">the lorem_ipsum, javascript and policy rules each remove &lt;0.5% of
+ training tokens, so we did not train on them individually
+ </li>
+ </ul>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc">all filters except the very destructive terminal_punct perform better than
+ terminal_punct by itself, while removing less in total (~7%)
+ </li>
+ </ul>
+ <p>We decided to apply all C4 filters mentioned above except
+ the terminal punctuation one (a simplified sketch of these heuristics follows). We validated these results
+ with a longer run, which you will find in a plot in the next section.</p>
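+ <p>For illustration, a simplified paraphrase of these C4-style heuristics
+ (not C4’s exact implementation; rules and thresholds differ in detail):</p>
+ <pre><code class="language-python">
+ # Simplified paraphrase of some C4-style line/document heuristics.
+ # Not C4's exact implementation; rules and thresholds differ in detail.
+ TERMINAL_PUNCT = (".", "!", "?", '"')
+ 
+ def c4_like_filter(text: str):
+     lower = text.lower()
+     if "lorem ipsum" in lower or "{" in text:
+         return None  # drop boilerplate or code-like documents
+     kept_lines = [
+         line for line in text.splitlines()
+         if line.rstrip().endswith(TERMINAL_PUNCT)
+         and "javascript" not in line.lower()
+         and "cookie" not in line.lower()
+     ]
+     return "\n".join(kept_lines) if kept_lines else None  # None drops the doc
+ </code></pre>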
+ <h3>A statistical approach to develop heuristic filters</h3>
+ <p>To come up with new possible filtering rules, we collected
+ a very large list of statistics (statistical metrics) — over <strong>50</strong> — from different reference
+ datasets (C4, RefinedWeb, etc.) and from a select list of our processed dumps, on both the independently
+ minhashed version and the result of the (worse quality) full dedup. This allowed us to compare the
+ different datasets at a macro level, by looking at the distribution of these metrics for each one.</p>
+ <p>The collected statistics ranged from common document-level
+ metrics (e.g. number of lines, avg. line/word length, etc.) to inter-document repetition metrics (Gopher
+ inspired). Perhaps not too surprisingly given our findings for deduplication, we found significant
+ disparities in most of the metrics between the two deduplication methods. For instance, the <code>line-char-duplicates</code>
+ metric (nb. of characters in duplicated lines / nb. of characters) roughly doubled from the independent dedup
+ (0.0053 for 2015-22 and 0.0058 for 2013-48) to the full dedup (0.011 for 2015-22 and 0.01 for 2013-48),
+ indicating that the latter had higher inter-document repetition.</p>
+ <p>Working under the assumption that these differences were
+ caused by lower quality data in the full dedup version, we inspected histograms and manually defined
+ thresholds for the metrics where these differences were starkest. This process yielded 17 candidate
+ threshold-filter pairs. In the image below, you can see 3 of these histograms.</p>
+ <figure class="image"><a href="plots/Untitled%201.png"><img
+ style="width:790px" src="plots/Untitled%201.png"/></a></figure>
+
+ <p>To assess the effectiveness of these newly created
+ filters, we conducted <strong>28B-token</strong> ablation runs on the <strong>2019-18 crawl</strong>. Out
+ of all those runs, we identified three filters (the ones based on the histograms above) that demonstrated
+ the most significant improvements on the aggregate score:</p>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc">Remove documents where the fraction of lines ending with punctuation ≤ 0.12
+ (10.14% of tokens removed) — vs the 30% from the original C4 terminal punct filter
+ </li>
+ </ul>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc">Remove documents where the fraction of characters in duplicated lines ≥ 0.1
+ (12.47% of tokens removed) — the original Gopher threshold for this ratio is ≥ 0.2
+ </li>
+ </ul>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc">Remove documents where the fraction of lines shorter than 30 characters ≥
+ 0.67 (3.73% of tokens removed)
+ </li>
+ </ul>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc">When applying the 3 together, ~22% of tokens were removed (a sketch of these
+ statistics follows)</li>
+ </ul>
+ <figure class="image"><a href="plots/Untitled%202.png"><img
+ style="width:708px" src="plots/Untitled%202.png"/></a></figure>
+ <hr />
+ <h1>The final dataset</h1>
+ <p>The final FineWeb dataset comprises 15T tokens and
+ includes the following previously mentioned steps, in order, each providing a performance boost on our group
+ of benchmark tasks:</p>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc">base filtering</li>
+ </ul>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc">independent MinHash deduplication per dump</li>
+ </ul>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc">a selection of C4 filters</li>
+ </ul>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc">our custom filters (mentioned in the previous section)</li>
+ </ul>
+ <figure class="image"><a href="plots/fineweb_all_filters.png"><img
+ style="width:708px" src="plots/fineweb_all_filters.png"/></a></figure>
+ <p>We compared 🍷 FineWeb with the following datasets:</p>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc"><a
+ href="https://huggingface.co/datasets/tiiuae/falcon-refinedweb">RefinedWeb</a>
+ </li>
+ </ul>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc"><a href="https://huggingface.co/datasets/allenai/c4">C4</a></li>
+ </ul>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc"><a href="https://huggingface.co/datasets/allenai/dolma">Dolma v1.6</a> (the
+ CommonCrawl part)
+ </li>
+ </ul>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc"><a href="https://huggingface.co/datasets/EleutherAI/pile">The Pile</a></li>
+ </ul>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc"><a
+ href="https://huggingface.co/datasets/cerebras/SlimPajama-627B">SlimPajama</a>
+ </li>
+ </ul>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc"><a
+ href="https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2">RedPajama2</a>
+ (deduplicated)
+ </li>
+ </ul>
+ <p>You will find the models trained on each of these datasets in <a
+ href="https://huggingface.co/collections/HuggingFaceFW/ablation-models-662457b0d213e8c14fe47f32">this
+ collection</a>. We have uploaded checkpoints for every 1000 training steps. You will also find our full <a
+ href="https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/eval_results.csv">evaluation
+ results here</a>.</p>
+ <figure class="image"><a href="plots/fineweb_ablations.png"><img
+ style="width:708px" src="plots/fineweb_ablations.png"/></a></figure>
+ <p>Some histogram comparisons of C4, Dolma, RefinedWeb and
+ FineWeb:</p>
+ <figure class="image"><a href="plots/Untitled%203.png"><img
+ style="width:4587px" src="plots/Untitled%203.png"/></a></figure>
+ <hr />
+ <h1>Just like fine wine, not all crawls are created
+ equal</h1>
+ <p>During our ablation runs, we observed that certain crawls
+ outperformed others by a significant margin. To investigate this phenomenon, we conducted 27B-token runs for
+ each dump (using the version with base filtering + individual dedup), with 2 trainings per dump, each using
+ a different data subset. We trained 190 such models, totaling over 60k H100 GPU-hours. We subsequently took
+ the last 3 checkpoints of both seeds and plotted the average of these 6 data points per dump.</p>
+ <p>The plot below clearly shows that some dumps perform far
+ worse than others. Each year has a different color, and the number of crawls per year also changes.</p>
+ <figure class="image"><a href="plots/score_by_dump.png"><img
+ style="width:708px" src="plots/score_by_dump.png"/></a></figure>
+ <p>We identified 5 main relevant time intervals:</p>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc">2013 to 2016: relatively stable, average quality</li>
+ </ul>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc">2017 to 2018: high quality, with a drop by the end of 2018</li>
+ </ul>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc">2019 to 2021: high quality, steadily increasing</li>
+ </ul>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc">2021-49 and 2022: very large drop in performance, followed by worse quality
+ dumps
+ </li>
+ </ul>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc">2023 and 2024-10: almost exponential improvement. In particular, 2023-50
+ and 2024-10 are by far the best dumps
+ </li>
+ </ul>
+ <p>One possibility to improve performance when training
+ models on &lt; 15T tokens would be to train on FineWeb while excluding the worst-quality CommonCrawl dumps.</p>
+ <p>We conducted further analysis to investigate the factors
+ causing these differences from dump to dump. In particular, we considered 3 potential causes:</p>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc">large sudden changes in the list of crawled URLs;</li>
+ </ul>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc">synthetic (LLM generated) data;</li>
+ </ul>
+ <ul class="bulleted-list">
+ <li style="list-style-type:disc">benchmark contamination.</li>
+ </ul>
+ <p>We go over each one in the following sections.</p>
+ <h3>Changes in the most frequent URLs [HAVE TO RECHECK]</h3>
+ <p>For each crawl from 2021-10 onwards, we gathered a list of
+ the 60k most frequent <strong>FQDNs</strong> (fully qualified domain names). We then calculated the <a
+ href="https://en.wikipedia.org/wiki/Jaccard_index">Jaccard similarity</a> between consecutive
+ crawls. A high value means that a crawl/dump has many of the same FQDNs as the dump immediately preceding
+ it, while a small value means that a considerable number of top 60k FQDNs were downsampled or removed, or
+ alternatively that new FQDNs were added to the top 60k.</p>
+ <figure class="image"><a href="plots/Untitled%204.png"><img
+ style="width:5026px" src="plots/Untitled%204.png"/></a></figure>
+ <p>The data indicates three significant changes:
+ 2021-43/2021-49, 2022-33/2022-40, and 2023-40/2023-50.</p>
+ <p>The explanation for the changes between 2022-33/2022-40
+ and 2023-40/2023-50 is straightforward: CommonCrawl accidentally did not index several popular suffixes,
+ such as .co.uk, as documented in <a href="https://commoncrawl.org/errata/co-uk-cctld-not-included">this
+ erratum</a>. This particular change does not seem particularly correlated with the overall dump quality.
+ </p>
+ <p>As to the shift from 2021-43 to 2021-49, which coincides
+ with a sharp performance drop, roughly half (~30k) of the former’s top 60k FQDNs are not present in the
+ latter’s list of top 60k FQDNs, and the dump size itself also decreased (19% reduction in WARC size, and a
+ 28% token reduction after deduplication).</p>
+ <p>We were unable to find a clear reason for this drastic
+ change, but upon reaching out to CommonCrawl, we were informed that these differences likely stem from a
+ major update in adult content and malicious site blocking. It is therefore possible that the updated
+ adult site filter also removed a high number of high quality domains, resulting in the poor
+ performance of the crawl. <strong>[TODO: change this framing a bit, it seems to suggest adult content is
+ high quality for LLMs]</strong></p>
+ <h3>Synthetic data contamination [HAVE TO RECHECK]</h3>
+ <p>Secondly, we wondered if part of the changes in
+ performance on recent dumps could be attributed to the presence of a larger quantity of synthetic data (data
+ generated by LLMs). Such a change would not be surprising given the recent increase in popularity of LLMs,
+ notably of ChatGPT.</p>
+ <p>Since, to the best of our knowledge, there is no foolproof
+ method to detect synthetic data, we opted to use a proxy metric: we measured the frequency of the
+ following phrases: <code>delve, as a large language model, it&#x27;s important to note, rich tapestry,
+ intertwined, certainly!, dive into</code>, which are commonly used by ChatGPT.</p>
+ <p>It is important to note that not all samples containing
+ one of these phrases were necessarily generated by ChatGPT (and also that many ChatGPT-generated samples do
+ not contain any of these phrases), but assuming that the amount of synthetic data did not change across
+ dumps, one would expect these frequencies to remain approximately constant over time.</p>
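+ <p>A sketch of this proxy measurement, counting the fraction of documents per
+ crawl that contain at least one of the phrases (<code>iter_documents</code> is an assumed helper
+ yielding each document’s text for a given crawl):</p>
+ <pre><code class="language-python">
+ # Proxy metric sketch: fraction of documents per crawl containing at least
+ # one ChatGPT-associated phrase. iter_documents(crawl_id) is an assumed
+ # helper that yields the text of each document in a crawl.
+ PHRASES = [
+     "delve", "as a large language model", "it's important to note",
+     "rich tapestry", "intertwined", "certainly!", "dive into",
+ ]
+ 
+ def synthetic_proxy(crawl_id: str) -> float:
+     hits = total = 0
+     for text in iter_documents(crawl_id):
+         total += 1
+         lower = text.lower()
+         hits += any(phrase in lower for phrase in PHRASES)
+     return hits / max(total, 1)
+ </code></pre>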
+ <p>The results are shown in the following graph:</p>
+ <figure class="image"><a href="plots/Untitled%205.png"><img
+ style="width:4156px" src="plots/Untitled%205.png"/></a></figure>
+ <p>While the frequency remained approximately constant until
+ 2023-14 (ChatGPT was released at the end of 2022), not only do we find a steep increase of our proxy metric
+ in recent crawls, but the proxy metric also correlates well with the aggregate score, with a Pearson correlation of
+ <strong>0.590</strong>. It is therefore possible that synthetic data has positively impacted performance on
+ our selected tasks for these most recent dumps (with all the limitations in interpretation of a single
+ correlation measurement, without randomization or any causality tools being used here). In
+ particular, it could explain why the 2023-50 and 2024-10 dumps have such strong performance.</p>
+ <h3>Benchmarks contamination [HAVE TO RECHECK]</h3>
+ <p>Also, most of the benchmarks we used were introduced around
+ <strong>2019</strong>. It’s thus possible that the 2019-XX to 2021-43 performance increase might be caused by
+ higher benchmark contamination in those crawls. Similarly, the recent increase in LLM popularity and
+ evaluations might have increased the contamination in recent benchmarks, explaining the score improvements
+ of the two most recent crawls. <strong>[NOTE: the plot does not seem to support this at all]</strong></p>
+
+ <figure class="image"><a href="plots/Untitled%206.png"><img
+ style="width:708px" src="plots/Untitled%206.png"/></a></figure>
+ <hr />
+ <h1>Next steps</h1>
+ <p>We want to continue improving FineWeb and will also
+ release a technical report with more details soon.</p>
+ <p>Adapting the FineWeb recipe [wip]</p>
+ </d-article>
+
+ <d-appendix>
+
+ <h3>Contributions</h3>
+ <p>Some text describing who did what.</p>
+ <h3>Reviewers</h3>
+ <p>Some text with links describing who reviewed the article.</p>
+
+ <d-bibliography src="bibliography.bib"></d-bibliography>
+ </d-appendix>
+ </body>
plots/FineWeb.png ADDED
plots/Untitled 1.png ADDED
plots/Untitled 2.png ADDED
plots/Untitled 3.png ADDED
plots/Untitled 4.png ADDED
plots/Untitled 5.png ADDED
plots/Untitled 6.png ADDED
plots/Untitled.png ADDED
plots/c4_filters.png ADDED
plots/cross_ind_unfiltered_comparison.png ADDED
plots/dedup_all_dumps_bad.png ADDED
plots/dedup_impact_simulation.png ADDED
plots/fineweb-recipe.png ADDED
plots/fineweb_ablations.png ADDED
plots/fineweb_all_filters.png ADDED
plots/minhash_parameters_comparison.png ADDED
plots/removed_data_cross_dedup.png ADDED
plots/score_by_dump.png ADDED
plots/wet_comparison.png ADDED