final pass over the text
- dist/index.html +21 -21
- src/index.html +21 -21
dist/index.html
CHANGED
@@ -265,9 +265,9 @@
 fuzzy hash based deduplication technique that scales efficiently to many CPU-nodes and allows us to tune similarity thresholds (by controlling the number and size of buckets) as well as the length of the subsequences considered (by controlling the n-gram size). We chose to collect each document's 5-grams<d-footnote>Our units are "words", computed in the <a href="https://github.com/huggingface/datatrove/blob/e9963f69f1fbab1a61339bd1b497f6e138b9f47f/src/datatrove/pipeline/dedup/minhash.py#L196">MinHash processing function</a> with a <a href="https://github.com/huggingface/datatrove/blob/e9963f69f1fbab1a61339bd1b497f6e138b9f47f/src/datatrove/utils/word_tokenizers.py#L323">language-specific word tokenizer</a>.</d-footnote> and compute minhashes using
 112 hash functions in total, split into 14 buckets of 8 hashes each – targeting documents that are at least
 75% similar. Documents with the same 8 minhashes in any bucket are considered a duplicate of each other.</p>
-<p>This would mean that for two documents with a similarity (
+<p>This would mean that for two documents with a similarity (s)
 of 0.7, 0.75, 0.8 and 0.85, the probability that they would be identified as duplicates would be 56%, 77%,
-92% and 98.8% respectively (
+92% and 98.8% respectively (1-(1-s^8)^{14}). See the plot below for a match probability
 comparison between our setup with 112 hashes and the one from RefinedWeb, with 9000 hashes, divided into 450
 buckets of 20 hashes (that requires a substantially larger amount of compute resources, as each individual hash must be computed, stored and then compared with hashes from other documents):</p>
 <div class="main-plot-container">
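Aside (not part of the commit): a quick sanity check of the match probability formula quoted in the hunk above, P(match) = 1 - (1 - s^r)^b, for the FineWeb configuration (14 buckets of 8 hashes) and the RefinedWeb configuration (450 buckets of 20 hashes) mentioned in the text.

```python
# Illustrative sketch: probability that two documents with MinHash similarity s
# share all hashes in at least one bucket (the LSH banding bound referenced above).

def match_probability(s: float, hashes_per_bucket: int, num_buckets: int) -> float:
    """P(match) = 1 - (1 - s^r)^b, with r hashes per bucket and b buckets."""
    return 1.0 - (1.0 - s ** hashes_per_bucket) ** num_buckets

if __name__ == "__main__":
    for s in (0.70, 0.75, 0.80, 0.85):
        fineweb = match_probability(s, hashes_per_bucket=8, num_buckets=14)       # 112 hashes total
        refinedweb = match_probability(s, hashes_per_bucket=20, num_buckets=450)  # 9000 hashes total
        print(f"s={s:.2f}  FineWeb: {fineweb:6.1%}  RefinedWeb: {refinedweb:6.1%}")
    # The FineWeb column prints roughly 56%, 77%, 92% and 98.8%, matching the text.
```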
@@ -280,22 +280,22 @@
 <p>It should also be noted that intra-document deduplication is already handled by our repetition filter, which removes documents with many repeated lines and paragraphs.</p>

 <h4>More deduplication is always better, right?</h4>
-<p>
+<p>Initially, we were operating under the assumption that <em>more deduplication is always better</em>, so our first approach was to take the entire dataset (all
 90+ dumps) and deduplicate them together as one big dataset using MinHash.</p>
 <p>We did this in an iterative manner: starting with the most
 recent dump (which at the time was 2023-50) and proceeding chronologically until we reached the oldest crawl. We deduplicated each dump
 not only within itself, but removing any document matching any other documents in the previously processed
 dumps. </p>
 <p>For instance, for the second most recent dump (2023-40 at
-the time), we deduplicated it against the most recent one in addition to within itself. As a result, the older the dumps, the
+the time), we deduplicated it against the most recent one in addition to within itself. As a result, the older the dumps, the larger the number of dumps it was deduplicated against and the more data we removed from it (indeed, in the oldest dumps, the deduplication step removed more than 90% of the base filtered data).</p>
 <p>Deduplicating the dataset in this manner resulted in 4
 trillion tokens of data, but, quite surprisingly to us, when training on a randomly sampled 350 billion
-tokens subset, our ablation models showed no improvement over a model trained on the non deduplicated data, scoring far below its predecessor RefinedWeb on our aggregate of tasks (see graph below).</p>
+tokens subset, our ablation models showed next to no improvement over a model trained on the non deduplicated data, scoring far below its predecessor RefinedWeb on our aggregate of tasks (see graph below).</p>
 <div class="main-plot-container">
 <figure><img src="assets/images/dedup_all_dumps_bad.png"/></figure>
 <div id="plot-all_dumps_bad"></div>
 </div>
-<p>This
+<p>This challenged our assumption that more deduplication would inevitably result in higher benchmark scores, so we decided to take a closer look at one of the oldest dumps, dump 2013-48:</p>
 <ul>
 <li>pre deduplication, this dump had ~490 billion tokens</li>
 </ul>
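Aside (not part of the commit): a toy sketch of the iterative cross-dump scheme described in the hunk above, where each dump is deduplicated against itself and against every previously processed (more recent) dump. This is a brute-force stand-in for datatrove's optimized MinHash pipeline, not its actual API; the hashing and bucketing here are only meant to illustrate the control flow.

```python
import hashlib
from typing import Iterable, Iterator, List, Set, Tuple

def _ngrams(text: str, n: int = 5) -> List[str]:
    # Word 5-grams, mirroring the unit of comparison described in the text.
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))]

def minhash_bucket_keys(text: str, num_buckets: int = 14, hashes_per_bucket: int = 8) -> Set[Tuple]:
    """One key per bucket: the tuple of that bucket's 8 minhashes over the document's 5-grams."""
    grams = _ngrams(text)
    keys = set()
    for b in range(num_buckets):
        minhashes = []
        for h in range(hashes_per_bucket):
            seed = f"{b}:{h}:".encode()  # a distinct (slow) hash function per bucket slot
            minhashes.append(min(int(hashlib.md5(seed + g.encode()).hexdigest(), 16) for g in grams))
        keys.add((b, tuple(minhashes)))
    return keys

def dedup_across_dumps(dumps_newest_first: Iterable[Iterable[str]]) -> Iterator[List[str]]:
    seen: Set[Tuple] = set()             # bucket keys from all previously processed dumps
    for dump in dumps_newest_first:      # 2023-50 first, then 2023-40, ...
        kept = []
        for doc in dump:
            keys = minhash_bucket_keys(doc)
            if keys & seen:              # shares a full bucket with an earlier document
                continue                 # -> treated as a duplicate and dropped
            seen |= keys
            kept.append(doc)
        yield kept                       # older dumps are matched against more data, so more is removed
```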
@@ -326,7 +326,7 @@
 removed<d-footnote>Note that these ablation models are trained only on data from this dump so it's considered independently of all the other dumps.</d-footnote>. This is also confirmed by visual inspection: <em>originally kept
 data</em> contains far more ads, lists of keywords and generally badly formatted text than <em>originally removed data</em>.</p>
 <h4>Taking a step back: individual dump dedup</h4>
-<p>We decided to
+<p>We decided to experiment with an alternative approach: we deduplicated
 each dump with MinHash individually (independently of the other dumps). This resulted in 20 trillion
 tokens of data.</p>
 <p>When training on a random sample from this dataset we see
@@ -362,7 +362,7 @@
 </ul>
 <ul>
 <li>each dump has been perfectly individually deduplicated (every single
-document
+document is unique in this dump)
 </li>
 </ul>
 <ul>
@@ -399,8 +399,8 @@
 removed.</p>

 <h4>Other (failed) global approaches</h4>
-<p>To build on top of our newly found method (independently deduplicating each dump). We attempted to
-independently minhash deduped 20 trillion tokens of data (
+<p>To build on top of our newly found method (independently deduplicating each dump). We attempted to improve the performance by further deduplicating the
+independently minhash deduped 20 trillion tokens of data with alternative global (over all dumps) deduplication methods. We explored the following approaches:</p>
 <ul>
 <li>URL deduplication, where we only kept one document per normalized
 (lowercased) URL (71.5% of tokens removed, 5.6T left) – <em>FineWeb URL dedup</em></li>
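Aside (not part of the commit): a minimal sketch of the "FineWeb URL dedup" variant listed in the hunk above, keeping only the first document seen for each normalized (lowercased) URL. The `url` field name is an assumption for illustration; this is not the datatrove implementation.

```python
from typing import Dict, Iterable, Iterator

def url_dedup(docs: Iterable[Dict]) -> Iterator[Dict]:
    """Keep one document per normalized URL, dropping all later documents with the same URL."""
    seen_urls = set()
    for doc in docs:                      # each doc is assumed to carry a "url" metadata field
        url = doc["url"].strip().lower()  # normalization here is just lowercasing, per the text
        if url in seen_urls:
            continue                      # a document with this URL was already kept
        seen_urls.add(url)
        yield doc
```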
@@ -431,7 +431,7 @@
 <div id="plot-dedup_attempts"></div>
 </div>

-<h3>
+<h3>Additional quality filtering</h3>
 <p>By this point we had reached the same performance of the previous work we attempted to reproduce and extend:
 RefinedWeb, using our base filtering and independent MinHash. Still, on our aggregate of tasks, another heavily filtered dataset, the C4 dataset<d-cite bibtex-key="raffel2023exploring"></d-cite>, still showed stronger performances on some benchmarks of our evaluation suite.</p>
 <p>We therefore set out to find new filtering steps that
@@ -454,7 +454,7 @@
 <ul>
 <li>applying “All filters” (drop lines not ending on punctuation marks,
 mentioning javascript and cookie notices + drop documents outside length thresholds, containing “lorem
-ipsum” or a curly bracket, <code>{</code>) allows us to match C4’s HellaSwag performance ("All
+ipsum” or a curly bracket, <code>{</code>) allows us to match C4’s HellaSwag performance ("All filters" vs "C4" curves, respectively).
 </li>
 </ul>
 <ul>
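Aside (not part of the commit): a hedged sketch of the C4-style "All filters" summarized in the hunk above. The terminal punctuation set, length thresholds and keyword list are placeholder assumptions for illustration, not the values used in the FineWeb/datatrove pipeline.

```python
TERMINAL_PUNCTUATION = (".", "!", "?", '"', "'")
MIN_WORDS, MAX_WORDS = 50, 100_000  # placeholder document length thresholds

def apply_all_filters(text: str) -> str | None:
    """Return the filtered document text, or None if the whole document is dropped."""
    lower = text.lower()
    if "lorem ipsum" in lower or "{" in text:
        return None                                   # placeholder / code-like page, drop document
    kept_lines = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped.endswith(TERMINAL_PUNCTUATION):
            continue                                  # drop lines not ending on punctuation
        if "javascript" in stripped.lower() or "cookie" in stripped.lower():
            continue                                  # drop javascript / cookie-notice lines
        kept_lines.append(line)
    filtered = "\n".join(kept_lines)
    if not (MIN_WORDS <= len(filtered.split()) <= MAX_WORDS):
        return None                                   # document outside length thresholds
    return filtered
```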
@@ -535,7 +535,7 @@
 </div>
 <p>These filters allowed us to further improve performance and to, notably, surpass the C4 dataset performance while providing a much larger dataset at the same time.</p>

-<h3>The final FineWeb dataset</h3>
+<h3>The final 🍷 FineWeb dataset</h3>
 <p>The final <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">🍷 FineWeb</a> dataset comprises 15T tokens and
 includes the following previously mentioned steps, in order, each providing a performance boost on our group
 of benchmark tasks:</p>
@@ -556,7 +556,7 @@
 <div id="plot-all_filtering_steps"></div>
 </div>
 <h4>Comparisons with other web-scale datasets</h4>
-<p>We compared <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">🍷 FineWeb</a> with the following datasets that are usually considered the highest quality web-scale datasets
+<p>We compared <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">🍷 FineWeb</a> with the following datasets that are usually considered the highest quality openly accessible web-scale datasets (we also indicate for each the approximate number of tokens in the public version of the dataset):</p>
 <ul>
 <li><a
 href="https://huggingface.co/datasets/tiiuae/falcon-refinedweb">RefinedWeb</a> (500B tokens)<d-cite bibtex-key="penedo2023refinedweb"></d-cite>
@@ -597,7 +597,7 @@
 <figure><img src="assets/images/dataset_ablations.png"/></figure>
 <div id="plot-dataset_ablations"></div>
 </div>
-<p>🍷 FineWeb is thus –
+<p>🍷 FineWeb is thus – to the best of our knowledge – the open dataset leading to the current highest model performances while allowing to train on several trillion tokens.</p>

 <h2>📚 FineWeb-Edu</h2>

@@ -605,7 +605,7 @@
 <img src="assets/images/dataset_comparisons_agg_fw_edu.png"/>
 <figcaption style="font-style: italic; margin-top: 10px;">📚 FineWeb-Edu outperforms 🍷 FineWeb and all other open web datasets on our group of evaluation tasks.</figcaption>
 </figure>
-<p><a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu">📚 FineWeb-Edu</a> is an additional
+<p><a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu">📚 FineWeb-Edu</a> is an additional development of FineWeb that we are excited to introduce in this tech report and openly release. 📚 FineWeb-Edu is based on a new approach that has recently emerged for filtering LLM training datasets: using synthetic data to develop classifiers for identifying educational content. This technique was notably used in the trainings of Llama 3<d-cite bibtex-key="llama3modelcard"></d-cite> and Phi3<d-cite bibtex-key="abdin2024phi"></d-cite>, but its large-scale impact on web data filtering has, in our opinion, thur far not been publicly explored to its full potential.</p>
 <p>The popular Phi3 models were trained on 3.3 and 4.8 trillion tokens, with the paper<d-cite bibtex-key="abdin2024phi"></d-cite> stating:</p>
 <blockquote>Our training data consists of heavily filtered publicly available web data (according to the 'educational level') from various open internet sources, as well as synthetic LLM-generated data.</blockquote>
 <p>Similarly, Llama 3 blog post<d-cite bibtex-key="meta2024responsible"></d-cite> notes:</p>
@@ -614,15 +614,15 @@

 <h3>Annotating for educational quality at scale</h3>
 <p>We used <a href="https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct">Llama-3-70B-Instruct</a> to annotate 500k samples from 🍷 FineWeb, scoring each for their educational quality on a scale from 0 to 5.</p>
-<p>We explored various prompt
+<p>We explored various prompt formats to automatically extract an educational score using an LLM and found that the additive scale by Yuan et al.<d-cite bibtex-key="yuan2024self"></d-cite> worked best. This scale allows the LLM to reason about each additional point awarded, unlike the single-rating Likert scale which fits samples into predefined boxes. Then, to avoid the LLM favoring highly technical pages like arXiv abstracts and submissions, we focused on grade-school and middle-school level knowledge. By setting a threshold of 3 (on a scale of 0 to 5) during the filtering process, we were able to also retain some high-level educational pages.</p>
 <div style="text-align: center; margin: 20px 0;">
 <img src="https://cdn-uploads.huggingface.co/production/uploads/61c141342aac764ce1654e43/fjZQ4izIj1rx1xQnBTKKr.png" alt="Prompt for LLM annotation" style="width: 90%; max-width: 800px; height: auto;">
 <figcaption style="font-style: italic; margin-top: 10px;">Prompt used for Llama3 annotations of the educational score, also available <a href="https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier/blob/main/utils/prompt.txt">here</a>.</figcaption>
 </div>
-<p>In terms of open-weight
+<p>In terms of open-weight models to use for annotating the data, we experimented with several models including <a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1">Mixtral-8x7B-Instruct</a> and <a href="https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1">Mixtral-8x22B-Instruct</a>, <a href="https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct">Llama-3-70B-Instruct</a> as well as a jury gathering the scores from these three models<d-cite bibtex-key="verga2024replacing"></d-cite>. In our experiments we found that using Llama3 alone gave the most reliable results.</p>

 <h3>Training a classifier</h3>
-<p>To scale our
+<p>To scale our annotations to the trillions of tokens in FineWeb, we used the Llama3-70B annotations to train a small classifier. The model we used was a <a href="https://huggingface.co/Snowflake/snowflake-arctic-embed-m">Snowflake-arctic-embed</a> embedding model with a classification head with a single regression output on top of it. We trained this model on the 450,000 Llama 3 annotations for 20 epochs with a learning rate of 3e-4, freezing the embedding and encoder layers. We saved the checkpoint with the highest F1 score on our held-out validation set of 45k samples, treating Llama 3 annotations as ground-truth. After training, we rounded the scores to integers from <code>0</code> to <code>5</code>.</p>
 <p>We then converted the problem to a binary classification task by using a fixed threshold to determine if a file is educational. With a threshold of <code>3</code>, the model achieved an F1 score of 82% on the validation set, indicating strong performance in distinguishing high-quality educational content.</p>
 <p>The classifier is available at: <a href="https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier">HuggingFaceFW/fineweb-edu-classifier</a>. The training and inference code is available on <a href="https://github.com/huggingface/cosmopedia/tree/main/classification">GitHub</a>.</p>

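Aside (not part of the commit): a rough sketch of the classifier setup described in the hunk above (a single regression output on top of a frozen Snowflake-arctic-embed encoder, with scores later thresholded at 3). Hyperparameters are taken from the text where given; everything else (batching, helper names) is a placeholder, and the released training code in the cosmopedia repo linked above is the reference implementation.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "Snowflake/snowflake-arctic-embed-m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=1, problem_type="regression"  # single regression output, MSE loss
)
for param in model.base_model.parameters():  # freeze the embedding and encoder layers
    param.requires_grad = False

optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=3e-4)

def regression_step(texts: list[str], llama_scores: list[float]) -> float:
    """One training step against the Llama-3 educational scores (0 to 5)."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    labels = torch.tensor(llama_scores, dtype=torch.float)
    out = model(**batch, labels=labels)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

@torch.no_grad()
def predict_int_score(text: str) -> int:
    batch = tokenizer(text, truncation=True, return_tensors="pt")
    score = model(**batch).logits.squeeze().item()
    return int(min(max(round(score), 0), 5))  # round and clip to the 0-5 scale

def is_educational(text: str, threshold: int = 3) -> bool:
    return predict_int_score(text) >= threshold
```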
@@ -692,8 +692,8 @@
 <p>We expect to continue seeing increasing quantities of synthetic data on new CC crawls. However, while for relatively small trainings this data does not seem to harm performance (and might actually improve it), it is not clear that this holds for much larger trainings.</p>

 <h2>Conclusion and looking forward</h2>
-<p>Through our open science efforts we hope to
-<p>In
+<p>Through our open science efforts we hope to keep shining a light on the black box that is the training of high performance large language models as well as to give every model trainer the ability to create state-of-the-art LLMs. We are excited to continue iterating on FineWeb and to release increasingly better filtered subsets of web data, in a fully open and reproducible manner.</p>
+<p>In the short term, we are looking forward to applying the learnings from (English) FineWeb to other languages. While English currently dominates the LLM landscape, we believe that making high quality web data in other languages as accessible as possible would be incredibly impactful.</p>
 <p>In a nutshell: the future is bright and exciting for studying the science of creating datasets at scale and in the open 🤗.</p>
 </d-article>

src/index.html
CHANGED
(The changes to src/index.html are identical to those shown above for dist/index.html.)