thomwolf (HF Staff) committed
Commit 01ef238
1 Parent(s): f545b17
Files changed (2)
  1. dist/index.html +3 -3
  2. src/index.html +2 -2
dist/index.html CHANGED
@@ -535,7 +535,7 @@
  </div>
  <p>These filters allowed us to further improve performance and to, notably, surpass the C4 dataset performance while providing a much larger dataset at the same time.</p>

- <h2>The final dataset</h2>
+ <h3>The final FineWeb dataset</h3>
  <p>The final <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">🍷 FineWeb</a> dataset comprises 15T tokens and
  includes the following previously mentioned steps, in order, each providing a performance boost on our group
  of benchmark tasks:</p>
@@ -555,7 +555,7 @@
  <figure><img src="assets/images/filtering_steps.png"/></figure>
  <div id="plot-all_filtering_steps"></div>
  </div>
- <h3>Comparisons with other web-scale datasets</h3>
+ <h4>Comparisons with other web-scale datasets</h4>
  <p>We compared <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">🍷 FineWeb</a> with the following datasets that are usually considered the highest quality web-scale datasets openly accessible (we also indicate for each the approximate number of tokens in the public version of the dataset):</p>
  <ul>
  <li><a
@@ -617,7 +617,7 @@
  <p>In terms of open-weight model to use for annotating the data, we experimented with several models including <a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1">Mixtral-8x7B-Instruct</a> and <a href="https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1">Mixtral-8x22B-Instruct</a>, <a href="https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct">Llama-3-70B-Instruct</a> as well as a jury gathering these three models<d-cite bibtex-key="verga2024replacing"></d-cite>. In our experimentations, we found that using Llama3 alone gave the most reliable results.</p>

  <h3>Training a classifier</h3>
- <p>To scale our annotation to the trillion tokens of FineWeb, we trained a classifier from the 450k annotation of our Llama3-70B model. The model we used was a <a href="https://huggingface.co/Snowflake/snowflake-arctic-embed-m">Snowflake-arctic-embed</a> embedding model with a classification head with a single regression output on top of it. We trained this model on 450,000 Llama 3 annotations for 20 epochs with a learning rate of 3e-4, freezing the embedding and encoder layers. We saved the checkpoint with the highest F1 score on our held-out validation set of 45k samples, treating Llama 3 annotations as ground-truth. After training, we rounded the scores to integers from <code>30</code> to <code>5</code>.</p>
+ <p>To scale our annotation to the trillion tokens of FineWeb, we trained a classifier from the 450k annotation of our Llama3-70B model. The model we used was a <a href="https://huggingface.co/Snowflake/snowflake-arctic-embed-m">Snowflake-arctic-embed</a> embedding model with a classification head with a single regression output on top of it. We trained this model on 450,000 Llama 3 annotations for 20 epochs with a learning rate of 3e-4, freezing the embedding and encoder layers. We saved the checkpoint with the highest F1 score on our held-out validation set of 45k samples, treating Llama 3 annotations as ground-truth. After training, we rounded the scores to integers from <code>0</code> to <code>5</code>.</p>
  <p>We then converted the problem to a binary classification task by using a fixed threshold to determine if a file is educational. With a threshold of <code>3</code>, the model achieved an F1 score of 82% on the validation set, indicating strong performance in distinguishing high-quality educational content.</p>
  <p>The classifier is available at: <a href="https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier">HuggingFaceFW/fineweb-edu-classifier</a>. The training and inference code is available on <a href="https://github.com/huggingface/cosmopedia/tree/main/classification">GitHub</a>.</p>
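The "Training a classifier" paragraph touched by the last hunk above describes the recipe in enough detail to sketch. The snippet below is a hypothetical reconstruction using the 🤗 Transformers Trainer, not the actual FineWeb-Edu training script (that code lives in the cosmopedia repository linked in the post): the annotation dataset name, its column names, and the batch size are placeholders, while the base embedding model, the single regression output, the frozen embedding and encoder layers, the 3e-4 learning rate, the 20 epochs, and the round-then-threshold-at-3 evaluation follow the text.

```python
# Hypothetical reconstruction of the classifier training described in the post;
# the real script is in the linked cosmopedia repo. Dataset name, column names
# and batch size are placeholders.
import numpy as np
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

base_model = "Snowflake/snowflake-arctic-embed-m"
tokenizer = AutoTokenizer.from_pretrained(base_model)
# num_labels=1 gives a single regression output predicting the 0-5 educational score
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=1)

# Freeze the embedding and encoder layers; only the classification head is trained
for param in model.base_model.parameters():
    param.requires_grad = False

# Placeholder dataset: a "text" column and a Llama-3 "score" column (0 to 5)
dataset = load_dataset("your-org/llama3-edu-annotations", split="train")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
)
dataset = dataset.map(lambda row: {"labels": float(row["score"])})  # MSE loss expects floats
splits = dataset.train_test_split(test_size=0.1, seed=0)  # held-out validation split


def compute_metrics(eval_pred):
    """Round the regression output to an integer score and report binary F1 at >= 3."""
    preds, labels = eval_pred
    preds = np.clip(np.round(preds.squeeze()), 0, 5)
    return {"f1": f1_score(labels >= 3, preds >= 3)}


trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="fineweb-edu-classifier",
        learning_rate=3e-4,                # from the post
        num_train_epochs=20,               # from the post
        per_device_train_batch_size=256,   # placeholder, not stated in the post
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,       # keep the checkpoint with the best F1
        metric_for_best_model="f1",
    ),
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    tokenizer=tokenizer,                   # enables dynamic padding via the default collator
    compute_metrics=compute_metrics,
)
trainer.train()
```

Selecting the checkpoint by binary F1 at the threshold of 3, rather than by regression loss, mirrors how the score is ultimately used: as an educational / not-educational filter.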
 
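The same hunk also points to the released HuggingFaceFW/fineweb-edu-classifier checkpoint. Below is a short usage sketch, assuming the standard sequence-classification loading path (see the model card for the canonical snippet); the example text is made up.

```python
# Usage sketch for the released classifier: score one document and apply the
# threshold-3 rule from the post. Refer to the model card for the official snippet.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "HuggingFaceFW/fineweb-edu-classifier"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

text = "Photosynthesis is the process by which plants turn light into chemical energy."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()  # continuous 0-5 educational score

print(f"score={score:.2f}, rounded={round(score)}, educational={round(score) >= 3}")
```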
src/index.html CHANGED
@@ -535,7 +535,7 @@
  </div>
  <p>These filters allowed us to further improve performance and to, notably, surpass the C4 dataset performance while providing a much larger dataset at the same time.</p>

- <h2>The final dataset</h2>
+ <h3>The final FineWeb dataset</h3>
  <p>The final <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">🍷 FineWeb</a> dataset comprises 15T tokens and
  includes the following previously mentioned steps, in order, each providing a performance boost on our group
  of benchmark tasks:</p>
@@ -555,7 +555,7 @@
  <figure><img src="assets/images/filtering_steps.png"/></figure>
  <div id="plot-all_filtering_steps"></div>
  </div>
- <h3>Comparisons with other web-scale datasets</h3>
+ <h4>Comparisons with other web-scale datasets</h4>
  <p>We compared <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">🍷 FineWeb</a> with the following datasets that are usually considered the highest quality web-scale datasets openly accessible (we also indicate for each the approximate number of tokens in the public version of the dataset):</p>
  <ul>
  <li><a