thomwolf (HF Staff) committed
Commit 01ef238
1 Parent(s): f545b17
Files changed (2)
  1. dist/index.html +3 -3
  2. src/index.html +2 -2
dist/index.html CHANGED
@@ -535,7 +535,7 @@
  </div>
  <p>These filters allowed us to further improve performance and to, notably, surpass the C4 dataset performance while providing a much larger dataset at the same time.</p>

- <h2>The final dataset</h2>
+ <h3>The final FineWeb dataset</h3>
  <p>The final <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">🍷 FineWeb</a> dataset comprises 15T tokens and
  includes the following previously mentioned steps, in order, each providing a performance boost on our group
  of benchmark tasks:</p>
@@ -555,7 +555,7 @@
  <figure><img src="assets/images/filtering_steps.png"/></figure>
  <div id="plot-all_filtering_steps"></div>
  </div>
- <h3>Comparisons with other web-scale datasets</h3>
+ <h4>Comparisons with other web-scale datasets</h4>
  <p>We compared <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">🍷 FineWeb</a> with the following datasets that are usually considered the highest quality web-scale datasets openly accessible (we also indicate for each the approximate number of tokens in the public version of the dataset):</p>
  <ul>
  <li><a
@@ -617,7 +617,7 @@
  <p>In terms of open-weight model to use for annotating the data, we experimented with several models including <a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1">Mixtral-8x7B-Instruct</a> and <a href="https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1">Mixtral-8x22B-Instruct</a>, <a href="https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct">Llama-3-70B-Instruct</a> as well as a jury gathering these three models<d-cite bibtex-key="verga2024replacing"></d-cite>. In our experimentations, we found that using Llama3 alone gave the most reliable results.</p>

  <h3>Training a classifier</h3>
- <p>To scale our annotation to the trillion tokens of FineWeb, we trained a classifier from the 450k annotation of our Llama3-70B model. The model we used was a <a href="https://huggingface.co/Snowflake/snowflake-arctic-embed-m">Snowflake-arctic-embed</a> embedding model with a classification head with a single regression output on top of it. We trained this model on 450,000 Llama 3 annotations for 20 epochs with a learning rate of 3e-4, freezing the embedding and encoder layers. We saved the checkpoint with the highest F1 score on our held-out validation set of 45k samples, treating Llama 3 annotations as ground-truth. After training, we rounded the scores to integers from <code>30</code> to <code>5</code>.</p>
+ <p>To scale our annotation to the trillion tokens of FineWeb, we trained a classifier from the 450k annotation of our Llama3-70B model. The model we used was a <a href="https://huggingface.co/Snowflake/snowflake-arctic-embed-m">Snowflake-arctic-embed</a> embedding model with a classification head with a single regression output on top of it. We trained this model on 450,000 Llama 3 annotations for 20 epochs with a learning rate of 3e-4, freezing the embedding and encoder layers. We saved the checkpoint with the highest F1 score on our held-out validation set of 45k samples, treating Llama 3 annotations as ground-truth. After training, we rounded the scores to integers from <code>0</code> to <code>5</code>.</p>
  <p>We then converted the problem to a binary classification task by using a fixed threshold to determine if a file is educational. With a threshold of <code>3</code>, the model achieved an F1 score of 82% on the validation set, indicating strong performance in distinguishing high-quality educational content.</p>
  <p>The classifier is available at: <a href="https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier">HuggingFaceFW/fineweb-edu-classifier</a>. The training and inference code is available on <a href="https://github.com/huggingface/cosmopedia/tree/main/classification">GitHub</a>.</p>
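The "Training a classifier" paragraph touched by the last hunk above describes the recipe in enough detail to sketch. The snippet below is a hypothetical reconstruction using the 🤗 Transformers Trainer, not the actual FineWeb-Edu training script (that code lives in the cosmopedia repository linked in the post): the annotation dataset name, its column names, and the batch size are placeholders, while the base embedding model, the single regression output, the frozen embedding and encoder layers, the 3e-4 learning rate, the 20 epochs, and the round-then-threshold-at-3 evaluation follow the text.

```python
# Hypothetical reconstruction of the classifier training described in the post;
# the real script is in the linked cosmopedia repo. Dataset name, column names
# and batch size are placeholders.
import numpy as np
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

base_model = "Snowflake/snowflake-arctic-embed-m"
tokenizer = AutoTokenizer.from_pretrained(base_model)
# num_labels=1 gives a single regression output predicting the 0-5 educational score
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=1)

# Freeze the embedding and encoder layers; only the classification head is trained
for param in model.base_model.parameters():
    param.requires_grad = False

# Placeholder dataset: a "text" column and a Llama-3 "score" column (0 to 5)
dataset = load_dataset("your-org/llama3-edu-annotations", split="train")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
)
dataset = dataset.map(lambda row: {"labels": float(row["score"])})  # MSE loss expects floats
splits = dataset.train_test_split(test_size=0.1, seed=0)  # held-out validation split


def compute_metrics(eval_pred):
    """Round the regression output to an integer score and report binary F1 at >= 3."""
    preds, labels = eval_pred
    preds = np.clip(np.round(preds.squeeze()), 0, 5)
    return {"f1": f1_score(labels >= 3, preds >= 3)}


trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="fineweb-edu-classifier",
        learning_rate=3e-4,                # from the post
        num_train_epochs=20,               # from the post
        per_device_train_batch_size=256,   # placeholder, not stated in the post
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,       # keep the checkpoint with the best F1
        metric_for_best_model="f1",
    ),
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    tokenizer=tokenizer,                   # enables dynamic padding via the default collator
    compute_metrics=compute_metrics,
)
trainer.train()
```

Selecting the checkpoint by binary F1 at the threshold of 3, rather than by regression loss, mirrors how the score is ultimately used: as an educational / not-educational filter.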
 
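The same hunk also points to the released HuggingFaceFW/fineweb-edu-classifier checkpoint. Below is a short usage sketch, assuming the standard sequence-classification loading path (see the model card for the canonical snippet); the example text is made up.

```python
# Usage sketch for the released classifier: score one document and apply the
# threshold-3 rule from the post. Refer to the model card for the official snippet.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "HuggingFaceFW/fineweb-edu-classifier"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

text = "Photosynthesis is the process by which plants turn light into chemical energy."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()  # continuous 0-5 educational score

print(f"score={score:.2f}, rounded={round(score)}, educational={round(score) >= 3}")
```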
src/index.html CHANGED
@@ -535,7 +535,7 @@
  </div>
  <p>These filters allowed us to further improve performance and to, notably, surpass the C4 dataset performance while providing a much larger dataset at the same time.</p>

- <h2>The final dataset</h2>
+ <h3>The final FineWeb dataset</h3>
  <p>The final <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">🍷 FineWeb</a> dataset comprises 15T tokens and
  includes the following previously mentioned steps, in order, each providing a performance boost on our group
  of benchmark tasks:</p>
@@ -555,7 +555,7 @@
  <figure><img src="assets/images/filtering_steps.png"/></figure>
  <div id="plot-all_filtering_steps"></div>
  </div>
- <h3>Comparisons with other web-scale datasets</h3>
+ <h4>Comparisons with other web-scale datasets</h4>
  <p>We compared <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">🍷 FineWeb</a> with the following datasets that are usually considered the highest quality web-scale datasets openly accessible (we also indicate for each the approximate number of tokens in the public version of the dataset):</p>
  <ul>
  <li><a