loubnabnl (HF Staff) committed
Commit fef941e · 1 Parent(s): 2f0ac6d

remove todos and add citation

Files changed (2)
  1. bibliography.bib +12 -0
  2. index.html +7 -8
bibliography.bib CHANGED
@@ -209,4 +209,16 @@ url = {https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
  eprint={2401.04088},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
 }
+@article{yuan2024self,
+ title={Self-rewarding language models},
+ author={Yuan, Weizhe and Pang, Richard Yuanzhe and Cho, Kyunghyun and Sukhbaatar, Sainbayar and Xu, Jing and Weston, Jason},
+ journal={arXiv preprint arXiv:2401.10020},
+ year={2024}
+}
+@article{verga2024replacing,
+ title={Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models},
+ author={Verga, Pat and Hofstatter, Sebastian and Althammer, Sophia and Su, Yixuan and Piktus, Aleksandra and Arkhangorodsky, Arkady and Xu, Minjie and White, Naomi and Lewis, Patrick},
+ journal={arXiv preprint arXiv:2404.18796},
+ year={2024}
+}
index.html CHANGED
@@ -685,18 +685,18 @@
  <p>However, these classifiers and filtered datasets are not publicly available. To enhance 🍷 FineWeb's quality, we developed an educational quality classifier using annotations generated by <a href="https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct">Llama3-70B-Instruct</a> to create 📚 FineWeb-Edu.</p>
  <h3>Annotation</h3>
  <p>We used Llama3-70B-Instruct to annotate 500k samples from the 🍷 FineWeb dataset, scoring each for their educational quality on a scale from 0 to 5.</p>
- <p>We explored various prompts and found that the additive scale by <a href="https://arxiv.org/pdf/2401.10020">Yuan et al.</a> worked best. This scale allows the LLM to reason about each additional point awarded, unlike the single-rating Likert scale which fits samples into predefined boxes. Then, to avoid the LLM favoring highly technical pages like arXiv abstracts and submissions, we focused on grade-school and middle-school level knowledge. By setting a threshold of 3 (on a scale of 0 to 5) during the filtering process, we were able to also retain some high-level educational pages.</p>
+ <p>We explored various prompts and found that the additive scale by Yuan et al.<d-cite bibtex-key="yuan2024self"></d-cite> worked best. This scale allows the LLM to reason about each additional point awarded, unlike the single-rating Likert scale which fits samples into predefined boxes. Then, to avoid the LLM favoring highly technical pages like arXiv abstracts and submissions, we focused on grade-school and middle-school level knowledge. By setting a threshold of 3 (on a scale of 0 to 5) during the filtering process, we were able to also retain some high-level educational pages.</p>
  <div style="text-align: center; margin: 20px 0;">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/61c141342aac764ce1654e43/fjZQ4izIj1rx1xQnBTKKr.png" alt="Prompt for LLM annotation" style="width: 90%; max-width: 800px; height: auto;">
  <figcaption style="font-style: italic; margin-top: 10px;">Prompt used for Llama3 annotations of the educational score, also available <a href="https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier/blob/main/utils/prompt.txt">here</a>.</figcaption>
  </div>
- <p>We also experimented with <a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1">Mixtral-8x7B-Instruct</a> and <a href="https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1">Mixtral-8x22B-Instruct</a> and a jury of all three models following <a href="https://arxiv.org/abs/2404.18796">Verga et al.</a>, but found that Llama3 alone gave the most reliable results.</p>
+ <p>We also experimented with <a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1">Mixtral-8x7B-Instruct</a> and <a href="https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1">Mixtral-8x22B-Instruct</a> and a jury of all three models<d-cite bibtex-key="verga2024replacing"></d-cite>, but found that Llama3 alone gave the most reliable results.</p>
  <h3>Classifier Training</h3>
- <p>We added a classification head with a single regression output to <a href="https://huggingface.co/Snowflake/snowflake-arctic-embed-m">Snowflake-arctic-embed</a> and trained it on 450,000 Llama3 annotations for 20 epochs with a learning rate of 3e-4, freezing the embedding and encoder layers. We saved the checkpoint with the highest F1 score on our held-out validation set of ~47k samples, treating Llama3 annotations as ground-truth. After training, we rounded the scores to integers from 0 to 5.</p>
+ <p>We added a classification head with a single regression output to <a href="https://huggingface.co/Snowflake/snowflake-arctic-embed-m">Snowflake-arctic-embed</a> and trained it on 450,000 Llama3 annotations for 20 epochs with a learning rate of 3e-4, freezing the embedding and encoder layers. We saved the checkpoint with the highest F1 score on our held-out validation set of 45k samples, treating Llama3 annotations as ground-truth. After training, we rounded the scores to integers from 0 to 5.</p>
  <p>We then converted the problem to a binary classification task by using a fixed threshold to determine if a file is educational. With a threshold of 3, the model achieved an F1 score of 82% on the validation set, indicating strong performance in distinguishing high-quality educational content.</p>
  <p>The classifier is available at: <a href="https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier">https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier</a>. The training and inference code is available on <a href="https://github.com/huggingface/cosmopedia/tree/main/classification">GitHub</a>.</p>
  <h3>Filtering and results</h3>
- <p>We applied the classifier to the 15T tokens of 🍷 FineWeb, a process that required 6,000 H100 GPU hours. We investigated the impact of using different thresholds for the filtering and found that threshold 3 gave the best results. The plot below shows the performance of each threshold compared to FineWeb on six different benchmarks; it uses a 1.82B model trained on 8B tokens.</p>
+ <p>We applied the classifier to the 15T tokens of 🍷 FineWeb, a process that required 6,000 H100 GPU hours. We investigated the impact of using different thresholds for the filtering and found that threshold 3 gave the best overall results. The plot below shows the performance of each threshold compared to FineWeb on six different benchmarks; it uses a 1.82B model trained on 8B tokens.</p>
  <div class="main-plot-container">
  <figure>
  <img src="plots/edu-8k.png">
@@ -713,12 +713,11 @@
  <p>Here are the key highlights of the ablation results above:</p>
  <ul>
  <li>📚 FineWeb-Edu surpasses 🍷 FineWeb and all other open web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA.</li>
- <li>It achieves the same performance with significantly less data, requiring 10x fewer tokens compared to C4 and Dolma1.7 to match MMLU results.</li>
+ <li>It achieves the same performance with significantly less data, requiring 10x fewer tokens compared to C4 and Dolma to match MMLU results.</li>
  <li>This demonstrates the effectiveness of using classifiers trained on LLM annotations for large-scale data filtering.</li>
  </ul>
- <p>Given that a threshold of 2 also demonstrated strong performance while retaining more data, we are releasing an additional dataset filtered with this threshold, containing 5.4 trillion tokens. Additionally, for research purposes, we are providing the dataset filtered with a threshold of 4 with 300 billion tokens.</p>
- <p>You can find the three datasets along with the classifier used for the filtering in this collection:TODO</p>
- <p><strong>TODO: add dataset links and a collection</strong></p>
+ <p>Given that a threshold of 2 also demonstrated strong performance while retaining more data, we are releasing an additional dataset filtered with this threshold, containing 5.4 trillion tokens, under <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-score-2">HuggingFaceFW/fineweb-edu-score-2</a>.</p>
+ <p>You can find the two datasets along with the classifier used for the filtering in this <a href="https://huggingface.co/collections/HuggingFaceFW/fineweb-edu-6659c3f3d399d0e1d648adfd">collection</a>.</p>
  <h2>Next steps</h2>
  <p>We want to continue improving FineWeb and will also
  release a technical report with more details soon.</p>
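
The annotation step described in the diff (scoring 500k 🍷 FineWeb samples with Llama3-70B-Instruct on the 0-5 additive scale) can be sketched roughly as follows. This is a hypothetical reconstruction, not the team's pipeline: the `annotate` helper, the prompt templating, the score-parsing regex, and the use of huggingface_hub's `InferenceClient` are all illustrative assumptions; the real prompt is linked in the figcaption above.

```python
# Hypothetical sketch: score one web extract for educational quality (0-5)
# with Llama3-70B-Instruct, using an additive-scale prompt like the one
# linked above. Parsing and templating details are assumptions.
import re
from huggingface_hub import InferenceClient

client = InferenceClient("meta-llama/Meta-Llama-3-70B-Instruct")

# Placeholder; the real prompt lives at
# HuggingFaceFW/fineweb-edu-classifier/blob/main/utils/prompt.txt
PROMPT = "Evaluate the educational value of the extract below ... {extract}"

def annotate(extract: str) -> int | None:
    """Ask the LLM for an additive 0-5 score and parse it from the reply."""
    reply = client.chat_completion(
        messages=[{"role": "user", "content": PROMPT.format(extract=extract)}],
        max_tokens=512,
    )
    text = reply.choices[0].message.content
    # Assumes the prompt asks the model to end with "Educational score: <n>".
    match = re.search(r"Educational score:\s*([0-5])", text)
    return int(match.group(1)) if match else None
```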
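The classifier-training paragraph is similarly concrete: a single-regression-output head on Snowflake-arctic-embed with the embedding and encoder layers frozen, trained at a learning rate of 3e-4 against the Llama3 scores. Here is a minimal sketch under those constraints; the AdamW choice, the MSE objective implied by `problem_type="regression"`, and the `train_step` helper are assumptions, and the authors' actual code is in the cosmopedia GitHub repository linked in the diff.

```python
# Sketch of the FineWeb-Edu classifier training setup (assumptions noted above).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "Snowflake/snowflake-arctic-embed-m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=1,               # single regression output, as in the post
    problem_type="regression",  # MSE loss against the 0-5 Llama3 scores
)

# Freeze the embedding and encoder layers; only the classification head trains.
for param in model.base_model.parameters():
    param.requires_grad = False

# Optimizer choice is an assumption; the post only specifies lr=3e-4.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=3e-4
)

def train_step(texts: list[str], scores: list[float]) -> float:
    """One illustrative gradient step on a batch of (text, Llama3 score) pairs."""
    batch = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")
    labels = torch.tensor(scores, dtype=torch.float)
    out = model(**batch, labels=labels)  # regression head -> MSE loss
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```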
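Finally, the fixed-threshold binarization and filtering can be sketched with the released HuggingFaceFW/fineweb-edu-classifier checkpoint, assuming it loads as a standard sequence-classification model with one regression output. The sample documents and the toy F1 check below are made up for illustration; the real evaluation used the 45k-sample held-out set of Llama3 annotations described in the diff.

```python
# Score documents with the released classifier, round to an integer 0-5 score,
# and keep those at or above the educational threshold of 3.
import torch
from sklearn.metrics import f1_score
from transformers import AutoTokenizer, AutoModelForSequenceClassification

repo = "HuggingFaceFW/fineweb-edu-classifier"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo)

def edu_score(text: str) -> float:
    inputs = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).logits.squeeze().item()

THRESHOLD = 3
docs = [
    "Photosynthesis is the process by which plants convert light into energy...",
    "LIMITED TIME OFFER!!! Click here to win a free cruise.",
]
kept = [d for d in docs if round(edu_score(d)) >= THRESHOLD]

# Toy version of the validation in the post: binarize the Llama3 ground-truth
# scores and the classifier's rounded predictions at 3, then compute F1.
# These values are made up for illustration.
llama3_scores = [4, 1, 3, 0, 5, 2]
predictions = [3.6, 1.2, 2.8, 0.4, 4.7, 3.1]
y_true = [int(s >= THRESHOLD) for s in llama3_scores]
y_pred = [int(round(p) >= THRESHOLD) for p in predictions]
print(f1_score(y_true, y_pred))
```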