remove todos and add citation
- bibliography.bib +12 -0
- index.html +7 -8
bibliography.bib
CHANGED
@@ -209,4 +209,16 @@ url = {https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
   eprint={2401.04088},
   archivePrefix={arXiv},
   primaryClass={cs.LG}
 }
+@article{yuan2024self,
+  title={Self-rewarding language models},
+  author={Yuan, Weizhe and Pang, Richard Yuanzhe and Cho, Kyunghyun and Sukhbaatar, Sainbayar and Xu, Jing and Weston, Jason},
+  journal={arXiv preprint arXiv:2401.10020},
+  year={2024}
+}
+@article{verga2024replacing,
+  title={Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models},
+  author={Verga, Pat and Hofstatter, Sebastian and Althammer, Sophia and Su, Yixuan and Piktus, Aleksandra and Arkhangorodsky, Arkady and Xu, Minjie and White, Naomi and Lewis, Patrick},
+  journal={arXiv preprint arXiv:2404.18796},
+  year={2024}
+}
index.html
CHANGED
@@ -685,18 +685,18 @@
 <p>However, these classifiers and filtered datasets are not publicly available. To enhance 🍷 FineWeb's quality, we developed an educational quality classifier using annotations generated by <a href="https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct">Llama3-70B-Instruct</a> to create 📚 FineWeb-Edu.</p>
 <h3>Annotation</h3>
 <p>We used Llama3-70B-Instruct to annotate 500k samples from the 🍷 FineWeb dataset, scoring each for their educational quality on a scale from 0 to 5.</p>
-<p>We explored various prompts and found that the additive scale by
+<p>We explored various prompts and found that the additive scale by Yuan et al.<d-cite bibtex-key="yuan2024self"></d-cite> worked best. This scale allows the LLM to reason about each additional point awarded, unlike the single-rating Likert scale which fits samples into predefined boxes. Then, to avoid the LLM favoring highly technical pages like arXiv abstracts and submissions, we focused on grade-school and middle-school level knowledge. By setting a threshold of 3 (on a scale of 0 to 5) during the filtering process, we were able to also retain some high-level educational pages.</p>
 <div style="text-align: center; margin: 20px 0;">
 <img src="https://cdn-uploads.huggingface.co/production/uploads/61c141342aac764ce1654e43/fjZQ4izIj1rx1xQnBTKKr.png" alt="Prompt for LLM annotation" style="width: 90%; max-width: 800px; height: auto;">
 <figcaption style="font-style: italic; margin-top: 10px;">Prompt used for Llama3 annotations of the educational score, also available <a href="https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier/blob/main/utils/prompt.txt">here</a>.</figcaption>
 </div>
-<p>We also experimented with <a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1">Mixtral-8x-7B-Instruct</a> and <a href="https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1">Mixtral-8x22B-Instruct</a> and a jury of all three models
+<p>We also experimented with <a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1">Mixtral-8x-7B-Instruct</a> and <a href="https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1">Mixtral-8x22B-Instruct</a> and a jury of all three models<d-cite bibtex-key="verga2024replacing"></d-cite> but found that Llama3 alone gave the most reliable results.</p>
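One plausible reading of the jury experiment in the added line above, following the panel-of-judges idea, is to score each sample with several models and combine their verdicts. The aggregation rule below (plain mean of the per-model scores, then rounding) is an assumption for illustration only; the post does not spell out the scheme it tried, and in the end Llama3 alone was kept.

```python
# Hypothetical jury aggregation: average integer scores from a panel of
# judge models into a single score. The mean-then-round rule is an
# illustrative assumption, not the scheme the authors describe.

def jury_score(scores_by_model):
    """scores_by_model: mapping of model name -> integer score for one sample."""
    values = list(scores_by_model.values())
    return round(sum(values) / len(values))
```

A usage sketch: `jury_score({"llama3": 3, "mixtral-8x7b": 2, "mixtral-8x22b": 4})` averages the three verdicts to a single score of 3.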
 <h3>Classifier Training</h3>
-<p>We added a classification head with a single regression output to <a href="https://huggingface.co/Snowflake/snowflake-arctic-embed-m">Snowflake-arctic-embed</a> and trained it on 450,000 Llama3 annotations for 20 epochs with a learning rate of 3e-4, freezing the embedding and encoder layers. We saved the checkpoint with the highest F1 score on our held-out validation set of
+<p>We added a classification head with a single regression output to <a href="https://huggingface.co/Snowflake/snowflake-arctic-embed-m">Snowflake-arctic-embed</a> and trained it on 450,000 Llama3 annotations for 20 epochs with a learning rate of 3e-4, freezing the embedding and encoder layers. We saved the checkpoint with the highest F1 score on our held-out validation set of 45k samples, treating Llama3 annotations as ground-truth. After training, we rounded the scores to integers from 0 to 5.</p>
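To make the frozen-encoder-plus-regression-head setup in the paragraph above concrete, here is a deliberately tiny sketch in plain Python. Everything in it (the encoder, the data, the learning rate, and the epoch count) is an invented stand-in for illustration; the actual classifier fine-tunes a head on Snowflake-arctic-embed embeddings with the hyperparameters stated above.

```python
# Toy sketch: a frozen "encoder" whose outputs feed a single trainable
# regression head, with the head fit by SGD on squared error. Data,
# features, and hyperparameters are invented stand-ins.

def frozen_encoder(x):
    # Stand-in for the frozen embedding/encoder layers: a fixed feature
    # map that is never updated during training.
    return [x, x * x]

def train_head(xs, ys, lr=0.02, epochs=2000):
    """Fit a single regression output head(e) = w . e + b on top of the
    frozen encoder, updating only w and b."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            e = frozen_encoder(x)
            err = sum(wi * ei for wi, ei in zip(w, e)) + b - y
            w = [wi - lr * err * ei for wi, ei in zip(w, e)]
            b -= lr * err
    return w, b

def predict_score(x, w, b):
    # Round and clip the regression output to an integer score in [0, 5],
    # mirroring the post-training rounding described above.
    raw = sum(wi * ei for wi, ei in zip(w, frozen_encoder(x))) + b
    return min(5, max(0, round(raw)))
```

The design point mirrored here is that only the head's parameters receive gradient updates; the encoder stays fixed, which is what makes training on 450k annotations cheap.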
 <p>We then converted the problem to a binary classification task by using a fixed threshold to determine if a file is educational. With a threshold of 3, the model achieved an F1 score of 82% on the validation set, indicating strong performance in distinguishing high-quality educational content.</p>
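The binarization step above can be illustrated with a small helper that thresholds scores and measures F1. The data in the example is made up; the actual evaluation used the held-out set of Llama3 annotations.

```python
# Illustrative helper: treat a document as educational when its predicted
# score reaches the threshold, then compute F1 against binary ground truth.

def f1_at_threshold(scores, labels, threshold=3):
    """Binarize integer scores at `threshold` and compute F1 against
    binary labels (1 = educational)."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For instance, scores `[4, 2, 5, 1, 3]` against labels `[1, 0, 1, 1, 1]` give perfect precision but miss one educational document, so F1 falls below 1.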
 <p>The classifier is available at: <a href="https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier">https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier</a>. The training and inference code is available on <a href="https://github.com/huggingface/cosmopedia/tree/main/classification">GitHub</a>.</p>
 <h3>Filtering and results</h3>
-<p>We applied the classifier to the 15T tokens of 🍷 FineWeb, a process that required 6,000 H100 GPU hours. We investigated the impact of using different thresholds for the filtering and found that threshold 3 gave the best results. The plot below shows the performance of each threshold compared to FineWeb on six different benchmarks; it uses a 1.82B model trained on 8B tokens.</p>
+<p>We applied the classifier to the 15T tokens of 🍷 FineWeb, a process that required 6,000 H100 GPU hours. We investigated the impact of using different thresholds for the filtering and found that threshold 3 gave the best overall results. The plot below shows the performance of each threshold compared to FineWeb on six different benchmarks; it uses a 1.82B model trained on 8B tokens.</p>
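The filtering pass described above amounts to streaming documents and keeping those whose predicted educational score meets the threshold. The scoring function below is a placeholder; the real pipeline ran the released fineweb-edu-classifier over the full 15T tokens.

```python
# Minimal sketch of the filtering pass: yield only documents that score
# at or above the threshold (3 gave the best results in the ablations).
# `score_fn` is a placeholder for the real classifier's predict call.

def filter_educational(docs, score_fn, threshold=3):
    """Yield documents whose predicted score is >= `threshold`."""
    for doc in docs:
        if score_fn(doc) >= threshold:
            yield doc
```

Because it is a generator, the sketch processes documents one at a time, which is the shape a filter over trillions of tokens has to take.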
 <div class="main-plot-container">
 <figure>
 <img src="plots/edu-8k.png">
@@ -713,12 +713,11 @@
 <p>Here are the key highlights of the ablation results above:</p>
 <ul>
 <li>📚 FineWeb-Edu surpasses 🍷 FineWeb and all other open web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA.</li>
-<li>It achieves the same performance with significantly less data, requiring 10x fewer tokens compared to C4 and
+<li>It achieves the same performance with significantly less data, requiring 10x fewer tokens compared to C4 and Dolma to match MMLU results.</li>
 <li>This demonstrates the effectiveness of using classifiers trained on LLM annotations for large-scale data filtering.</li>
 </ul>
-<p>Given that a threshold of 2 also demonstrated strong performance while retaining more data, we are releasing an additional dataset filtered with this threshold, containing 5.4 trillion tokens
-<p>You can find the
-<p><strong>TODO: add dataset links and a collection</strong></p>
+<p>Given that a threshold of 2 also demonstrated strong performance while retaining more data, we are releasing an additional dataset filtered with this threshold, containing 5.4 trillion tokens, under <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-score-2">HuggingFaceFW/fineweb-edu-score-2</a>.</p>
+<p>You can find the two datasets along with the classifier used for the filtering in this <a href="https://huggingface.co/collections/HuggingFaceFW/fineweb-edu-6659c3f3d399d0e1d648adfd">collection</a>.</p>
 <h2>Next steps</h2>
 <p>We want to continue improving FineWeb and will also
 release a technical report with more details soon.</p>