add FineWeb-edu
index.html CHANGED (+34 -0)
@@ -651,6 +651,40 @@
 <p>Some histogram comparisons of C4, Dolma, RefinedWeb and
 FineWeb:</p>
 <figure><img src="plots/Untitled%203.png"/></figure>
+<h2>FineWeb-Edu</h2>
+<p>We are excited to release FineWeb-Edu, a filtered version of FineWeb for educational content, available in two sizes: 1.2 trillion and 4.5 trillion tokens. FineWeb-Edu outperforms all existing web datasets, with notable improvements on MMLU, ARC, and OpenBookQA benchmarks.</p>
+<p>A new approach has recently emerged for filtering LLM training datasets: using synthetic data to develop classifiers for identifying educational content. This technique was used in the training of <a href="https://ai.meta.com/blog/meta-llama-3-meta-ai-responsibility/">Llama 3</a> and <a href="https://arxiv.org/abs/2404.14219">Phi-3</a>, but its large-scale impact on web data filtering hasn't been fully explored or published.</p>
+<p>The popular Phi-3 models were trained on 3.3 and 4.8 trillion tokens, with the <a href="https://arxiv.org/abs/2404.14219">paper</a> stating:</p>
+<blockquote>Our training data consists of heavily filtered publicly available web data (according to the 'educational level') from various open internet sources, as well as synthetic LLM-generated data.</blockquote>
+<p>Similarly, the <a href="https://ai.meta.com/blog/meta-llama-3-meta-ai-responsibility/">Llama 3 blog post</a> notes:</p>
+<blockquote>We found that previous generations of Llama are good at identifying high-quality data, so we used Llama 2 to help build the text-quality classifiers that are powering Llama 3.</blockquote>
+<p>However, these classifiers and filtered datasets are not publicly available. To enhance FineWeb's quality, we developed an educational quality classifier using annotations generated by <a href="https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct">Llama3-70B-Instruct</a> to create FineWeb-Edu.</p>
+<h3>Annotation</h3>
+<p>We used Llama3-70B-Instruct to annotate 500k samples from the FineWeb dataset, scoring each for its educational quality on a scale from 0 to 5.</p>
+<p>We explored various prompts and found that the additive scale by <a href="https://arxiv.org/pdf/2401.10020">Yuan et al.</a> worked best. This scale lets the LLM reason about each additional point awarded, unlike a single-rating Likert scale, which forces samples into predefined boxes. To avoid the LLM favoring highly technical pages such as arXiv abstracts and submissions, we focused the prompt on grade-school and middle-school level knowledge. By setting a threshold of 3 (on a scale of 0 to 5) during filtering, we were still able to retain some higher-level educational pages.</p>
+<div style="text-align: center; margin: 20px 0;">
+<img src="https://cdn-uploads.huggingface.co/production/uploads/61c141342aac764ce1654e43/fjZQ4izIj1rx1xQnBTKKr.png" alt="Prompt for LLM annotation" style="width: 90%; max-width: 800px; height: auto;">
+<figcaption style="font-style: italic; margin-top: 10px;">Prompt used for Llama3 annotations of the educational score</figcaption>
+</div>
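+<p>To make the annotation step concrete, here is a minimal sketch of such a scoring loop. It is illustrative only: the prompt is a rough paraphrase of the additive-scale prompt shown in the figure above (not the exact wording), and it assumes access to Llama3-70B-Instruct through the Hugging Face Inference API via <code>huggingface_hub</code>.</p>
+<pre><code class="language-python">
+import re
+from huggingface_hub import InferenceClient
+
+# Rough paraphrase of the additive-scale prompt; the exact wording is in the figure above.
+PROMPT_TEMPLATE = """Below is an extract from a web page. Evaluate its educational value
+on an additive scale from 0 to 5: award one point for each criterion the extract meets
+(basic relevance, some educational content, appropriate for school use, clear and coherent
+writing, outstanding educational value for grade-school to middle-school students).
+
+Extract: {text}
+
+End your answer with the line "Educational score: X" where X is the total number of points."""
+
+client = InferenceClient("meta-llama/Meta-Llama-3-70B-Instruct")
+
+def annotate(text: str):
+    """Ask the LLM for an additive 0-5 educational score and parse it from the reply."""
+    completion = client.chat_completion(
+        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(text=text[:3000])}],
+        max_tokens=512,
+    )
+    answer = completion.choices[0].message.content
+    match = re.search(r"Educational score:\s*(\d)", answer)
+    return int(match.group(1)) if match else None
+
+print(annotate("Photosynthesis is the process by which plants turn sunlight into chemical energy..."))
+</code></pre>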
+<p>We also experimented with different LLMs: <a href="https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct">Llama3-70B-Instruct</a>, <a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1">Mixtral-8x7B-Instruct</a>, and <a href="https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1">Mixtral-8x22B-Instruct</a>. Llama3 and Mixtral-8x22B produced similar scores, while Mixtral-8x7B tended to be more generous and did not fully adhere to the score scale. <a href="https://arxiv.org/abs/2404.18796">Verga et al.</a> suggest using multiple LLMs as juries. We tried averaging the scores from the three models, but this shifted the distribution to the right due to the higher scores from Mixtral-8x7B. Training on a dataset filtered with a classifier trained on these jury annotations performed worse than using a classifier based on Llama3 annotations alone. We hypothesize that the jury-based approach retains more low-quality samples.</p>
+<div style="text-align: center; margin: 20px 0;">
+<img src="https://cdn-uploads.huggingface.co/production/uploads/61c141342aac764ce1654e43/dQskZA-4fsk8aR_8g9evJ.png" style="width: 80%; max-width: 700px; height: auto;">
+</div>
+<h3>Classifier Training</h3>
+<p>We added a classification head with a single regression output to <a href="https://huggingface.co/Snowflake/snowflake-arctic-embed-m">Snowflake-arctic-embed</a> and trained it on 450,000 Llama3 annotations for 20 epochs with a learning rate of 3e-4, freezing the embedding and encoder layers. We saved the checkpoint with the highest F1 score on our validation set. After training, we rounded the scores to integers from 0 to 5. With this setup, the model achieved an F1 score of 82%, indicating robust performance in distinguishing high-quality educational content.</p>
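+<p>As an illustration of this setup (not the actual training script, which is in the GitHub repository linked below), a sketch with <code>transformers</code> could look like the following. The annotation file name and batch size are placeholders; only the model, the single regression output, the frozen layers, the 20 epochs and the 3e-4 learning rate come from the description above.</p>
+<pre><code class="language-python">
+from datasets import load_dataset
+from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
+                          Trainer, TrainingArguments)
+
+# Hypothetical file holding the (text, score) pairs from the Llama3 annotations.
+dataset = load_dataset("csv", data_files="llama3_edu_annotations.csv")["train"]
+dataset = dataset.train_test_split(test_size=0.1, seed=42)
+
+tokenizer = AutoTokenizer.from_pretrained("Snowflake/snowflake-arctic-embed-m")
+# num_labels=1 with float labels gives a single regression output trained with MSE.
+model = AutoModelForSequenceClassification.from_pretrained(
+    "Snowflake/snowflake-arctic-embed-m", num_labels=1, problem_type="regression"
+)
+
+# Freeze the embedding and encoder layers; only the classification head is updated.
+for param in model.base_model.parameters():
+    param.requires_grad = False
+
+def tokenize(batch):
+    enc = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)
+    enc["labels"] = [float(score) for score in batch["score"]]
+    return enc
+
+tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset["train"].column_names)
+
+trainer = Trainer(
+    model=model,
+    args=TrainingArguments(
+        output_dir="edu-classifier",
+        num_train_epochs=20,
+        learning_rate=3e-4,
+        per_device_train_batch_size=64,  # placeholder, not specified in the post
+    ),
+    train_dataset=tokenized["train"],
+    eval_dataset=tokenized["test"],
+)
+trainer.train()
+# Checkpoint selection by F1 on the validation set (after rounding the regression
+# output to integers 0-5) is done separately and is not shown here.
+</code></pre>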
+<p>The classifier is available at <a href="https://huggingface.co/HuggingFaceTB/snowflake_m_edu_reg">https://huggingface.co/HuggingFaceTB/snowflake_m_edu_reg</a>. The training and inference code is available on <a href="https://github.com/huggingface/cosmopedia/tree/edu-classifier/classification">GitHub</a>.</p>
+<p><strong>TODO: fill model card and move the github code to another folder</strong></p>
+<h3>Filtering</h3>
+<p>We applied the classifier to the 15T tokens of FineWeb, a process that required 6,000 H100 GPU hours. To build FineWeb-Edu, we filtered out samples with scores lower than 3. This removed 92% of the dataset, leaving us with 1.2T educational tokens. Here are the key highlights of the ablation results:</p>
+<ul>
+<li>FineWeb-Edu surpasses FineWeb and all other web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA.</li>
+<li>It achieves the same performance with significantly less data, requiring 10x fewer tokens than C4 and Dolma1.7 to match MMLU results.</li>
+<li>It delivers strong performance boosts on benchmarks like MMLU and ARC without the filtering being tuned to target them.</li>
+<li>This demonstrates the effectiveness of using classifiers trained on LLM annotations for large-scale data filtering.</li>
+</ul>
+<p>To keep more tokens, we also experimented with a less strict threshold of 2 instead of 3. This approach preserved 4.5T tokens and still outperformed the FineWeb dataset, with performance only slightly below that of the threshold-3 version.</p>
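+<p>To illustrate the threshold-based filtering described above, here is a small sketch using the <code>datasets</code> streaming API on a FineWeb sample rather than the full 15T-token run (which used large-scale inference on H100s). It assumes the released classifier loads as a standard text-classification model; the 2,000-character truncation is a placeholder.</p>
+<pre><code class="language-python">
+from datasets import load_dataset
+from transformers import pipeline
+
+# Stream a small FineWeb sample instead of the full dataset.
+fineweb = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train", streaming=True)
+
+# Load the educational-quality regressor released with FineWeb-Edu.
+scorer = pipeline(
+    "text-classification",
+    model="HuggingFaceTB/snowflake_m_edu_reg",
+    truncation=True,
+    max_length=512,
+    function_to_apply="none",  # return the raw regression output, not a probability
+)
+
+THRESHOLD = 3  # use 2 for the larger, less strict variant
+
+def keep(example):
+    """Keep a document only if its rounded educational score reaches the threshold."""
+    score = scorer(example["text"][:2000])[0]["score"]
+    return round(score) >= THRESHOLD
+
+fineweb_edu = fineweb.filter(keep)
+print(next(iter(fineweb_edu))["text"][:200])
+</code></pre>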
+<p>We release these two datasets as FineWeb-Edu and FineWeb-Edu-Large, along with the classifier used for the filtering.</p>
+<p><strong>TODO: add ablation results and dataset links, and maybe FineWeb-edu-smol</strong></p>
 <h2>Just like fine wine, not all crawls are created
 equal</h2>
 <p>During our ablation runs, we observed that certain crawls