add FineWeb-edu
index.html CHANGED (+34 -0)
@@ -651,6 +651,40 @@
 <p>Some histogram comparisons of C4, Dolma, RefinedWeb and
 FineWeb:</p>
 <figure><img src="plots/Untitled%203.png"/></figure>
+<h2>FineWeb-Edu</h2>
+<p>We are excited to release FineWeb-Edu, a filtered version of FineWeb for educational content, available in two sizes: 1.2 trillion and 4.5 trillion tokens. FineWeb-Edu outperforms all existing web datasets, with notable improvements on MMLU, ARC, and OpenBookQA benchmarks.</p>
+<p>A new approach has recently emerged for filtering LLM training datasets: using synthetic data to develop classifiers for identifying educational content. This technique was used in the training of <a href="https://ai.meta.com/blog/meta-llama-3-meta-ai-responsibility/">Llama 3</a> and <a href="https://arxiv.org/abs/2404.14219">Phi-3</a>, but its large-scale impact on web data filtering hasn't been fully explored or published.</p>
+<p>The popular Phi-3 models were trained on 3.3 and 4.8 trillion tokens, with the <a href="https://arxiv.org/abs/2404.14219">paper</a> stating:</p>
+<blockquote>Our training data consists of heavily filtered publicly available web data (according to the 'educational level') from various open internet sources, as well as synthetic LLM-generated data.</blockquote>
+<p>Similarly, the <a href="https://ai.meta.com/blog/meta-llama-3-meta-ai-responsibility/">Llama 3 blog post</a> notes:</p>
+<blockquote>We found that previous generations of Llama are good at identifying high-quality data, so we used Llama 2 to help build the text-quality classifiers that are powering Llama 3.</blockquote>
+<p>However, these classifiers and filtered datasets are not publicly available. To enhance FineWeb's quality, we developed an educational quality classifier using annotations generated by <a href="https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct">Llama3-70B-Instruct</a> to create FineWeb-Edu.</p>
+<h3>Annotation</h3>
+<p>We used Llama3-70B-Instruct to annotate 500k samples from the FineWeb dataset, scoring each for its educational quality on a scale from 0 to 5.</p>
+<p>We explored various prompts and found that the additive scale by <a href="https://arxiv.org/pdf/2401.10020">Yuan et al.</a> worked best. This scale lets the LLM reason about each additional point awarded, unlike a single-rating Likert scale, which forces samples into predefined boxes. To avoid the LLM favoring highly technical pages such as arXiv abstracts and submissions, we focused the prompt on grade-school and middle-school level knowledge. By setting a threshold of 3 (on a scale of 0 to 5) during filtering, we were still able to retain some higher-level educational pages.</p>
+<div style="text-align: center; margin: 20px 0;">
+<img src="https://cdn-uploads.huggingface.co/production/uploads/61c141342aac764ce1654e43/fjZQ4izIj1rx1xQnBTKKr.png" alt="Prompt for LLM annotation" style="width: 90%; max-width: 800px; height: auto;">
+<figcaption style="font-style: italic; margin-top: 10px;">Prompt used for Llama3 annotations of the educational score</figcaption>
+</div>
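+<p>To make the annotation step concrete, here is a minimal sketch of such a scoring loop. It is illustrative only: the prompt is a rough paraphrase of the additive-scale prompt shown in the figure above (not the exact wording), and it assumes access to Llama3-70B-Instruct through the Hugging Face Inference API via <code>huggingface_hub</code>.</p>
+<pre><code class="language-python">
+import re
+from huggingface_hub import InferenceClient
+
+# Rough paraphrase of the additive-scale prompt; the exact wording is in the figure above.
+PROMPT_TEMPLATE = """Below is an extract from a web page. Evaluate its educational value
+on an additive scale from 0 to 5: award one point for each criterion the extract meets
+(basic relevance, some educational content, appropriate for school use, clear and coherent
+writing, outstanding educational value for grade-school to middle-school students).
+
+Extract: {text}
+
+End your answer with the line "Educational score: X" where X is the total number of points."""
+
+client = InferenceClient("meta-llama/Meta-Llama-3-70B-Instruct")
+
+def annotate(text: str):
+    """Ask the LLM for an additive 0-5 educational score and parse it from the reply."""
+    completion = client.chat_completion(
+        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(text=text[:3000])}],
+        max_tokens=512,
+    )
+    answer = completion.choices[0].message.content
+    match = re.search(r"Educational score:\s*(\d)", answer)
+    return int(match.group(1)) if match else None
+
+print(annotate("Photosynthesis is the process by which plants turn sunlight into chemical energy..."))
+</code></pre>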
+<p>We also experimented with different LLMs: <a href="https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct">Llama3-70B-Instruct</a>, <a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1">Mixtral-8x7B-Instruct</a>, and <a href="https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1">Mixtral-8x22B-Instruct</a>. Llama3 and Mixtral-8x22B produced similar scores, while Mixtral-8x7B tended to be more generous and did not fully adhere to the score scale. <a href="https://arxiv.org/abs/2404.18796">Verga et al.</a> suggest using multiple LLMs as juries. We tried averaging the scores from the three models, but this shifted the distribution to the right due to the higher scores from Mixtral-8x7B. Training on a dataset filtered with a classifier trained on these jury annotations performed worse than using a classifier based on Llama3 annotations alone. We hypothesize that the jury-based approach retains more low-quality samples.</p>
+<div style="text-align: center; margin: 20px 0;">
+<img src="https://cdn-uploads.huggingface.co/production/uploads/61c141342aac764ce1654e43/dQskZA-4fsk8aR_8g9evJ.png" style="width: 80%; max-width: 700px; height: auto;">
+</div>
+<h3>Classifier Training</h3>
+<p>We added a classification head with a single regression output to <a href="https://huggingface.co/Snowflake/snowflake-arctic-embed-m">Snowflake-arctic-embed</a> and trained it on 450,000 Llama3 annotations for 20 epochs with a learning rate of 3e-4, freezing the embedding and encoder layers. We saved the checkpoint with the highest F1 score on our validation set. After training, we rounded the scores to integers from 0 to 5. With this setup, the model achieved an F1 score of 82%, indicating robust performance in distinguishing high-quality educational content.</p>
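+<p>As an illustration of this setup (not the actual training script, which is in the GitHub repository linked below), a sketch with <code>transformers</code> could look like the following. The annotation file name and batch size are placeholders; only the model, the single regression output, the frozen layers, the 20 epochs and the 3e-4 learning rate come from the description above.</p>
+<pre><code class="language-python">
+from datasets import load_dataset
+from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
+                          Trainer, TrainingArguments)
+
+# Hypothetical file holding the (text, score) pairs from the Llama3 annotations.
+dataset = load_dataset("csv", data_files="llama3_edu_annotations.csv")["train"]
+dataset = dataset.train_test_split(test_size=0.1, seed=42)
+
+tokenizer = AutoTokenizer.from_pretrained("Snowflake/snowflake-arctic-embed-m")
+# num_labels=1 with float labels gives a single regression output trained with MSE.
+model = AutoModelForSequenceClassification.from_pretrained(
+    "Snowflake/snowflake-arctic-embed-m", num_labels=1, problem_type="regression"
+)
+
+# Freeze the embedding and encoder layers; only the classification head is updated.
+for param in model.base_model.parameters():
+    param.requires_grad = False
+
+def tokenize(batch):
+    enc = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)
+    enc["labels"] = [float(score) for score in batch["score"]]
+    return enc
+
+tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset["train"].column_names)
+
+trainer = Trainer(
+    model=model,
+    args=TrainingArguments(
+        output_dir="edu-classifier",
+        num_train_epochs=20,
+        learning_rate=3e-4,
+        per_device_train_batch_size=64,  # placeholder, not specified in the post
+    ),
+    train_dataset=tokenized["train"],
+    eval_dataset=tokenized["test"],
+)
+trainer.train()
+# Checkpoint selection by F1 on the validation set (after rounding the regression
+# output to integers 0-5) is done separately and is not shown here.
+</code></pre>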
+<p>The classifier is available at <a href="https://huggingface.co/HuggingFaceTB/snowflake_m_edu_reg">https://huggingface.co/HuggingFaceTB/snowflake_m_edu_reg</a>. The training and inference code is available on <a href="https://github.com/huggingface/cosmopedia/tree/edu-classifier/classification">GitHub</a>.</p>
+<p><strong>TODO: fill model card and move the github code to another folder</strong></p>
+<h3>Filtering</h3>
+<p>We applied the classifier to the 15T tokens of FineWeb, a process that required 6,000 H100 GPU hours. To build FineWeb-Edu, we filtered out samples with scores lower than 3. This removed 92% of the dataset, leaving us with 1.2T educational tokens. Here are the key highlights of the ablation results:</p>
+<ul>
+<li>FineWeb-Edu surpasses FineWeb and all other web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA.</li>
+<li>It achieves the same performance with significantly less data, requiring 10x fewer tokens than C4 and Dolma1.7 to match MMLU results.</li>
+<li>It delivers strong performance boosts on benchmarks like MMLU and ARC without the filtering being tuned to target them.</li>
+<li>This demonstrates the effectiveness of using classifiers trained on LLM annotations for large-scale data filtering.</li>
+</ul>
+<p>To keep more tokens, we also experimented with a less strict threshold of 2 instead of 3. This approach preserved 4.5T tokens and still outperformed the FineWeb dataset, with performance only slightly below that of the threshold-3 version.</p>
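+<p>To illustrate the threshold-based filtering described above, here is a small sketch using the <code>datasets</code> streaming API on a FineWeb sample rather than the full 15T-token run (which used large-scale inference on H100s). It assumes the released classifier loads as a standard text-classification model; the 2,000-character truncation is a placeholder.</p>
+<pre><code class="language-python">
+from datasets import load_dataset
+from transformers import pipeline
+
+# Stream a small FineWeb sample instead of the full dataset.
+fineweb = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train", streaming=True)
+
+# Load the educational-quality regressor released with FineWeb-Edu.
+scorer = pipeline(
+    "text-classification",
+    model="HuggingFaceTB/snowflake_m_edu_reg",
+    truncation=True,
+    max_length=512,
+    function_to_apply="none",  # return the raw regression output, not a probability
+)
+
+THRESHOLD = 3  # use 2 for the larger, less strict variant
+
+def keep(example):
+    """Keep a document only if its rounded educational score reaches the threshold."""
+    score = scorer(example["text"][:2000])[0]["score"]
+    return round(score) >= THRESHOLD
+
+fineweb_edu = fineweb.filter(keep)
+print(next(iter(fineweb_edu))["text"][:200])
+</code></pre>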
+<p>We release these two datasets as FineWeb-Edu and FineWeb-Edu-Large, along with the classifier used for the filtering.</p>
+<p><strong>TODO: add ablation results and dataset links, and maybe FineWeb-edu-smol</strong></p>
 <h2>Just like fine wine, not all crawls are created
 equal</h2>
 <p>During our ablation runs, we observed that certain crawls