delete some histograms
index.html CHANGED (+12 -5)
@@ -676,9 +676,6 @@
 <figure><img src="plots/dataset_ablations.png"/></figure>
 <div id="plot-dataset_ablations"></div>
 </div>
-<p>Some histogram comparisons of C4, Dolma, RefinedWeb and
-🍷 FineWeb:</p>
-<figure><img src="plots/Untitled%203.png"/></figure>
 <h2>📚 FineWeb-Edu</h2>
 <p>A new approach has recently emerged for filtering LLM training datasets: using synthetic data to develop classifiers for identifying educational content. This technique was used in the trainings of <a href="https://ai.meta.com/blog/meta-llama-3-meta-ai-responsibility/">LLama3</a> and <a href="https://arxiv.org/abs/2404.14219">Phi3</a>, but its large-scale impact on web data filtering hasn't been fully explored or published.</p>
 <p>The popular Phi3 models were trained on 3.3 and 4.8 trillion tokens, with the <a href="https://arxiv.org/abs/2404.14219">paper</a> stating:</p>
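The context kept in the hunk above describes the idea behind 📚 FineWeb-Edu: use synthetic, LLM-produced annotations to train a classifier that recognizes educational content. As a rough illustration of that idea only (not the pipeline behind this post), the Python sketch below fine-tunes a generic sequence-classification backbone with a single regression head on hypothetical graded (text, score) pairs; the base model, annotation file, column names and hyperparameters are all placeholders.

# Illustrative only: fine-tune a generic backbone with a single regression head on
# hypothetical LLM-graded (text, score) pairs. The base model, annotation file,
# column names and hyperparameters are assumptions, not the setup used for the post.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

BASE_MODEL = "bert-base-uncased"  # placeholder backbone

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
# num_labels=1 with problem_type="regression" makes the model predict one scalar
# educational-quality score per document (trained with MSE loss).
model = AutoModelForSequenceClassification.from_pretrained(
    BASE_MODEL, num_labels=1, problem_type="regression"
)

# Synthetic annotations: JSONL rows such as {"text": "...", "score": 2.5}, where the
# score comes from prompting a strong LLM to grade the page's educational value.
annotations = load_dataset("json", data_files="edu_annotations.jsonl", split="train")

def tokenize(batch):
    enc = tokenizer(batch["text"], truncation=True, max_length=512)
    enc["labels"] = [float(s) for s in batch["score"]]
    return enc

train_set = annotations.map(tokenize, batched=True,
                            remove_columns=annotations.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="edu-classifier", num_train_epochs=3,
                           per_device_train_batch_size=32, learning_rate=3e-5),
    train_dataset=train_set,
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()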
@@ -701,9 +698,19 @@
 <p><strong>TODO: fill model card and move the github code to another folder</strong></p>
 <h3>Filtering and results</h3>
 <p>We applied the classifier to the 15T tokens of 🍷 FineWeb, a process that required 6,000 H100 GPU hours. We investigated the impact of using different thresholds for the filtering and found that threshold 3 gave the best results. The plot below shows the performance of each threshold compared to FineWeb on six different benchmarks; it uses a 1.82B model trained on 8B tokens.</p>
-<
+<div class="main-plot-container">
+    <figure>
+        <img src="plots/edu-8k.png">
+    </figure>
+    <div id="plot-edu-8k"></div>
+</div>
 <p>We then built 📚 FineWeb-Edu by filtering out samples with scores lower than 3. This removed 92% of the dataset, leaving us with 1.2T educational tokens. To evaluate the effectiveness of this filtering at a larger scale, we conducted an ablation using a 1.82B model trained on 350 billion tokens, similar to the FineWeb filtering ablation mentioned above:</p>
-<
+<div class="main-plot-container">
+    <figure>
+        <img src="plots/edu-100k.png">
+    </figure>
+    <div id="plot-edu-100k"></div>
+</div>
 <p>Here are the key highlights of the ablation results above:</p>
 <ul>
 <li>📚 FineWeb-Edu surpasses 🍷 FineWeb and all other open web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA.</li>
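The "Filtering and results" paragraphs in the hunk above boil down to one operation: score every FineWeb document with the classifier and keep those scoring at least 3. Below is a minimal Python sketch of that step under stated assumptions: the model id is a placeholder (the model card is still a TODO in this change), the classifier is assumed to be a standard sequence-classification checkpoint whose single logit is the quality score, and FineWeb is streamed from the Hub rather than processed in full.

# Minimal sketch of the scoring + threshold-3 filter. MODEL_ID is a placeholder,
# not a released model; the single-logit scoring convention is an assumption.
import torch
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "path/to/edu-classifier"  # placeholder
THRESHOLD = 3.0                      # the threshold the ablation found to work best

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).eval()

def edu_score(text: str) -> float:
    """Return the classifier's scalar educational score for one document."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).logits.squeeze().item()

# Stream FineWeb and keep only documents scoring at or above the threshold,
# mirroring the "filter out samples with scores lower than 3" step.
fineweb = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
fineweb_edu = fineweb.filter(lambda row: edu_score(row["text"]) >= THRESHOLD)

for doc in fineweb_edu.take(5):
    print(doc["text"][:120])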