delete some histograms
index.html CHANGED (+12 -5)
@@ -676,9 +676,6 @@
 <figure><img src="plots/dataset_ablations.png"/></figure>
 <div id="plot-dataset_ablations"></div>
 </div>
-<p>Some histogram comparisons of C4, Dolma, RefinedWeb and
-🍷 FineWeb:</p>
-<figure><img src="plots/Untitled%203.png"/></figure>
 <h2>📚 FineWeb-Edu</h2>
 <p>A new approach has recently emerged for filtering LLM training datasets: using synthetic data to develop classifiers for identifying educational content. This technique was used in the trainings of <a href="https://ai.meta.com/blog/meta-llama-3-meta-ai-responsibility/">LLama3</a> and <a href="https://arxiv.org/abs/2404.14219">Phi3</a>, but its large-scale impact on web data filtering hasn't been fully explored or published.</p>
 <p>The popular Phi3 models were trained on 3.3 and 4.8 trillion tokens, with the <a href="https://arxiv.org/abs/2404.14219">paper</a> stating:</p>
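The context kept in the hunk above describes the idea behind 📚 FineWeb-Edu: use synthetic, LLM-produced annotations to train a classifier that recognizes educational content. As a rough illustration of that idea only (not the pipeline behind this post), the Python sketch below fine-tunes a generic sequence-classification backbone with a single regression head on hypothetical graded (text, score) pairs; the base model, annotation file, column names and hyperparameters are all placeholders.

# Illustrative only: fine-tune a generic backbone with a single regression head on
# hypothetical LLM-graded (text, score) pairs. The base model, annotation file,
# column names and hyperparameters are assumptions, not the setup used for the post.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

BASE_MODEL = "bert-base-uncased"  # placeholder backbone

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
# num_labels=1 with problem_type="regression" makes the model predict one scalar
# educational-quality score per document (trained with MSE loss).
model = AutoModelForSequenceClassification.from_pretrained(
    BASE_MODEL, num_labels=1, problem_type="regression"
)

# Synthetic annotations: JSONL rows such as {"text": "...", "score": 2.5}, where the
# score comes from prompting a strong LLM to grade the page's educational value.
annotations = load_dataset("json", data_files="edu_annotations.jsonl", split="train")

def tokenize(batch):
    enc = tokenizer(batch["text"], truncation=True, max_length=512)
    enc["labels"] = [float(s) for s in batch["score"]]
    return enc

train_set = annotations.map(tokenize, batched=True,
                            remove_columns=annotations.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="edu-classifier", num_train_epochs=3,
                           per_device_train_batch_size=32, learning_rate=3e-5),
    train_dataset=train_set,
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()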
@@ -701,9 +698,19 @@
 <p><strong>TODO: fill model card and move the github code to another folder</strong></p>
 <h3>Filtering and results</h3>
 <p>We applied the classifier to the 15T tokens of 🍷 FineWeb, a process that required 6,000 H100 GPU hours. We investigated the impact of using different thresholds for the filtering and found that threshold 3 gave the best results. The plot below shows the performance of each threshold compared to FineWeb on six different benchmarks; it uses a 1.82B model trained on 8B tokens.</p>
-<
+<div class="main-plot-container">
+    <figure>
+        <img src="plots/edu-8k.png">
+    </figure>
+    <div id="plot-edu-8k"></div>
+</div>
 <p>We then built 📚 FineWeb-Edu by filtering out samples with scores lower than 3. This removed 92% of the dataset, leaving us with 1.2T educational tokens. To evaluate the effectiveness of this filtering at a larger scale, we conducted an ablation using a 1.82B model trained on 350 billion tokens, similar to the FineWeb filtering ablation mentioned above:</p>
-<
+<div class="main-plot-container">
+    <figure>
+        <img src="plots/edu-100k.png">
+    </figure>
+    <div id="plot-edu-100k"></div>
+</div>
 <p>Here are the key highlights of the ablation results above:</p>
 <ul>
 <li>📚 FineWeb-Edu surpasses 🍷 FineWeb and all other open web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA.</li>
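The "Filtering and results" paragraphs in the hunk above boil down to one operation: score every FineWeb document with the classifier and keep those scoring at least 3. Below is a minimal Python sketch of that step under stated assumptions: the model id is a placeholder (the model card is still a TODO in this change), the classifier is assumed to be a standard sequence-classification checkpoint whose single logit is the quality score, and FineWeb is streamed from the Hub rather than processed in full.

# Minimal sketch of the scoring + threshold-3 filter. MODEL_ID is a placeholder,
# not a released model; the single-logit scoring convention is an assumption.
import torch
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "path/to/edu-classifier"  # placeholder
THRESHOLD = 3.0                      # the threshold the ablation found to work best

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).eval()

def edu_score(text: str) -> float:
    """Return the classifier's scalar educational score for one document."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).logits.squeeze().item()

# Stream FineWeb and keep only documents scoring at or above the threshold,
# mirroring the "filter out samples with scores lower than 3" step.
fineweb = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
fineweb_edu = fineweb.filter(lambda row: edu_score(row["text"]) >= THRESHOLD)

for doc in fineweb_edu.take(5):
    print(doc["text"][:120])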