hynky (HF Staff) committed
Commit 2c9e5db · 1 Parent(s): d25af98

delete some histograms

Files changed (1):
  1. index.html +12 -5
index.html CHANGED
@@ -676,9 +676,6 @@
   <figure><img src="plots/dataset_ablations.png"/></figure>
   <div id="plot-dataset_ablations"></div>
   </div>
-  <p>Some histogram comparisons of C4, Dolma, RefinedWeb and
-  🍷 FineWeb:</p>
-  <figure><img src="plots/Untitled%203.png"/></figure>
   <h2>📚 FineWeb-Edu</h2>
   <p>A new approach has recently emerged for filtering LLM training datasets: using synthetic data to develop classifiers for identifying educational content. This technique was used in the trainings of <a href="https://ai.meta.com/blog/meta-llama-3-meta-ai-responsibility/">LLama3</a> and <a href="https://arxiv.org/abs/2404.14219">Phi3</a>, but its large-scale impact on web data filtering hasn't been fully explored or published.</p>
   <p>The popular Phi3 models were trained on 3.3 and 4.8 trillion tokens, with the <a href="https://arxiv.org/abs/2404.14219">paper</a> stating:</p>
@@ -701,9 +698,19 @@
   <p><strong>TODO: fill model card and move the github code to another folder</strong></p>
   <h3>Filtering and results</h3>
   <p>We applied the classifier to the 15T tokens of 🍷 FineWeb, a process that required 6,000 H100 GPU hours. We investigated the impact of using different thresholds for the filtering and found that threshold 3 gave the best results. The plot below shows the performance of each threshold compared to FineWeb on six different benchmarks; it uses a 1.82B model trained on 8B tokens.</p>
-  <p><strong>TODO: add the plot</strong></p>
+  <div class="main-plot-container">
+  <figure>
+  <img src="plots/edu-8k.png">
+  </figure>
+  <div id="plot-edu-8k"></div>
+  </div>
   <p>We then built 📚 FineWeb-Edu by filtering out samples with scores lower than 3. This removed 92% of the dataset, leaving us with 1.2T educational tokens. To evaluate the effectiveness of this filtering at a larger scale, we conducted an ablation using a 1.82B model trained on 350 billion tokens, similar to the FineWeb filtering ablation mentioned above:</p>
-  <p><strong>TODO: add the plot</strong></p>
+  <div class="main-plot-container">
+  <figure>
+  <img src="plots/edu-100k.png">
+  </figure>
+  <div id="plot-edu-100k"></div>
+  </div>
   <p>Here are the key highlights of the ablation results above:</p>
   <ul>
   <li>📚 FineWeb-Edu surpasses 🍷 FineWeb and all other open web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA.</li>
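
The paragraphs touched by this diff describe scoring every 🍷 FineWeb document with an educational-quality classifier before any thresholding. The commit does not include that code, so the following is only a minimal Python sketch of how such per-document scoring could look; the checkpoint name, the 512-token truncation, and the single regression-style output head are all assumptions for illustration, not details taken from this commit.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical checkpoint name, used purely for illustration.
CHECKPOINT = "path/to/edu-classifier"

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT).eval()

def edu_score(text: str) -> float:
    # Tokenize one document; the truncation length is an arbitrary choice here.
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Assumes a single regression-style output; a multi-class head would need
    # an argmax or an expectation over class indices instead.
    return logits.squeeze().item()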
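
The added text also describes building 📚 FineWeb-Edu by dropping samples whose score is below 3, which removed 92% of the dataset and left 1.2T educational tokens. Below is a hedged sketch of that thresholding step, assuming the scores from a previous classifier pass are stored in a hypothetical "edu_score" column of a hypothetical scored dataset; only the threshold value of 3 comes from the text.

from datasets import load_dataset

THRESHOLD = 3  # per the text: samples scoring below 3 are dropped (~92% of the data)

# Dataset id and the "edu_score" column are assumptions for illustration.
scored = load_dataset("my-org/fineweb-scored", split="train", streaming=True)

def keep(example: dict) -> bool:
    # Keep only documents at or above the educational-score threshold.
    return example.get("edu_score", 0.0) >= THRESHOLD

fineweb_edu = scored.filter(keep)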