guipenedo committed
Commit 70d3e30 · 1 Parent(s): 875453d

move fineweb-edu intro to the top

Files changed (1): index.html (+3 -1)
index.html CHANGED
@@ -172,6 +172,9 @@
   (15T gpt2 tokens, 44TB disk space) dataset of clean text sourced from the web for LLM pretraining. You can
   download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">here</a>.</p>
   <p>[TODO: ADD MORE INTRODUCTION]</p>
+  <p>We are also excited to release 📚 FineWeb-Edu, a filtered version of FineWeb for educational content, available in two sizes: 1.2 trillion and 4.5 trillion tokens. FineWeb-Edu outperforms all existing public web datasets, with notable improvements on MMLU, ARC, and OpenBookQA benchmarks. You can
+  download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu">here</a>.</p>
+
   <p>As 🍷FineWeb has gathered a lot of interest from the
   community, we decided to further explain the steps involved in creating it, our processing decisions and
   some lessons learned along the way. Read on for all the juicy details on large text dataset creation!</p>
@@ -662,7 +665,6 @@
   FineWeb:</p>
   <figure><img src="plots/Untitled%203.png"/></figure>
   <h2>📚 FineWeb-Edu</h2>
-  <p>We are excited to release 📚 FineWeb-Edu, a filtered version of FineWeb for educational content, available in two sizes: 1.2 trillion and 4.5 trillion tokens. FineWeb-Edu outperforms all existing web datasets, with notable improvements on MMLU, ARC, and OpenBookQA benchmarks.</p>
   <p>A new approach has recently emerged for filtering LLM training datasets: using synthetic data to develop classifiers for identifying educational content. This technique was used in the trainings of <a href="https://ai.meta.com/blog/meta-llama-3-meta-ai-responsibility/">LLama3</a> and <a href="https://arxiv.org/abs/2404.14219">Phi3</a>, but its large-scale impact on web data filtering hasn't been fully explored or published.</p>
   <p>The popular Phi3 models were trained on 3.3 and 4.8 trillion tokens, with the <a href="https://arxiv.org/abs/2404.14219">paper</a> stating:</p>
   <blockquote>Our training data consists of heavily filtered publicly available web data (according to the 'educational level') from various open internet sources, as well as synthetic LLM-generated data.</blockquote>
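
The classifier-based filtering mentioned in the diffed text lends itself to a small sketch. The snippet below is an illustration only, not the pipeline used for FineWeb-Edu: it assumes the educational-quality model is published as a Hub sequence-classification checkpoint with a single regression head scoring documents on a 0-5 scale, and the checkpoint name "HuggingFaceFW/fineweb-edu-classifier" and threshold of 3 are assumptions not stated in this commit.

# Illustration only: score web documents with an educational-quality classifier
# and keep those above a threshold. The checkpoint name, 0-5 score scale, and
# threshold are assumptions; the commit above does not specify them.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "HuggingFaceFW/fineweb-edu-classifier"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def educational_score(text: str) -> float:
    """Return a scalar educational-quality score for one document."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # shape (1, 1) assuming a regression head
    return logits.squeeze().item()

# Keep only documents whose score clears the (assumed) threshold of 3.
documents = [
    "Photosynthesis converts light energy into chemical energy in plants...",
    "LIMITED TIME OFFER!!! Click here to win a free cruise.",
]
kept = [doc for doc in documents if educational_score(doc) >= 3.0]
print(f"kept {len(kept)} of {len(documents)} documents")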