guipenedo committed
Commit 70d3e30 · 1 Parent(s): 875453d

move fineweb-edu intro to the top

Files changed (1): index.html (+3 -1)
index.html CHANGED
@@ -172,6 +172,9 @@
   (15T gpt2 tokens, 44TB disk space) dataset of clean text sourced from the web for LLM pretraining. You can
   download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">here</a>.</p>
   <p>[TODO: ADD MORE INTRODUCTION]</p>
+  <p>We are also excited to release 📚 FineWeb-Edu, a filtered version of FineWeb for educational content, available in two sizes: 1.2 trillion and 4.5 trillion tokens. FineWeb-Edu outperforms all existing public web datasets, with notable improvements on MMLU, ARC, and OpenBookQA benchmarks. You can
+  download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu">here</a>.</p>
+
   <p>As 🍷FineWeb has gathered a lot of interest from the
   community, we decided to further explain the steps involved in creating it, our processing decisions and
   some lessons learned along the way. Read on for all the juicy details on large text dataset creation!</p>
@@ -662,7 +665,6 @@
   FineWeb:</p>
   <figure><img src="plots/Untitled%203.png"/></figure>
   <h2>📚 FineWeb-Edu</h2>
-  <p>We are excited to release 📚 FineWeb-Edu, a filtered version of FineWeb for educational content, available in two sizes: 1.2 trillion and 4.5 trillion tokens. FineWeb-Edu outperforms all existing web datasets, with notable improvements on MMLU, ARC, and OpenBookQA benchmarks.</p>
   <p>A new approach has recently emerged for filtering LLM training datasets: using synthetic data to develop classifiers for identifying educational content. This technique was used in the trainings of <a href="https://ai.meta.com/blog/meta-llama-3-meta-ai-responsibility/">LLama3</a> and <a href="https://arxiv.org/abs/2404.14219">Phi3</a>, but its large-scale impact on web data filtering hasn't been fully explored or published.</p>
   <p>The popular Phi3 models were trained on 3.3 and 4.8 trillion tokens, with the <a href="https://arxiv.org/abs/2404.14219">paper</a> stating:</p>
   <blockquote>Our training data consists of heavily filtered publicly available web data (according to the 'educational level') from various open internet sources, as well as synthetic LLM-generated data.</blockquote>
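
The classifier-based filtering mentioned in the diffed text lends itself to a small sketch. The snippet below is an illustration only, not the pipeline used for FineWeb-Edu: it assumes the educational-quality model is published as a Hub sequence-classification checkpoint with a single regression head scoring documents on a 0-5 scale, and the checkpoint name "HuggingFaceFW/fineweb-edu-classifier" and threshold of 3 are assumptions not stated in this commit.

# Illustration only: score web documents with an educational-quality classifier
# and keep those above a threshold. The checkpoint name, 0-5 score scale, and
# threshold are assumptions; the commit above does not specify them.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "HuggingFaceFW/fineweb-edu-classifier"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def educational_score(text: str) -> float:
    """Return a scalar educational-quality score for one document."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # shape (1, 1) assuming a regression head
    return logits.squeeze().item()

# Keep only documents whose score clears the (assumed) threshold of 3.
documents = [
    "Photosynthesis converts light energy into chemical energy in plants...",
    "LIMITED TIME OFFER!!! Click here to win a free cruise.",
]
kept = [doc for doc in documents if educational_score(doc) >= 3.0]
print(f"kept {len(kept)} of {len(documents)} documents")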