move fineweb-edu intro to the top
index.html CHANGED (+3 -1)
@@ -172,6 +172,9 @@
 (15T gpt2 tokens, 44TB disk space) dataset of clean text sourced from the web for LLM pretraining. You can
 download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">here</a>.</p>
 <p>[TODO: ADD MORE INTRODUCTION]</p>
+<p>We are also excited to release 📚 FineWeb-Edu, a filtered version of FineWeb for educational content, available in two sizes: 1.2 trillion and 4.5 trillion tokens. FineWeb-Edu outperforms all existing public web datasets, with notable improvements on MMLU, ARC, and OpenBookQA benchmarks. You can
+download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu">here</a>.</p>
+
 <p>As 🍷FineWeb has gathered a lot of interest from the
 community, we decided to further explain the steps involved in creating it, our processing decisions and
 some lessons learned along the way. Read on for all the juicy details on large text dataset creation!</p>
@@ -662,7 +665,6 @@
 FineWeb:</p>
 <figure><img src="plots/Untitled%203.png"/></figure>
 <h2>📚 FineWeb-Edu</h2>
-<p>We are excited to release 📚 FineWeb-Edu, a filtered version of FineWeb for educational content, available in two sizes: 1.2 trillion and 4.5 trillion tokens. FineWeb-Edu outperforms all existing web datasets, with notable improvements on MMLU, ARC, and OpenBookQA benchmarks.</p>
 <p>A new approach has recently emerged for filtering LLM training datasets: using synthetic data to develop classifiers for identifying educational content. This technique was used in the trainings of <a href="https://ai.meta.com/blog/meta-llama-3-meta-ai-responsibility/">LLama3</a> and <a href="https://arxiv.org/abs/2404.14219">Phi3</a>, but its large-scale impact on web data filtering hasn't been fully explored or published.</p>
 <p>The popular Phi3 models were trained on 3.3 and 4.8 trillion tokens, with the <a href="https://arxiv.org/abs/2404.14219">paper</a> stating:</p>
 <blockquote>Our training data consists of heavily filtered publicly available web data (according to the 'educational level') from various open internet sources, as well as synthetic LLM-generated data.</blockquote>
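A side note on the classifier-based filtering technique mentioned in the second hunk: the sketch below illustrates the general recipe described there (a strong LLM annotates a small sample of pages with educational scores, and those synthetic labels train a cheap classifier that can then score a full web dump). This is a minimal, hypothetical Python illustration; the classifier choice (TF-IDF plus logistic regression), the score threshold, and the example data are assumptions for clarity, not the actual FineWeb-Edu pipeline.

# Hypothetical sketch: synthetic LLM annotations -> lightweight filtering classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Small sample of web pages scored 0-5 for "educational value".
# In practice these scores would come from prompting a strong LLM on sampled pages.
llm_annotated = [
    ("Photosynthesis converts light energy into chemical energy that plants store as sugars.", 5),
    ("BUY NOW!!! Limited-time offer on designer sunglasses, free shipping worldwide.", 0),
    # ... many more annotated examples in a real run
]

texts, scores = zip(*llm_annotated)
labels = [1 if s >= 3 else 0 for s in scores]  # binarize: educational vs. not (threshold assumed)

# A classifier cheap enough to run over an entire web dump; the real pipeline
# may use a different model entirely.
classifier = make_pipeline(
    TfidfVectorizer(max_features=50_000),
    LogisticRegression(max_iter=1000),
)
classifier.fit(texts, labels)

def keep(document: str) -> bool:
    """Filtering predicate: keep only documents predicted to be educational."""
    return int(classifier.predict([document])[0]) == 1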