nits
index.html (+4 −4)
@@ -168,11 +168,11 @@
 <d-contents>
 </d-contents>
 
-<p>We have recently released
-(15T gpt2 tokens, 44TB disk space) dataset of clean text sourced from the web for LLM pretraining. You can
+<p>We have recently released <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb"><strong>🍷 FineWeb</strong></a>, our new large scale
+(<strong>15T</strong> gpt2 tokens, <strong>44TB</strong> disk space) dataset of clean text sourced from the web for LLM pretraining. You can
 download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">here</a>.</p>
 <p>[TODO: ADD MORE INTRODUCTION]</p>
-<p>We are also excited to release
+<p>We are also excited to announce the release of <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu"><strong>📚 FineWeb-Edu</strong></a>, a filtered version of FineWeb for educational content, available in two sizes: <strong>1.2 trillion and 4.5 trillion tokens</strong>. FineWeb-Edu outperforms all existing public web datasets, with notable improvements on MMLU, ARC, and OpenBookQA benchmarks. You can
 download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu">here</a>.</p>
 
 <p>As 🍷FineWeb has gathered a lot of interest from the
@@ -265,7 +265,7 @@
 min on a single node of 8 GPUs - done in parallel to the training).</p>
 <aside>You can find the full list of tasks and prompts we used <a
 href="https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/lighteval_tasks.py">here</a>.</aside>
-<h2>The FineWeb recipe</h2>
+<h2>The 🍷 FineWeb recipe</h2>
 <p>In the next subsections we will explain each of the steps
 taken to produce the FineWeb dataset.</p>
 <figure class="l-body">