guipenedo committed
Commit efb8c33 · Parent: 91fa5b9
Files changed (1):
  1. index.html +4 -4
index.html CHANGED
@@ -168,11 +168,11 @@
168
  <d-contents>
169
  </d-contents>
170
 
171
- <p>We have recently released 🍷FineWeb, our new large scale
172
- (15T gpt2 tokens, 44TB disk space) dataset of clean text sourced from the web for LLM pretraining. You can
173
  download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">here</a>.</p>
174
  <p>[TODO: ADD MORE INTRODUCTION]</p>
175
- <p>We are also excited to release 📚 FineWeb-Edu, a filtered version of FineWeb for educational content, available in two sizes: 1.2 trillion and 4.5 trillion tokens. FineWeb-Edu outperforms all existing public web datasets, with notable improvements on MMLU, ARC, and OpenBookQA benchmarks. You can
176
  download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu">here</a>.</p>
177
 
178
  <p>As 🍷FineWeb has gathered a lot of interest from the
@@ -265,7 +265,7 @@
265
  min on a single node of 8 GPUs - done in parallel to the training).</p>
266
  <aside>You can find the full list of tasks and prompts we used <a
267
  href="https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/lighteval_tasks.py">here</a>.</aside>
268
- <h2>The FineWeb recipe</h2>
269
  <p>In the next subsections we will explain each of the steps
270
  taken to produce the FineWeb dataset.</p>
271
  <figure class="l-body">
 
168
  <d-contents>
169
  </d-contents>
170
 
171
+ <p>We have recently released <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb"><strong>🍷 FineWeb</strong></a>, our new large scale
172
+ (<strong>15T</strong> gpt2 tokens, <strong>44TB</strong> disk space) dataset of clean text sourced from the web for LLM pretraining. You can
173
  download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">here</a>.</p>
174
  <p>[TODO: ADD MORE INTRODUCTION]</p>
175
+ <p>We are also excited to announce the release of <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu"><strong>📚 FineWeb-Edu</strong></a>, a filtered version of FineWeb for educational content, available in two sizes: <strong>1.2 trillion and 4.5 trillion tokens</strong>. FineWeb-Edu outperforms all existing public web datasets, with notable improvements on MMLU, ARC, and OpenBookQA benchmarks. You can
176
  download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu">here</a>.</p>
177
 
178
  <p>As 🍷FineWeb has gathered a lot of interest from the
 
265
  min on a single node of 8 GPUs - done in parallel to the training).</p>
266
  <aside>You can find the full list of tasks and prompts we used <a
267
  href="https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/lighteval_tasks.py">here</a>.</aside>
268
+ <h2>The 🍷 FineWeb recipe</h2>
269
  <p>In the next subsections we will explain each of the steps
270
  taken to produce the FineWeb dataset.</p>
271
  <figure class="l-body">
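
Since both added paragraphs point readers at the dataset pages with a "download it here" link, a minimal sketch of what that download amounts to with the 🤗 `datasets` library may help; the `sample-10BT` config name and the `text` column are assumptions about the published dataset layout, not something stated in this diff, and streaming is used so nothing close to the full 44TB has to hit disk.

```python
# Minimal sketch: stream 🍷 FineWeb instead of downloading it in full.
from datasets import load_dataset

fw = load_dataset(
    "HuggingFaceFW/fineweb",
    name="sample-10BT",  # assumed subset config; omit to get the full 15T-token dataset
    split="train",
    streaming=True,      # iterate over rows without materializing 44TB on disk
)

# Peek at a few documents; "text" is the assumed document column.
for row in fw.take(3):
    print(row["text"][:200])
```

The same call with `"HuggingFaceFW/fineweb-edu"` should load 📚 FineWeb-Edu, modulo whatever config names that repository actually exposes.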