guipenedo committed
Commit efb8c33 · Parent: 91fa5b9
Files changed (1):
  1. index.html +4 -4
index.html CHANGED
@@ -168,11 +168,11 @@
168
  <d-contents>
169
  </d-contents>
170
 
171
- <p>We have recently released 🍷FineWeb, our new large scale
172
- (15T gpt2 tokens, 44TB disk space) dataset of clean text sourced from the web for LLM pretraining. You can
173
  download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">here</a>.</p>
174
  <p>[TODO: ADD MORE INTRODUCTION]</p>
175
- <p>We are also excited to release 📚 FineWeb-Edu, a filtered version of FineWeb for educational content, available in two sizes: 1.2 trillion and 4.5 trillion tokens. FineWeb-Edu outperforms all existing public web datasets, with notable improvements on MMLU, ARC, and OpenBookQA benchmarks. You can
176
  download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu">here</a>.</p>
177
 
178
  <p>As 🍷FineWeb has gathered a lot of interest from the
@@ -265,7 +265,7 @@
265
  min on a single node of 8 GPUs - done in parallel to the training).</p>
266
  <aside>You can find the full list of tasks and prompts we used <a
267
  href="https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/lighteval_tasks.py">here</a>.</aside>
268
- <h2>The FineWeb recipe</h2>
269
  <p>In the next subsections we will explain each of the steps
270
  taken to produce the FineWeb dataset.</p>
271
  <figure class="l-body">
 
168
  <d-contents>
169
  </d-contents>
170
 
171
+ <p>We have recently released <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb"><strong>🍷 FineWeb</strong></a>, our new large scale
172
+ (<strong>15T</strong> gpt2 tokens, <strong>44TB</strong> disk space) dataset of clean text sourced from the web for LLM pretraining. You can
173
  download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">here</a>.</p>
174
  <p>[TODO: ADD MORE INTRODUCTION]</p>
175
+ <p>We are also excited to announce the release of <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu"><strong>📚 FineWeb-Edu</strong></a>, a filtered version of FineWeb for educational content, available in two sizes: <strong>1.2 trillion and 4.5 trillion tokens</strong>. FineWeb-Edu outperforms all existing public web datasets, with notable improvements on MMLU, ARC, and OpenBookQA benchmarks. You can
176
  download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu">here</a>.</p>
177
 
178
  <p>As 🍷FineWeb has gathered a lot of interest from the
 
265
  min on a single node of 8 GPUs - done in parallel to the training).</p>
266
  <aside>You can find the full list of tasks and prompts we used <a
267
  href="https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/lighteval_tasks.py">here</a>.</aside>
268
+ <h2>The 🍷 FineWeb recipe</h2>
269
  <p>In the next subsections we will explain each of the steps
270
  taken to produce the FineWeb dataset.</p>
271
  <figure class="l-body">
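
Since both added paragraphs point readers at the dataset pages with a "download it here" link, a minimal sketch of what that download amounts to with the 🤗 `datasets` library may help; the `sample-10BT` config name and the `text` column are assumptions about the published dataset layout, not something stated in this diff, and streaming is used so nothing close to the full 44TB has to hit disk.

```python
# Minimal sketch: stream 🍷 FineWeb instead of downloading it in full.
from datasets import load_dataset

fw = load_dataset(
    "HuggingFaceFW/fineweb",
    name="sample-10BT",  # assumed subset config; omit to get the full 15T-token dataset
    split="train",
    streaming=True,      # iterate over rows without materializing 44TB on disk
)

# Peek at a few documents; "text" is the assumed document column.
for row in fw.take(3):
    print(row["text"][:200])
```

The same call with `"HuggingFaceFW/fineweb-edu"` should load 📚 FineWeb-Edu, modulo whatever config names that repository actually exposes.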