thomwolf HF staff committed on
Commit
974f318
•
1 Parent(s): 4e5005c
dist/index.html CHANGED
@@ -639,15 +639,17 @@
639
  </div>
640
  <p>Here are the key highlights of the ablation results above:</p>
641
  <ul>
642
- <li>📚 FineWeb-Edu surpasses 🍷 FineWeb and all other open web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA.</li>
643
  <li>It achieves the same performance with significantly less data, requiring 10x fewer tokens compared to C4 and Dolma to match MMLU results.</li>
644
  <li>This demonstrates the effectiveness of using classifiers trained on LLM annotations for large-scale data filtering.</li>
645
  </ul>
646
  <p>Given that a threshold of 2 also demonstrated strong performance while retaining more data, we are releasing an additional dataset filtered with this threshold, containing 5.4 trillion tokens under <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-score-2">HuggingFaceFW/fineweb-edu-score-2</a>.</p>
647
  <p>You can find the two datasets along with the classifier used for the filtering in this <a href="https://huggingface.co/collections/HuggingFaceFW/fineweb-edu-6659c3f3d399d0e1d648adfd">collection</a>.</p>
 
648
  <h2>Next steps</h2>
649
- <p>Through our open data efforts we hope to give every model trainer the ability to create state-of-the-art large language models. As part of this process, we plan to continue iterating on FineWeb and to release more specialised filtered subsets of web data, in a fully open and reproducible manner.</p>
650
- <p>While English currently dominates the large language model landscape, we believe that making high quality training data for other languages more easily accessible would allow millions of non english speakers to benefit from these technologies and, as such, will also strive to adapt the FineWeb Recipe to a multilingual version.</p>
 
651
  </d-article>
652
 
653
  <d-appendix>
 
639
  </div>
640
  <p>Here are the key highlights of the ablation results above:</p>
641
  <ul>
642
+ <li>📚 FineWeb-Edu <strong>surpasses 🍷 FineWeb and all other open web datasets, with remarkable improvements on educational benchmarks</strong> such as MMLU, ARC, and OpenBookQA.</li>
643
  <li>It achieves the same performance with significantly less data, requiring 10x fewer tokens compared to C4 and Dolma to match MMLU results.</li>
644
  <li>This demonstrates the effectiveness of using classifiers trained on LLM annotations for large-scale data filtering.</li>
645
  </ul>
646
  <p>Given that a threshold of 2 also demonstrated strong performance while retaining more data, we are releasing an additional dataset filtered with this threshold, containing 5.4 trillion tokens under <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-score-2">HuggingFaceFW/fineweb-edu-score-2</a>.</p>
647
  <p>You can find the two datasets along with the classifier used for the filtering in this <a href="https://huggingface.co/collections/HuggingFaceFW/fineweb-edu-6659c3f3d399d0e1d648adfd">collection</a>.</p>
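<p>For instance, here is a minimal sketch of how the two datasets and the classifier can be loaded, assuming the standard <code>datasets</code>/<code>transformers</code> APIs and a classifier repo id of <code>HuggingFaceFW/fineweb-edu-classifier</code> (check the collection above for the exact repo name and column layout):</p>
<pre><code class="language-python">
# Sketch: stream the released datasets and score a document with the
# filtering classifier. Repo ids marked as assumed should be verified
# against the collection linked above.
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Stream FineWeb-Edu (threshold 3) and the larger score-2 variant so that
# nothing needs to be downloaded in full.
edu = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)
edu_score_2 = load_dataset("HuggingFaceFW/fineweb-edu-score-2", split="train", streaming=True)

doc = next(iter(edu))
print(doc["text"][:200])

# Score an arbitrary document with the educational-quality classifier
# (repo id assumed here; see the collection for the exact name).
repo_id = "HuggingFaceFW/fineweb-edu-classifier"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForSequenceClassification.from_pretrained(repo_id)

inputs = tokenizer(doc["text"], return_tensors="pt", truncation=True)
with torch.no_grad():
    score = model(**inputs).logits.squeeze(-1).item()

# FineWeb-Edu keeps documents scoring at least 3; fineweb-edu-score-2 keeps
# everything scoring at least 2.
print(f"educational score: {score:.2f}", "kept" if score >= 3 else "filtered out")
</code></pre>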
648
+
649
  <h2>Next steps</h2>
650
+ <p>Through our open science efforts we hope to keep opening up the black box around training high-quality large language models, and to give every model trainer the ability to create state-of-the-art LLMs. We're excited to continue iterating on FineWeb and to release increasingly better filtered subsets of web data, in a fully open and reproducible manner.</p>
651
+ <p>Moreover, while English currently dominates the large language model landscape, we're looking forward to applying the learnings from projects like this one to make high-quality training data more easily accessible in other languages as well.</p>
652
+ <p>In a nutshell: the future is bright and exciting for the science of creating datasets at scale, in the open.</p>
653
  </d-article>
654
 
655
  <d-appendix>
dist/main.bundle.js CHANGED
The diff for this file is too large to render. See raw diff
 
dist/main.bundle.js.map CHANGED
The diff for this file is too large to render. See raw diff
 
src/clusters.js CHANGED
@@ -12,7 +12,7 @@ const DEFAULT_XAXIS = {
12
  showgrid: false,
13
  zeroline: false,
14
  title: {
15
- text: "The 🍷 FineWeb dataset, <a href='https://github.com/huggingface/text-clustering' target='_blank' style='color: inherit;'>clustered</a> and annotated with educational score labels",
16
  font: {
17
  size: 16,
18
  style: "italic",
 
12
  showgrid: false,
13
  zeroline: false,
14
  title: {
15
+ text: "<a href='https://github.com/huggingface/text-clustering' target='_blank' style='color: inherit;'>The 🍷 FineWeb dataset, clustered and annotated with educational score labels</a>",
16
  font: {
17
  size: 16,
18
  style: "italic",
src/index.html CHANGED
@@ -639,15 +639,17 @@
639
  </div>
640
  <p>Here are the key highlights of the ablation results above:</p>
641
  <ul>
642
- <li>📚 FineWeb-Edu surpasses 🍷 FineWeb and all other open web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA.</li>
643
  <li>It achieves the same performance with significantly less data, requiring 10x fewer tokens compared to C4 and Dolma to match MMLU results.</li>
644
  <li>This demonstrates the effectiveness of using classifiers trained on LLM annotations for large-scale data filtering.</li>
645
  </ul>
646
  <p>Given that a threshold of 2 also demonstrated strong performance while retaining more data, we are releasing an additional dataset filtered with this threshold, containing 5.4 trillion tokens under <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-score-2">HuggingFaceFW/fineweb-edu-score-2</a>.</p>
647
  <p>You can find the two datasets along with the classifier used for the filtering in this <a href="https://huggingface.co/collections/HuggingFaceFW/fineweb-edu-6659c3f3d399d0e1d648adfd">collection</a>.</p>
 
648
  <h2>Next steps</h2>
649
- <p>Through our open data efforts we hope to give every model trainer the ability to create state-of-the-art large language models. As part of this process, we plan to continue iterating on FineWeb and to release more specialised filtered subsets of web data, in a fully open and reproducible manner.</p>
650
- <p>While English currently dominates the large language model landscape, we believe that making high quality training data for other languages more easily accessible would allow millions of non english speakers to benefit from these technologies and, as such, will also strive to adapt the FineWeb Recipe to a multilingual version.</p>
 
651
  </d-article>
652
 
653
  <d-appendix>
 
639
  </div>
640
  <p>Here are the key highlights of the ablation results above:</p>
641
  <ul>
642
+ <li>📚 FineWeb-Edu <strong>surpasses 🍷 FineWeb and all other open web datasets, with remarkable improvements on educational benchmarks</strong> such as MMLU, ARC, and OpenBookQA.</li>
643
  <li>It achieves the same performance with significantly less data, requiring 10x fewer tokens compared to C4 and Dolma to match MMLU results.</li>
644
  <li>This demonstrates the effectiveness of using classifiers trained on LLM annotations for large-scale data filtering.</li>
645
  </ul>
646
  <p>Given that a threshold of 2 also demonstrated strong performance while retaining more data, we are releasing an additional dataset filtered with this threshold, containing 5.4 trillion tokens under <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-score-2">HuggingFaceFW/fineweb-edu-score-2</a>.</p>
647
  <p>You can find the two datasets along with the classifier used for the filtering in this <a href="https://huggingface.co/collections/HuggingFaceFW/fineweb-edu-6659c3f3d399d0e1d648adfd">collection</a>.</p>
648
+
649
  <h2>Next steps</h2>
650
+ <p>Through our open science efforts we hope to keep opening up the black box around training high-quality large language models, and to give every model trainer the ability to create state-of-the-art LLMs. We're excited to continue iterating on FineWeb and to release increasingly better filtered subsets of web data, in a fully open and reproducible manner.</p>
651
+ <p>Moreover, while English currently dominates the large language model landscape, we're looking forward to applying the learnings from projects like this one to make high-quality training data more easily accessible in other languages as well.</p>
652
+ <p>In a nutshell: the future is bright and exciting for the science of creating datasets at scale, in the open.</p>
653
  </d-article>
654
 
655
  <d-appendix>