update
- dist/index.html +5 -3
- dist/main.bundle.js +0 -0
- dist/main.bundle.js.map +0 -0
- src/clusters.js +1 -1
- src/index.html +5 -3
dist/index.html
CHANGED
@@ -639,15 +639,17 @@
         </div>
         <p>Here are the key highlights of the ablation results above:</p>
         <ul>
-            <li>📚 FineWeb-Edu surpasses 🍷 FineWeb and all other open web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA.</li>
+            <li>📚 FineWeb-Edu <strong>surpasses 🍷 FineWeb and all other open web datasets, with remarkable improvements on educational benchmarks</strong> such as MMLU, ARC, and OpenBookQA.</li>
             <li>It achieves the same performance with significantly less data, requiring 10x fewer tokens compared to C4 and Dolma to match MMLU results.</li>
             <li>This demonstrates the effectiveness of using classifiers trained on LLM annotations for large-scale data filtering.</li>
         </ul>
         <p>Given that a threshold of 2 also demonstrated strong performance while retaining more data, we are releasing an additional dataset filtered with this threshold, containing 5.4 trillion tokens under <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-score-2">HuggingFaceFW/fineweb-edu-score-2</a>.</p>
         <p>You can find the two datasets along with the classifier used for the filtering in this <a href="https://huggingface.co/collections/HuggingFaceFW/fineweb-edu-6659c3f3d399d0e1d648adfd">collection</a>.</p>
+
         <h2>Next steps</h2>
-        <p>Through our open
-        <p>
+        <p>Through our open science efforts we hope to increasingly open up the black box around training high-quality large language models, and to give every model trainer the ability to create state-of-the-art LLMs. We're excited to continue iterating on FineWeb and to release increasingly well-filtered subsets of web data, in a fully open and reproducible manner.</p>
+        <p>Moreover, while English currently dominates the large language model landscape, we're looking forward to applying the learnings from projects like this one to make high-quality training data available in other languages as well, and more easily accessible.</p>
+        <p>In a nutshell: the future is bright and exciting for studying the science of creating datasets at scale in the open.</p>
     </d-article>

     <d-appendix>
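To make the filtering recipe in the diff above concrete, here is a minimal sketch of scoring a document with the released classifier and applying the integer thresholds discussed in the post. It assumes the classifier in the linked collection (HuggingFaceFW/fineweb-edu-classifier) loads as a standard transformers sequence-classification model whose single regression logit is the educational score; the example text and the clamping details are illustrative, not the exact pipeline code.

```python
# Minimal sketch (not the exact FineWeb pipeline code): score one document and
# apply the integer thresholds from the post.
# Assumption: the classifier from the linked collection
# ("HuggingFaceFW/fineweb-edu-classifier") is a standard transformers
# sequence-classification model with a single regression logit.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "HuggingFaceFW/fineweb-edu-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "Photosynthesis is the process by which plants convert light into chemical energy."
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    score = model(**inputs).logits.squeeze(-1).item()  # raw educational score

int_score = int(round(max(0.0, min(score, 5.0))))  # clamp to the 0-5 annotation scale
print(f"score={score:.2f}, int_score={int_score}")
print("kept in FineWeb-Edu (threshold 3):", int_score >= 3)
print("kept in fineweb-edu-score-2 (threshold 2):", int_score >= 2)
```

Per the full blog post, FineWeb-Edu keeps documents with an integer score of 3 or higher, while the score-2 release relaxes this to 2, trading some precision for the larger 5.4 trillion token pool.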
dist/main.bundle.js
CHANGED
The diff for this file is too large to render.

dist/main.bundle.js.map
CHANGED
The diff for this file is too large to render.
src/clusters.js
CHANGED
@@ -12,7 +12,7 @@ const DEFAULT_XAXIS = {
     showgrid: false,
     zeroline: false,
     title: {
-        text: "
+        text: "<a href='https://github.com/huggingface/text-clustering' target='_blank' style='color: inherit;'>The 🍷 FineWeb dataset, clustered and annotated with educational score labels</a>",
         font: {
             size: 16,
             style: "italic",
src/index.html
CHANGED
@@ -639,15 +639,17 @@
         </div>
         <p>Here are the key highlights of the ablation results above:</p>
         <ul>
-            <li>📚 FineWeb-Edu surpasses 🍷 FineWeb and all other open web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA.</li>
+            <li>📚 FineWeb-Edu <strong>surpasses 🍷 FineWeb and all other open web datasets, with remarkable improvements on educational benchmarks</strong> such as MMLU, ARC, and OpenBookQA.</li>
             <li>It achieves the same performance with significantly less data, requiring 10x fewer tokens compared to C4 and Dolma to match MMLU results.</li>
             <li>This demonstrates the effectiveness of using classifiers trained on LLM annotations for large-scale data filtering.</li>
         </ul>
         <p>Given that a threshold of 2 also demonstrated strong performance while retaining more data, we are releasing an additional dataset filtered with this threshold, containing 5.4 trillion tokens under <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-score-2">HuggingFaceFW/fineweb-edu-score-2</a>.</p>
         <p>You can find the two datasets along with the classifier used for the filtering in this <a href="https://huggingface.co/collections/HuggingFaceFW/fineweb-edu-6659c3f3d399d0e1d648adfd">collection</a>.</p>
+
         <h2>Next steps</h2>
-        <p>Through our open
-        <p>
+        <p>Through our open science efforts we hope to increasingly open up the black box around training high-quality large language models, and to give every model trainer the ability to create state-of-the-art LLMs. We're excited to continue iterating on FineWeb and to release increasingly well-filtered subsets of web data, in a fully open and reproducible manner.</p>
+        <p>Moreover, while English currently dominates the large language model landscape, we're looking forward to applying the learnings from projects like this one to make high-quality training data available in other languages as well, and more easily accessible.</p>
+        <p>In a nutshell: the future is bright and exciting for studying the science of creating datasets at scale in the open.</p>
     </d-article>

     <d-appendix>
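To complement the scoring snippet above, here is a similarly hedged sketch of peeking at the newly released threshold-2 dataset via streaming, so the 5.4 trillion tokens need not be downloaded. The split and column names are assumptions based on the usual FineWeb layout rather than anything stated in this commit, and should be checked against the dataset card.

```python
# Minimal sketch: stream a few documents from the threshold-2 release without
# downloading the full dataset. The split name ("train") and the
# "text"/"int_score" columns follow the usual FineWeb conventions and are
# assumptions to verify against the dataset card.
from datasets import load_dataset

ds = load_dataset("HuggingFaceFW/fineweb-edu-score-2", split="train", streaming=True)

for i, row in enumerate(ds):
    print(row.get("int_score"), row["text"][:100].replace("\n", " "))
    if i >= 4:  # just peek at the first five documents
        break
```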