update

- dist/index.html  +3 -3
- src/index.html   +2 -2

dist/index.html  CHANGED
@@ -535,7 +535,7 @@
 </div>
 <p>These filters allowed us to further improve performance and to, notably, surpass the C4 dataset performance while providing a much larger dataset at the same time.</p>

-<
+<h3>The final FineWeb dataset</h3>
 <p>The final <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">🍷 FineWeb</a> dataset comprises 15T tokens and
 includes the following previously mentioned steps, in order, each providing a performance boost on our group
 of benchmark tasks:</p>
@@ -555,7 +555,7 @@
 <figure><img src="assets/images/filtering_steps.png"/></figure>
 <div id="plot-all_filtering_steps"></div>
 </div>
-<
+<h4>Comparisons with other web-scale datasets</h4>
 <p>We compared <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">🍷 FineWeb</a> with the following datasets that are usually considered the highest quality web-scale datasets openly accessible (we also indicate for each the approximate number of tokens in the public version of the dataset):</p>
 <ul>
 <li><a
@@ -617,7 +617,7 @@
 <p>In terms of open-weight model to use for annotating the data, we experimented with several models including <a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1">Mixtral-8x7B-Instruct</a> and <a href="https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1">Mixtral-8x22B-Instruct</a>, <a href="https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct">Llama-3-70B-Instruct</a> as well as a jury gathering these three models<d-cite bibtex-key="verga2024replacing"></d-cite>. In our experimentations, we found that using Llama3 alone gave the most reliable results.</p>

 <h3>Training a classifier</h3>
-<p>To scale our annotation to the trillion tokens of FineWeb, we trained a classifier from the 450k annotation of our Llama3-70B model. The model we used was a <a href="https://huggingface.co/Snowflake/snowflake-arctic-embed-m">Snowflake-arctic-embed</a> embedding model with a classification head with a single regression output on top of it. We trained this model on 450,000 Llama 3 annotations for 20 epochs with a learning rate of 3e-4, freezing the embedding and encoder layers. We saved the checkpoint with the highest F1 score on our held-out validation set of 45k samples, treating Llama 3 annotations as ground-truth. After training, we rounded the scores to integers from <code>
+<p>To scale our annotation to the trillion tokens of FineWeb, we trained a classifier from the 450k annotation of our Llama3-70B model. The model we used was a <a href="https://huggingface.co/Snowflake/snowflake-arctic-embed-m">Snowflake-arctic-embed</a> embedding model with a classification head with a single regression output on top of it. We trained this model on 450,000 Llama 3 annotations for 20 epochs with a learning rate of 3e-4, freezing the embedding and encoder layers. We saved the checkpoint with the highest F1 score on our held-out validation set of 45k samples, treating Llama 3 annotations as ground-truth. After training, we rounded the scores to integers from <code>0</code> to <code>5</code>.</p>
 <p>We then converted the problem to a binary classification task by using a fixed threshold to determine if a file is educational. With a threshold of <code>3</code>, the model achieved an F1 score of 82% on the validation set, indicating strong performance in distinguishing high-quality educational content.</p>
 <p>The classifier is available at: <a href="https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier">HuggingFaceFW/fineweb-edu-classifier</a>. The training and inference code is available on <a href="https://github.com/huggingface/cosmopedia/tree/main/classification">GitHub</a>.</p>
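The classifier paragraph restored in the hunk above (a frozen Snowflake-arctic-embed-m encoder with a single regression output, trained for 20 epochs at a learning rate of 3e-4 on 450k Llama 3 annotations, keeping the checkpoint with the best F1 on a 45k held-out set) maps onto a fairly standard fine-tuning recipe. Below is a minimal sketch of such a setup using the Hugging Face transformers Trainer; the annotations file name, the text/score column names, the batch size, and the macro F1 averaging are assumptions for illustration, not the actual FineWeb training code (which is linked from the GitHub repository referenced above).

```python
# Sketch (not the official FineWeb code): regression head on a frozen
# Snowflake-arctic-embed-m encoder, trained on Llama 3 quality scores (0-5).
import numpy as np
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

base = "Snowflake/snowflake-arctic-embed-m"
tokenizer = AutoTokenizer.from_pretrained(base)
# num_labels=1 adds a classification head with a single regression output
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=1)

# Freeze the embedding and encoder layers; only the head is trained
for param in model.base_model.parameters():
    param.requires_grad = False

# Hypothetical annotations file with "text" and "score" (0-5) columns
dataset = load_dataset("json", data_files="llama3_annotations.jsonl", split="train")

def preprocess(batch):
    enc = tokenizer(batch["text"], truncation=True, max_length=512)
    enc["labels"] = [float(s) for s in batch["score"]]  # float labels -> MSE loss
    return enc

dataset = dataset.map(preprocess, batched=True)
splits = dataset.train_test_split(test_size=0.1, seed=0)  # ~45k held-out samples

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Round and clamp predictions to integer scores 0-5, then compare against
    # the Llama 3 annotations treated as ground truth
    preds = np.clip(np.round(logits.squeeze()), 0, 5).astype(int)
    return {"f1": f1_score(np.round(labels).astype(int), preds, average="macro")}

args = TrainingArguments(
    output_dir="edu-classifier",
    num_train_epochs=20,
    learning_rate=3e-4,
    per_device_train_batch_size=128,  # batch size is not stated in the text
    eval_strategy="epoch",            # "evaluation_strategy" on older transformers
    save_strategy="epoch",
    load_best_model_at_end=True,      # keep the checkpoint with the best eval F1
    metric_for_best_model="f1",
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=compute_metrics,
)
trainer.train()
```

The thresholding step described in the following paragraph (a document counts as educational when its rounded score reaches 3) then becomes a one-line check at inference time. A short sketch using the released checkpoint linked in the diff, again as an illustration rather than the official snippet:

```python
# Sketch: score one document with the released classifier and apply the
# threshold of 3 used in the text to get a binary "educational" label.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

repo = "HuggingFaceFW/fineweb-edu-classifier"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo)

text = "Photosynthesis is the process by which plants turn light into chemical energy."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()

int_score = int(round(min(max(score, 0), 5)))  # clamp and round to the 0-5 scale
is_educational = int_score >= 3
print(score, int_score, is_educational)
```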
src/index.html  CHANGED

@@ -535,7 +535,7 @@
 </div>
 <p>These filters allowed us to further improve performance and to, notably, surpass the C4 dataset performance while providing a much larger dataset at the same time.</p>

-<
+<h3>The final FineWeb dataset</h3>
 <p>The final <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">🍷 FineWeb</a> dataset comprises 15T tokens and
 includes the following previously mentioned steps, in order, each providing a performance boost on our group
 of benchmark tasks:</p>
@@ -555,7 +555,7 @@
 <figure><img src="assets/images/filtering_steps.png"/></figure>
 <div id="plot-all_filtering_steps"></div>
 </div>
-<
+<h4>Comparisons with other web-scale datasets</h4>
 <p>We compared <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">🍷 FineWeb</a> with the following datasets that are usually considered the highest quality web-scale datasets openly accessible (we also indicate for each the approximate number of tokens in the public version of the dataset):</p>
 <ul>
 <li><a