guipenedo (HF Staff) committed
Commit 60cf22f · verified · 1 parent: d06e14e

Update dist/index.html

Files changed (1)
  1. dist/index.html +13 -0
dist/index.html CHANGED
@@ -704,6 +704,19 @@
  <p>Through our open science efforts we hope to keep shining a light on the black box that is the training of high performance large language models as well as to give every model trainer the ability to create state-of-the-art LLMs. We are excited to continue iterating on FineWeb and to release increasingly better filtered subsets of web data, in a fully open and reproducible manner.</p>
  <p>In the short term, we are looking forward to applying the learnings from (English) FineWeb to other languages. While English currently dominates the LLM landscape, we believe that making high quality web data in other languages as accessible as possible would be incredibly impactful.</p>
  <p>In a nutshell: the future is bright and exciting for studying the science of creating datasets at scale and in the open 🤗.</p>
+
+ <h2>Citation</h2>
+ <dt-code block language="bibtex">
+ @misc{penedo2024finewebdatasetsdecantingweb,
+ title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
+ author={Guilherme Penedo and Hynek Kydlíček and Loubna Ben allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
+ year={2024},
+ eprint={2406.17557},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL},
+ url={https://arxiv.org/abs/2406.17557},
+ }
+ </dt-code>
  </d-article>
 
  <d-appendix>