---
title: README
emoji: 👀
colorFrom: purple
colorTo: pink
sdk: static
pinned: false
---

# 🤗 HuggingFace 🍷 FineWeb datasets
This organization hosts the 🍷 FineWeb datasets, a collection of text datasets sourced from the web ([CommonCrawl](https://commoncrawl.org/)), released under a permissive license ([ODC-By](https://opendatacommons.org/licenses/by/1-0/)). 

Creating 🍷 FineWeb involved careful processing and filtering of large amounts of web data, with the aim of lowering the barrier to entry for anyone intending to pretrain high-performance large language models.

All code and artefacts needed for reproduction are public and built on top of open source libraries, such as the 🤗 libraries [`datatrove`](https://github.com/huggingface/datatrove/), [`nanotron`](https://github.com/huggingface/nanotron/) or [`lighteval`](https://github.com/huggingface/lighteval/).


_Currently releasing v1_