khulnasoft committed
Commit 295f278 · verified · 1 Parent(s): b547962

Update README.md

Files changed (1)
  1. README.md +35 -545
README.md CHANGED
@@ -4,7 +4,7 @@ task_categories:
4
  - text-generation
5
  language:
6
  - en
7
- pretty_name: FineWeb
8
  size_categories:
9
  - n>1T
10
  configs:
@@ -13,415 +13,20 @@ configs:
13
  - split: train
14
  path: data/*/*
15
  - config_name: sample-10BT
16
- data_files:
17
- - split: train
18
- path: sample/10BT/*
19
- - config_name: sample-100BT
20
- data_files:
21
- - split: train
22
- path: sample/100BT/*
23
- - config_name: sample-350BT
24
- data_files:
25
- - split: train
26
- path: sample/350BT/*
27
- - config_name: CC-MAIN-2024-18
28
- data_files:
29
- - split: train
30
- path: data/CC-MAIN-2024-18/*
31
- - config_name: CC-MAIN-2024-10
32
- data_files:
33
- - split: train
34
- path: data/CC-MAIN-2024-10/*
35
- - config_name: CC-MAIN-2023-50
36
- data_files:
37
- - split: train
38
- path: data/CC-MAIN-2023-50/*
39
- - config_name: CC-MAIN-2023-40
40
- data_files:
41
- - split: train
42
- path: data/CC-MAIN-2023-40/*
43
- - config_name: CC-MAIN-2023-23
44
- data_files:
45
- - split: train
46
- path: data/CC-MAIN-2023-23/*
47
- - config_name: CC-MAIN-2023-14
48
- data_files:
49
- - split: train
50
- path: data/CC-MAIN-2023-14/*
51
- - config_name: CC-MAIN-2023-06
52
- data_files:
53
- - split: train
54
- path: data/CC-MAIN-2023-06/*
55
- - config_name: CC-MAIN-2022-49
56
- data_files:
57
- - split: train
58
- path: data/CC-MAIN-2022-49/*
59
- - config_name: CC-MAIN-2022-40
60
- data_files:
61
- - split: train
62
- path: data/CC-MAIN-2022-40/*
63
- - config_name: CC-MAIN-2022-33
64
- data_files:
65
- - split: train
66
- path: data/CC-MAIN-2022-33/*
67
- - config_name: CC-MAIN-2022-27
68
- data_files:
69
- - split: train
70
- path: data/CC-MAIN-2022-27/*
71
- - config_name: CC-MAIN-2022-21
72
- data_files:
73
- - split: train
74
- path: data/CC-MAIN-2022-21/*
75
- - config_name: CC-MAIN-2022-05
76
- data_files:
77
- - split: train
78
- path: data/CC-MAIN-2022-05/*
79
- - config_name: CC-MAIN-2021-49
80
- data_files:
81
- - split: train
82
- path: data/CC-MAIN-2021-49/*
83
- - config_name: CC-MAIN-2021-43
84
- data_files:
85
- - split: train
86
- path: data/CC-MAIN-2021-43/*
87
- - config_name: CC-MAIN-2021-39
88
- data_files:
89
- - split: train
90
- path: data/CC-MAIN-2021-39/*
91
- - config_name: CC-MAIN-2021-31
92
- data_files:
93
- - split: train
94
- path: data/CC-MAIN-2021-31/*
95
- - config_name: CC-MAIN-2021-25
96
- data_files:
97
- - split: train
98
- path: data/CC-MAIN-2021-25/*
99
- - config_name: CC-MAIN-2021-21
100
- data_files:
101
- - split: train
102
- path: data/CC-MAIN-2021-21/*
103
- - config_name: CC-MAIN-2021-17
104
- data_files:
105
- - split: train
106
- path: data/CC-MAIN-2021-17/*
107
- - config_name: CC-MAIN-2021-10
108
- data_files:
109
- - split: train
110
- path: data/CC-MAIN-2021-10/*
111
- - config_name: CC-MAIN-2021-04
112
- data_files:
113
- - split: train
114
- path: data/CC-MAIN-2021-04/*
115
- - config_name: CC-MAIN-2020-50
116
- data_files:
117
- - split: train
118
- path: data/CC-MAIN-2020-50/*
119
- - config_name: CC-MAIN-2020-45
120
- data_files:
121
- - split: train
122
- path: data/CC-MAIN-2020-45/*
123
- - config_name: CC-MAIN-2020-40
124
- data_files:
125
- - split: train
126
- path: data/CC-MAIN-2020-40/*
127
- - config_name: CC-MAIN-2020-34
128
- data_files:
129
- - split: train
130
- path: data/CC-MAIN-2020-34/*
131
- - config_name: CC-MAIN-2020-29
132
- data_files:
133
- - split: train
134
- path: data/CC-MAIN-2020-29/*
135
- - config_name: CC-MAIN-2020-24
136
- data_files:
137
- - split: train
138
- path: data/CC-MAIN-2020-24/*
139
- - config_name: CC-MAIN-2020-16
140
- data_files:
141
- - split: train
142
- path: data/CC-MAIN-2020-16/*
143
- - config_name: CC-MAIN-2020-10
144
- data_files:
145
- - split: train
146
- path: data/CC-MAIN-2020-10/*
147
- - config_name: CC-MAIN-2020-05
148
- data_files:
149
- - split: train
150
- path: data/CC-MAIN-2020-05/*
151
- - config_name: CC-MAIN-2019-51
152
- data_files:
153
- - split: train
154
- path: data/CC-MAIN-2019-51/*
155
- - config_name: CC-MAIN-2019-47
156
- data_files:
157
- - split: train
158
- path: data/CC-MAIN-2019-47/*
159
- - config_name: CC-MAIN-2019-43
160
- data_files:
161
- - split: train
162
- path: data/CC-MAIN-2019-43/*
163
- - config_name: CC-MAIN-2019-39
164
- data_files:
165
- - split: train
166
- path: data/CC-MAIN-2019-39/*
167
- - config_name: CC-MAIN-2019-35
168
- data_files:
169
- - split: train
170
- path: data/CC-MAIN-2019-35/*
171
- - config_name: CC-MAIN-2019-30
172
- data_files:
173
- - split: train
174
- path: data/CC-MAIN-2019-30/*
175
- - config_name: CC-MAIN-2019-26
176
- data_files:
177
- - split: train
178
- path: data/CC-MAIN-2019-26/*
179
- - config_name: CC-MAIN-2019-22
180
- data_files:
181
- - split: train
182
- path: data/CC-MAIN-2019-22/*
183
- - config_name: CC-MAIN-2019-18
184
- data_files:
185
- - split: train
186
- path: data/CC-MAIN-2019-18/*
187
- - config_name: CC-MAIN-2019-13
188
- data_files:
189
- - split: train
190
- path: data/CC-MAIN-2019-13/*
191
- - config_name: CC-MAIN-2019-09
192
- data_files:
193
- - split: train
194
- path: data/CC-MAIN-2019-09/*
195
- - config_name: CC-MAIN-2019-04
196
- data_files:
197
- - split: train
198
- path: data/CC-MAIN-2019-04/*
199
- - config_name: CC-MAIN-2018-51
200
- data_files:
201
- - split: train
202
- path: data/CC-MAIN-2018-51/*
203
- - config_name: CC-MAIN-2018-47
204
- data_files:
205
- - split: train
206
- path: data/CC-MAIN-2018-47/*
207
- - config_name: CC-MAIN-2018-43
208
- data_files:
209
- - split: train
210
- path: data/CC-MAIN-2018-43/*
211
- - config_name: CC-MAIN-2018-39
212
- data_files:
213
- - split: train
214
- path: data/CC-MAIN-2018-39/*
215
- - config_name: CC-MAIN-2018-34
216
- data_files:
217
- - split: train
218
- path: data/CC-MAIN-2018-34/*
219
- - config_name: CC-MAIN-2018-30
220
- data_files:
221
- - split: train
222
- path: data/CC-MAIN-2018-30/*
223
- - config_name: CC-MAIN-2018-26
224
- data_files:
225
- - split: train
226
- path: data/CC-MAIN-2018-26/*
227
- - config_name: CC-MAIN-2018-22
228
- data_files:
229
- - split: train
230
- path: data/CC-MAIN-2018-22/*
231
- - config_name: CC-MAIN-2018-17
232
- data_files:
233
- - split: train
234
- path: data/CC-MAIN-2018-17/*
235
- - config_name: CC-MAIN-2018-13
236
- data_files:
237
- - split: train
238
- path: data/CC-MAIN-2018-13/*
239
- - config_name: CC-MAIN-2018-09
240
- data_files:
241
- - split: train
242
- path: data/CC-MAIN-2018-09/*
243
- - config_name: CC-MAIN-2018-05
244
- data_files:
245
- - split: train
246
- path: data/CC-MAIN-2018-05/*
247
- - config_name: CC-MAIN-2017-51
248
- data_files:
249
- - split: train
250
- path: data/CC-MAIN-2017-51/*
251
- - config_name: CC-MAIN-2017-47
252
- data_files:
253
- - split: train
254
- path: data/CC-MAIN-2017-47/*
255
- - config_name: CC-MAIN-2017-43
256
- data_files:
257
- - split: train
258
- path: data/CC-MAIN-2017-43/*
259
- - config_name: CC-MAIN-2017-39
260
- data_files:
261
- - split: train
262
- path: data/CC-MAIN-2017-39/*
263
- - config_name: CC-MAIN-2017-34
264
- data_files:
265
- - split: train
266
- path: data/CC-MAIN-2017-34/*
267
- - config_name: CC-MAIN-2017-30
268
- data_files:
269
- - split: train
270
- path: data/CC-MAIN-2017-30/*
271
- - config_name: CC-MAIN-2017-26
272
- data_files:
273
- - split: train
274
- path: data/CC-MAIN-2017-26/*
275
- - config_name: CC-MAIN-2017-22
276
- data_files:
277
- - split: train
278
- path: data/CC-MAIN-2017-22/*
279
- - config_name: CC-MAIN-2017-17
280
- data_files:
281
- - split: train
282
- path: data/CC-MAIN-2017-17/*
283
- - config_name: CC-MAIN-2017-13
284
- data_files:
285
- - split: train
286
- path: data/CC-MAIN-2017-13/*
287
- - config_name: CC-MAIN-2017-09
288
- data_files:
289
- - split: train
290
- path: data/CC-MAIN-2017-09/*
291
- - config_name: CC-MAIN-2017-04
292
- data_files:
293
- - split: train
294
- path: data/CC-MAIN-2017-04/*
295
- - config_name: CC-MAIN-2016-50
296
- data_files:
297
- - split: train
298
- path: data/CC-MAIN-2016-50/*
299
- - config_name: CC-MAIN-2016-44
300
- data_files:
301
- - split: train
302
- path: data/CC-MAIN-2016-44/*
303
- - config_name: CC-MAIN-2016-40
304
- data_files:
305
- - split: train
306
- path: data/CC-MAIN-2016-40/*
307
- - config_name: CC-MAIN-2016-36
308
- data_files:
309
- - split: train
310
- path: data/CC-MAIN-2016-36/*
311
- - config_name: CC-MAIN-2016-30
312
- data_files:
313
- - split: train
314
- path: data/CC-MAIN-2016-30/*
315
- - config_name: CC-MAIN-2016-26
316
- data_files:
317
- - split: train
318
- path: data/CC-MAIN-2016-26/*
319
- - config_name: CC-MAIN-2016-22
320
- data_files:
321
- - split: train
322
- path: data/CC-MAIN-2016-22/*
323
- - config_name: CC-MAIN-2016-18
324
- data_files:
325
- - split: train
326
- path: data/CC-MAIN-2016-18/*
327
- - config_name: CC-MAIN-2016-07
328
- data_files:
329
- - split: train
330
- path: data/CC-MAIN-2016-07/*
331
- - config_name: CC-MAIN-2015-48
332
- data_files:
333
- - split: train
334
- path: data/CC-MAIN-2015-48/*
335
- - config_name: CC-MAIN-2015-40
336
- data_files:
337
- - split: train
338
- path: data/CC-MAIN-2015-40/*
339
- - config_name: CC-MAIN-2015-35
340
- data_files:
341
- - split: train
342
- path: data/CC-MAIN-2015-35/*
343
- - config_name: CC-MAIN-2015-32
344
- data_files:
345
- - split: train
346
- path: data/CC-MAIN-2015-32/*
347
- - config_name: CC-MAIN-2015-27
348
- data_files:
349
- - split: train
350
- path: data/CC-MAIN-2015-27/*
351
- - config_name: CC-MAIN-2015-22
352
- data_files:
353
- - split: train
354
- path: data/CC-MAIN-2015-22/*
355
- - config_name: CC-MAIN-2015-18
356
- data_files:
357
- - split: train
358
- path: data/CC-MAIN-2015-18/*
359
- - config_name: CC-MAIN-2015-14
360
- data_files:
361
- - split: train
362
- path: data/CC-MAIN-2015-14/*
363
- - config_name: CC-MAIN-2015-11
364
- data_files:
365
- - split: train
366
- path: data/CC-MAIN-2015-11/*
367
- - config_name: CC-MAIN-2015-06
368
- data_files:
369
- - split: train
370
- path: data/CC-MAIN-2015-06/*
371
- - config_name: CC-MAIN-2014-52
372
- data_files:
373
- - split: train
374
- path: data/CC-MAIN-2014-52/*
375
- - config_name: CC-MAIN-2014-49
376
- data_files:
377
- - split: train
378
- path: data/CC-MAIN-2014-49/*
379
- - config_name: CC-MAIN-2014-42
380
- data_files:
381
- - split: train
382
- path: data/CC-MAIN-2014-42/*
383
- - config_name: CC-MAIN-2014-41
384
- data_files:
385
- - split: train
386
- path: data/CC-MAIN-2014-41/*
387
- - config_name: CC-MAIN-2014-35
388
- data_files:
389
- - split: train
390
- path: data/CC-MAIN-2014-35/*
391
- - config_name: CC-MAIN-2014-23
392
- data_files:
393
- - split: train
394
- path: data/CC-MAIN-2014-23/*
395
- - config_name: CC-MAIN-2014-15
396
- data_files:
397
- - split: train
398
- path: data/CC-MAIN-2014-15/*
399
- - config_name: CC-MAIN-2014-10
400
- data_files:
401
- - split: train
402
- path: data/CC-MAIN-2014-10/*
403
- - config_name: CC-MAIN-2013-48
404
- data_files:
405
- - split: train
406
- path: data/CC-MAIN-2013-48/*
407
- - config_name: CC-MAIN-2013-20
408
- data_files:
409
- - split: train
410
- path: data/CC-MAIN-2013-20/*
411
  ---
412
- # 🍷 FineWeb
413
  <center>
414
- <img src="https://huggingface.co/datasets/HuggingFaceFW/admin/resolve/main/fineweb-logo.png" alt="FineWeb: The finest collection of data the web has to offer">
415
  </center>
416
 
417
  > 15 trillion tokens of the finest data the 🌐 web has to offer
418
 
419
  # Table of Contents
420
- - [🍷 FineWeb](#-fineweb)
421
  * [What is it?](#what-is-it)
422
  * [What is being released?](#what-is-being-released)
423
  * [Changelog](#changelog)
424
- * [How to download and use 🍷 FineWeb](#how-to-download-and-use-🍷-fineweb)
425
  + [Using 🏭 `datatrove`](#using-datatrove)
426
  + [Using `huggingface_hub`](#using-huggingface_hub)
427
  + [Using `datasets`](#using-datasets)
@@ -430,7 +35,7 @@ configs:
430
  + [Hyper-parameters for ablation models](#hyper-parameters-for-ablation-models)
431
  + [Ablation evaluation benchmarks](#ablation-evaluation-benchmarks)
432
  + [Comparison with other datasets](#comparison-with-other-datasets)
433
- - [Dataset card for 🍷 FineWeb](#dataset-card-for-🍷-fineweb)
434
  * [Dataset Summary](#dataset-summary)
435
  * [Dataset Structure](#dataset-structure)
436
  + [Data Instances](#data-instances)
@@ -453,17 +58,17 @@ configs:
453
 
454
  ## What is it?
455
 
456
- The 🍷 FineWeb dataset consists of more than **15T tokens** of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and was run using the 🏭 [`datatrove`](https://github.com/huggingface/datatrove/) library, our large-scale data processing library.
457
 
458
- 🍷 FineWeb was originally meant to be a fully open replication of 🦅 [RefinedWeb](https://huggingface.co/papers/2306.01116), with a release of the **full dataset** under the **ODC-By 1.0 license**. However, by carefully adding additional filtering steps, we managed to push the performance of 🍷 FineWeb well above that of the original 🦅 RefinedWeb, and models trained on our dataset also outperform models trained on other commonly used high-quality web datasets (like C4, Dolma-v1.6, The Pile, SlimPajama, RedPajama2) on our aggregate group of [benchmark tasks](https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/lighteval_tasks.py).
459
 
460
- That said, we think there is still room for additional filtering and improvement and intend to continue exploring how to improve the dataset quality in coming versions of 🍷 FineWeb.
461
 
462
  ## What is being released?
463
 
464
- Along with the dataset, which includes all CommonCrawl dumps since 2013, we also share all the code needed to fully reproduce our processing setup using the 🏭 [`datatrove`](https://github.com/huggingface/datatrove/) library [here](https://github.com/huggingface/datatrove/blob/main/examples/fineweb.py). To enable full replication of our results, we have also published the small ablation models we have trained using [`nanotron`](https://github.com/huggingface/nanotron/) to validate the dataset and compare it with other reference datasets. You will find them [here](https://huggingface.co/collections/HuggingFaceFW/ablation-models-662457b0d213e8c14fe47f32), with checkpoints every 1000 steps. We have also published our evaluation results [here](https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/eval_results.csv). Our evaluation setup is available [here](https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/lighteval_tasks.py).
465
 
466
- You will find details on the different processing decisions we took and some interesting explorations of deduplication methods on our [blogpost](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1).
467
 
468
  ## Changelog
469
  _Previous versions remain available in the branch `version name`._
@@ -471,7 +76,7 @@ _Previous versions remain available in the branch `version name`._
471
 - **v1.1.0 (31-05-2024):** We reprocessed and reuploaded 11 dumps, `CC-MAIN-2021-49` to `CC-MAIN-2023-40`, as we found a bug in their deduplication. We also added the most recent dump: `CC-MAIN-2024-18`, crawled over April 2024. Expect a small performance improvement.
472
  - **v1.0.0 (21-04-2024):** Initial version
473
 
474
- ## How to download and use 🍷 FineWeb
475
 
476
  You can load the full dataset or a specific crawl/dump (see table below). Dumps have the format `CC-MAIN-(year)-(week number)`.
477
 
@@ -489,9 +94,9 @@ Along with config `default` (all the data), and the configs for each individual
489
  from datatrove.pipeline.readers import ParquetReader
490
 
491
  # limit determines how many documents will be streamed (remove for all)
492
- # to fetch a specific dump: hf://datasets/HuggingFaceFW/fineweb/data/CC-MAIN-2024-10
493
  # replace "data" with "sample/100BT" to use the 100BT sample
494
- data_reader = ParquetReader("hf://datasets/HuggingFaceFW/fineweb/data", limit=1000)
495
  for document in data_reader():
496
  # do something with document
497
  print(document)
@@ -508,7 +113,7 @@ from datatrove.pipeline.writers import JsonlWriter
508
  pipeline_exec = LocalPipelineExecutor(
509
  pipeline=[
510
  # replace "data/CC-MAIN-2024-10" with "sample/100BT" to use the 100BT sample
511
- ParquetReader("hf://datasets/HuggingFaceFW/fineweb/data/CC-MAIN-2024-10", limit=1000),
512
  LambdaFilter(lambda doc: "hugging" in doc.text),
513
  JsonlWriter("some-output-path")
514
  ],
@@ -522,9 +127,9 @@ pipeline_exec.run()
522
  ```python
523
  from huggingface_hub import snapshot_download
524
  folder = snapshot_download(
525
- "HuggingFaceFW/fineweb",
526
  repo_type="dataset",
527
- local_dir="./fineweb/",
528
  # replace "data/CC-MAIN-2023-50/*" with "sample/100BT/*" to use the 100BT sample
529
  allow_patterns="data/CC-MAIN-2023-50/*")
530
  ```
@@ -536,115 +141,12 @@ For faster downloads, make sure to install `pip install huggingface_hub[hf_trans
536
  ```python
537
  from datasets import load_dataset
538
  # use name="sample-10BT" to use the 10BT sample
539
- fw = load_dataset("HuggingFaceFW/fineweb", name="CC-MAIN-2024-10", split="train", streaming=True)
540
  ```
541
 
542
- ## Breakdown by dump/crawl
543
-
544
- | Dump | Time period | Disk size (GB) | gpt2 tokens (billions) |
545
- | --- | --- | --- | --- |
546
- | CC-MAIN-2024-18 | April 2024 | 417.6 | 154.4 |
547
- | CC-MAIN-2024-10 | February/March 2024 | 432.0 | 157.2 |
548
- | CC-MAIN-2023-50 | November/December 2023 | 650.0 | 239.7 |
549
- | CC-MAIN-2023-40 | September/October 2023 | 668.7 | 252.0 |
550
- | CC-MAIN-2023-23 | May/June 2023 | 654.4 | 249.2 |
551
- | CC-MAIN-2023-14 | March/April 2023 | 621.3 | 236.5 |
552
- | CC-MAIN-2023-06 | January/February 2023 | 621.9 | 233.9 |
553
- | CC-MAIN-2022-49 | November/December 2022 | 631.2 | 237.5 |
554
- | CC-MAIN-2022-40 | September/October 2022 | 606.4 | 228.7 |
555
- | CC-MAIN-2022-33 | August 2022 | 434.6 | 163.5 |
556
- | CC-MAIN-2022-27 | June/July 2022 | 574.9 | 216.1 |
557
- | CC-MAIN-2022-21 | May 2022 | 646.4 | 242.7 |
558
- | CC-MAIN-2022-05 | January 2022 | 520.1 | 195.4 |
559
- | CC-MAIN-2021-49 | November/December 2021 | 413.7 | 155.5 |
560
- | CC-MAIN-2021-43 | October 2021 | 601.5 | 221.0 |
562
- | CC-MAIN-2021-39 | September 2021 | 518.9 | 190.6 |
563
- | CC-MAIN-2021-31 | July/August 2021 | 593.9 | 217.7 |
564
- | CC-MAIN-2021-25 | June 2021 | 424.4 | 155.7 |
565
- | CC-MAIN-2021-21 | May 2021 | 455.9 | 167.4 |
566
- | CC-MAIN-2021-17 | April 2021 | 556.0 | 204.1 |
567
- | CC-MAIN-2021-10 | February/March 2021 | 463.2 | 169.6 |
568
- | CC-MAIN-2021-04 | January 2021 | 562.4 | 205.4 |
569
- | CC-MAIN-2020-50 | November/December 2020 | 422.8 | 154.3 |
570
- | CC-MAIN-2020-45 | October 2020 | 426.9 | 155.8 |
571
- | CC-MAIN-2020-40 | September 2020 | 555.5 | 202.4 |
572
- | CC-MAIN-2020-34 | August 2020 | 379.6 | 138.7 |
573
- | CC-MAIN-2020-29 | July 2020 | 489.6 | 178.7 |
574
- | CC-MAIN-2020-24 | May/June 2020 | 398.7 | 145.1 |
575
- | CC-MAIN-2020-16 | March/April 2020 | 454.0 | 165.6 |
576
- | CC-MAIN-2020-10 | February 2020 | 369.6 | 134.7 |
577
- | CC-MAIN-2020-05 | January 2020 | 483.3 | 176.4 |
578
- | CC-MAIN-2019-51 | December 2019 | 359.3 | 130.9 |
579
- | CC-MAIN-2019-47 | November 2019 | 395.4 | 144.0 |
580
- | CC-MAIN-2019-43 | October 2019 | 422.3 | 153.9 |
581
- | CC-MAIN-2019-39 | September 2019 | 394.4 | 143.7 |
582
- | CC-MAIN-2019-35 | August 2019 | 454.2 | 165.4 |
583
- | CC-MAIN-2019-30 | July 2019 | 416.6 | 151.5 |
584
- | CC-MAIN-2019-26 | June 2019 | 412.9 | 150.1 |
585
- | CC-MAIN-2019-22 | May 2019 | 432.8 | 157.4 |
586
- | CC-MAIN-2019-18 | April 2019 | 426.7 | 155.3 |
587
- | CC-MAIN-2019-13 | March 2019 | 417.8 | 152.1 |
588
- | CC-MAIN-2019-09 | February 2019 | 467.2 | 169.9 |
589
- | CC-MAIN-2019-04 | January 2019 | 438.1 | 158.7 |
590
- | CC-MAIN-2018-51 | December 2018 | 498.6 | 180.8 |
591
- | CC-MAIN-2018-47 | November 2018 | 437.7 | 158.9 |
592
- | CC-MAIN-2018-43 | October 2018 | 468.8 | 169.9 |
593
- | CC-MAIN-2018-39 | September 2018 | 429.2 | 155.2 |
594
- | CC-MAIN-2018-34 | August 2018 | 408.2 | 148.0 |
595
- | CC-MAIN-2018-30 | July 2018 | 501.5 | 181.4 |
596
- | CC-MAIN-2018-26 | June 2018 | 467.5 | 170.0 |
597
- | CC-MAIN-2018-22 | May 2018 | 398.6 | 144.2 |
598
- | CC-MAIN-2018-17 | April 2018 | 435.1 | 158.1 |
599
- | CC-MAIN-2018-13 | March 2018 | 471.5 | 171.5 |
600
- | CC-MAIN-2018-09 | February 2018 | 490.2 | 178.0 |
601
- | CC-MAIN-2018-05 | January 2018 | 493.5 | 180.7 |
602
- | CC-MAIN-2017-51 | December 2017 | 442.6 | 161.5 |
603
- | CC-MAIN-2017-47 | November 2017 | 457.9 | 167.1 |
604
- | CC-MAIN-2017-43 | October 2017 | 535.6 | 194.9 |
605
- | CC-MAIN-2017-39 | September 2017 | 444.5 | 162.3 |
606
- | CC-MAIN-2017-34 | August 2017 | 503.2 | 183.4 |
607
- | CC-MAIN-2017-30 | July 2017 | 439.2 | 161.2 |
608
- | CC-MAIN-2017-26 | June 2017 | 491.5 | 179.8 |
609
- | CC-MAIN-2017-22 | May 2017 | 441.0 | 161.5 |
610
- | CC-MAIN-2017-17 | April 2017 | 596.8 | 218.6 |
611
- | CC-MAIN-2017-13 | March 2017 | 579.8 | 212.1 |
612
- | CC-MAIN-2017-09 | February 2017 | 492.2 | 180.2 |
613
- | CC-MAIN-2017-04 | January 2017 | 474.3 | 174.4 |
614
- | CC-MAIN-2016-50 | December 2016 | 448.9 | 165.4 |
615
- | CC-MAIN-2016-44 | October 2016 | 467.8 | 172.0 |
616
- | CC-MAIN-2016-40 | September 2016 | 386.1 | 142.8 |
617
- | CC-MAIN-2016-36 | August 2016 | 339.6 | 126.3 |
618
- | CC-MAIN-2016-30 | July 2016 | 346.0 | 128.4 |
619
- | CC-MAIN-2016-26 | June 2016 | 256.5 | 95.5 |
620
- | CC-MAIN-2016-22 | May 2016 | 310.9 | 115.4 |
621
- | CC-MAIN-2016-18 | April 2016 | 298.1 | 110.8 |
622
- | CC-MAIN-2016-07 | February 2016 | 342.7 | 127.2 |
623
- | CC-MAIN-2015-48 | November 2015 | 353.9 | 131.3 |
624
- | CC-MAIN-2015-40 | September 2015 | 284.0 | 105.5 |
625
- | CC-MAIN-2015-35 | August 2015 | 359.4 | 133.2 |
626
- | CC-MAIN-2015-32 | July 2015 | 352.4 | 130.1 |
627
- | CC-MAIN-2015-27 | June 2015 | 335.5 | 124.0 |
628
- | CC-MAIN-2015-22 | May 2015 | 380.2 | 140.4 |
629
- | CC-MAIN-2015-18 | April 2015 | 389.0 | 143.8 |
630
- | CC-MAIN-2015-14 | March 2015 | 337.5 | 124.5 |
631
- | CC-MAIN-2015-11 | February 2015 | 361.4 | 133.3 |
632
- | CC-MAIN-2015-06 | January 2015 | 356.1 | 131.3 |
633
- | CC-MAIN-2014-52 | December 2014 | 388.5 | 143.3 |
634
- | CC-MAIN-2014-49 | November 2014 | 319.9 | 117.7 |
635
- | CC-MAIN-2014-42 | October 2014 | 371.1 | 136.4 |
636
- | CC-MAIN-2014-41 | September 2014 | 408.1 | 150.2 |
637
- | CC-MAIN-2014-35 | August 2014 | 395.7 | 145.6 |
638
- | CC-MAIN-2014-23 | July 2014 | 425.0 | 156.5 |
639
- | CC-MAIN-2014-15 | April 2014 | 369.1 | 135.7 |
640
- | CC-MAIN-2014-10 | March 2014 | 396.2 | 146.2 |
641
- | CC-MAIN-2013-48 | Winter 2013 | 396.8 | 145.9 |
642
- | CC-MAIN-2013-20 | Summer 2013 | 393.9 | 144.5 |
643
- | Total | | 43056.6 | 15835.2 |
644
-
645
  ## Dataset performance evaluation and ablations
646
 
647
- We conducted our dataset performance ablations and evaluations by training a series of 1.8B parameter models on 27 billion tokens. To compare 🍷 FineWeb with other datasets, we also trained one of these 1.8B models per target dataset, on 350 billion tokens sampled from it (or the entire dataset when its size was < 350 billion tokens).
648
 
649
  ### Hyper-parameters for ablation models
650
 
@@ -672,11 +174,11 @@ We used the following list of benchmark for our ablation runs:
672
 
673
  To compare runs we consider an aggregate score, the average of the scores for these tasks.
674
 
675
- The prompts for all these benchmarks are formatted in order to compute and compare the log-likelihood of the full answers for each multiple choice question. All the implementation details for the benchmarks are available in `lighteval` [here](https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/lighteval_tasks.py).
676
 
677
  ### Comparison with other datasets
678
 
679
- We compared 🍷 FineWeb with the following datasets:
680
 
681
  - [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb)
682
  - [C4](https://huggingface.co/datasets/allenai/c4)
@@ -685,25 +187,25 @@ We compared 🍷 FineWeb with the following datasets:
685
  - [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B)
686
  - [RedPajama2](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2) (deduplicated)
687
 
688
- You will find these models on [this collection](https://huggingface.co/collections/HuggingFaceFW/ablation-models-662457b0d213e8c14fe47f32). We have uploaded checkpoints at every 1000 training steps. You will also find our full [evaluation results here](https://huggingface.co/datasets/HuggingFaceFW/fineweb/blob/main/eval_results.csv).
689
 
690
  <center>
691
- <img src="https://huggingface.co/datasets/HuggingFaceFW/admin/resolve/main/fineweb-ablations.png" alt="ablations">
692
  </center>
693
 
694
  _Note:_ The plot is smoothed by averaging 5k steps in a rolling window.
695
 
696
- # Dataset card for 🍷 FineWeb
697
 
698
  ## Dataset Description
699
 
700
- - **Homepage and Repository:** [https://huggingface.co/datasets/HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb)
701
  - **Point of Contact:** please create a discussion on the Community tab
702
  - **License:** Open Data Commons Attribution License (ODC-By) v1.0
703
 
704
  ### Dataset Summary
705
 
706
- This dataset was created by processing 96 [CommonCrawl](https://commoncrawl.org/) dumps comprising web data crawled from the summer of 2013 to April 2024. 🍷 FineWeb includes a variety of domains and topics in English and is primarily intended to be used as a research artifact on public data in the context of pretraining datasets for large language models. The CommonCrawl data was carefully processed, filtered and deduplicated with the 🏭 [`datatrove`](https://github.com/huggingface/datatrove/) library, resulting in the largest publicly available clean LLM pretraining dataset, counting around 15 trillion tokens (gpt2 tokenizer).
707
 
708
  ## Dataset Structure
709
 
@@ -746,7 +248,7 @@ From experiments we have run, not all dumps give the same performance. For relat
746
 
747
  ### Curation Rationale
748
 
749
- While multiple open-weights models have regularly been released in recent months, these releases often do not include the model's training data. With 🍷 FineWeb we aim to provide the open source community with a very large clean pretraining dataset that can be used to push the envelope on truly open source models (open source models where data is also released).
750
 
751
  ### Source Data
752
 
@@ -754,12 +256,12 @@ The source data consists of webpages crawled by the CommonCrawl foundation over
754
 
755
  We then extracted the main page text from the html of each webpage, carefully filtered each sample and deduplicated each individual CommonCrawl dump/crawl.
756
 
757
- While we originally intended to deduplicate the dataset as a whole, our ablations showed that training on a sampling of individually deduplicated dumps/crawls outperformed training on a sampling of all the dumps/crawls deduplicated together. You will find more details on our [blogpost](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1).
758
 
759
  ### Data processing steps
760
 
761
  We used the 🏭 `datatrove` library to process the data.
762
- You can find a **working script** that launches the [entire processing pipeline here](https://github.com/huggingface/datatrove/blob/main/examples/fineweb.py).
763
 
764
  The data processing pipeline consists of:
765
 
@@ -769,7 +271,7 @@ The data processing pipeline consists of:
769
  4. Quality filtering
770
  1. [Gopher Repetition /](https://github.com/huggingface/datatrove/blob/9a88bebc86a554f8521faa70b12ad4fa0c227537/src/datatrove/pipeline/filters/gopher_repetition_filter.py) [Quality](https://github.com/huggingface/datatrove/blob/9a88bebc86a554f8521faa70b12ad4fa0c227537/src/datatrove/pipeline/filters/gopher_quality_filter.py)
771
  2. [C4 Quality filters](https://github.com/huggingface/datatrove/blob/9a88bebc86a554f8521faa70b12ad4fa0c227537/src/datatrove/pipeline/filters/c4_quality_filter.py) except `terminal_punct` rule
772
- 3. [FineWeb custom filters](https://github.com/huggingface/datatrove/blob/05194d3960741e7d5c0bd0d6dd69d44514622549/src/datatrove/pipeline/filters/fineweb_quality_filter.py), consisting of heuristics for removing list-like documents, documents with repeated lines and documents with likely wrong line formatting.
773
  5. [MinHash deduplication](https://github.com/huggingface/datatrove/blob/6daa5e879e06b21e6886b37e2b1be4ae58a658b6/src/datatrove/pipeline/dedup/minhash.py) with each crawl deduplicated individually (5-grams, 14x8 hash functions)
774
  6. [PII Formatting](https://github.com/huggingface/datatrove/blob/main/src/datatrove/pipeline/formatters/pii.py) to anonymize email and public IP addresses
775
 
@@ -783,7 +285,7 @@ We anonymize email addresses and public IP addresses.
783
 
784
 For emails, we apply a regex pattern and replace any occurrence of an email address with either `email@example.com` or `firstname.lastname@example.com`. For IP addresses, we also employ a regex pattern and then further filter to only anonymize IP addresses [allocated for public networks](https://www.iana.org/assignments/iana-ipv4-special-registry/iana-ipv4-special-registry.xhtml). Matched IP addresses are then replaced with one of the following randomly generated IP addresses, which at the time of dataset creation were not responding to ping requests: `22.214.171.124`, `126.96.36.199`, `188.8.131.52`, `184.108.40.206`, `220.127.116.11`, and `18.104.22.168`. We decided against applying regex patterns for phone numbers due to the high false positive rate.
785
 
786
- Despite our efforts, given that 🍷 FineWeb is sourced from the internet at large, it is very likely that some personally identifiable information (PII) will be present. If you find your own PII in 🍷 FineWeb and would like it removed, please fill out our [PII removal form](https://forms.gle/VyNT3ZAUPZjPuWp39).
787
 
788
  ## Considerations for Using the Data
789
 
@@ -791,17 +293,17 @@ Despite our efforts, given that 🍷 FineWeb is sourced from the internet at lar
791
 
792
  With the release of this dataset we aim to make model training more accessible to the machine learning community at large.
793
 
794
- While multiple open-weight models with strong performance have been publicly released in the past, these releases are often not accompanied by the corresponding training dataset. This is unfortunate, as dataset characteristics have been shown to have a very large impact on model performance. Since creating a high-quality training dataset is a fundamental requirement for training an LLM that excels at downstream tasks, with 🍷 FineWeb we (a) make the dataset creation process more transparent by sharing our entire processing setup, including the codebase used, and (b) help alleviate the costs of dataset curation, in both time and compute, for model creators by publicly releasing our dataset to the community.
795
 
796
  ### Discussion of Biases
797
 
798
- Efforts were made to minimize the amount of NSFW and toxic content present in the dataset by employing filtering at the URL level. However, a significant number of documents in the final dataset could still be considered toxic or contain harmful content. As 🍷 FineWeb was sourced from the web as a whole, any harmful biases typically present there may be reproduced in our dataset.
799
 
800
  We deliberately avoided using machine learning filtering methods that define text quality based on the similarity to a β€œgold” source such as wikipedia or toxicity classifiers as these methods have been known to [disproportionately remove content in specific dialects](https://aclanthology.org/D16-1120/) and [overclassify as toxic text related to specific social identities](https://arxiv.org/pdf/2109.07445.pdf), respectively.
801
 
802
  ### Other Known Limitations
803
 
804
- As a consequence of some of the filtering steps applied, it is likely that code content is not prevalent in our dataset. If you are training a model that should also perform code tasks, we recommend you use 🍷 FineWeb together with a code dataset, such as [The Stack v2](https://huggingface.co/datasets/bigcode/the-stack-v2). You should also probably consider complementing 🍷 FineWeb with specialized curated sources (such as Wikipedia), as they will likely have better formatting than the Wikipedia content included in 🍷 FineWeb (we did not tailor the processing to individual websites).
805
 
806
  ## Additional Information
807
 
@@ -811,18 +313,6 @@ The dataset is released under the **Open Data Commons Attribution License (ODC-B
811
 
812
  ### Future work
813
 
814
- We plan to not only continue but also expand our efforts to create open-source high quality training datasets and to improve 🍷 FineWeb itself in future iterations.
815
 
816
- ## Citation Information
817
- Paper on [arXiv](https://arxiv.org/abs/2406.17557)
818
- ```
819
- @misc{penedo2024finewebdatasetsdecantingweb,
820
- title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
821
- author={Guilherme Penedo and Hynek Kydlíček and Loubna Ben allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
822
- year={2024},
823
- eprint={2406.17557},
824
- archivePrefix={arXiv},
825
- primaryClass={cs.CL}
826
- url={https://arxiv.org/abs/2406.17557},
827
- }
828
  ```
 
4
  - text-generation
5
  language:
6
  - en
7
+ pretty_name: Spidder
8
  size_categories:
9
  - n>1T
10
  configs:
 
13
  - split: train
14
  path: data/*/*
15
  - config_name: sample-10BT
16
  ---
17
+ # 🍷 Spidder
18
  <center>
19
+ <img src="https://huggingface.co/datasets/cvedb/admin/resolve/main/spidder-logo.png" alt="Spidder: The finest collection of data the web has to offer">
20
  </center>
21
 
22
  > 15 trillion tokens of the finest data the 🌐 web has to offer
23
 
24
  # Table of Contents
25
+ - [🍷 Spidder](#-spidder)
26
  * [What is it?](#what-is-it)
27
  * [What is being released?](#what-is-being-released)
28
  * [Changelog](#changelog)
29
+ * [How to download and use 🍷 Spidder](#how-to-download-and-use-🍷-spidder)
30
  + [Using 🏭 `datatrove`](#using-datatrove)
31
  + [Using `huggingface_hub`](#using-huggingface_hub)
32
  + [Using `datasets`](#using-datasets)
 
35
  + [Hyper-parameters for ablation models](#hyper-parameters-for-ablation-models)
36
  + [Ablation evaluation benchmarks](#ablation-evaluation-benchmarks)
37
  + [Comparison with other datasets](#comparison-with-other-datasets)
38
+ - [Dataset card for 🍷 Spidder](#dataset-card-for-🍷-spidder)
39
  * [Dataset Summary](#dataset-summary)
40
  * [Dataset Structure](#dataset-structure)
41
  + [Data Instances](#data-instances)
 
58
 
59
  ## What is it?
60
 
61
+ The 🍷 Spidder dataset consists of more than **15T tokens** of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and was run using the 🏭 [`datatrove`](https://github.com/huggingface/datatrove/) library, our large-scale data processing library.
62
 
63
+ 🍷 Spidder was originally meant to be a fully open replication of 🦅 [RefinedWeb](https://huggingface.co/papers/2306.01116), with a release of the **full dataset** under the **ODC-By 1.0 license**. However, by carefully adding additional filtering steps, we managed to push the performance of 🍷 Spidder well above that of the original 🦅 RefinedWeb, and models trained on our dataset also outperform models trained on other commonly used high-quality web datasets (like C4, Dolma-v1.6, The Pile, SlimPajama, RedPajama2) on our aggregate group of [benchmark tasks](https://huggingface.co/datasets/cvedb/spidder/blob/main/lighteval_tasks.py).
64
 
65
+ That said, we think there is still room for additional filtering and improvement and intend to continue exploring how to improve the dataset quality in coming versions of 🍷 Spidder.
66
 
67
  ## What is being released?
68
 
69
+ Along with the dataset, which includes all CommonCrawl dumps since 2013, we also share all the code needed to fully reproduce our processing setup using the 🏭 [`datatrove`](https://github.com/huggingface/datatrove/) library [here](https://github.com/huggingface/datatrove/blob/main/examples/spidder.py). To enable full replication of our results, we have also published the small ablation models we have trained using [`nanotron`](https://github.com/huggingface/nanotron/) to validate the dataset and compare it with other reference datasets. You will find them [here](https://huggingface.co/collections/cvedb/ablation-models-662457b0d213e8c14fe47f32), with checkpoints every 1000 steps. We have also published our evaluation results [here](https://huggingface.co/datasets/cvedb/spidder/blob/main/eval_results.csv). Our evaluation setup is available [here](https://huggingface.co/datasets/cvedb/spidder/blob/main/lighteval_tasks.py).
70
 
71
+ You will find details on the different processing decisions we took and some interesting explorations of deduplication methods on our [blogpost](https://huggingface.co/spaces/cvedb/blogpost-spidder-v1).
72
 
73
  ## Changelog
74
  _Previous versions remain available in the branch `version name`._
 
76
 - **v1.1.0 (31-05-2024):** We reprocessed and reuploaded 11 dumps, `CC-MAIN-2021-49` to `CC-MAIN-2023-40`, as we found a bug in their deduplication. We also added the most recent dump: `CC-MAIN-2024-18`, crawled over April 2024. Expect a small performance improvement.
77
  - **v1.0.0 (21-04-2024):** Initial version
78
 
79
+ ## How to download and use 🍷 Spidder
80
 
81
  You can load the full dataset or a specific crawl/dump (see table below). Dumps have the format `CC-MAIN-(year)-(week number)`.
82
 
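If you want to enumerate the available crawls programmatically rather than reading them off this card, a minimal sketch using the `datasets` helper `get_dataset_config_names` (the repo id is the one used in the snippets below) could look like this:

```python
from datasets import get_dataset_config_names

# List every config exposed by the repo: the individual CC-MAIN-(year)-(week)
# dumps plus the sample-10BT/100BT/350BT subsets and the "default" config.
configs = get_dataset_config_names("cvedb/spidder")
dumps = sorted(c for c in configs if c.startswith("CC-MAIN-"))
print(f"{len(dumps)} dumps available, latest: {dumps[-1]}")
```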
 
94
  from datatrove.pipeline.readers import ParquetReader
95
 
96
  # limit determines how many documents will be streamed (remove for all)
97
+ # to fetch a specific dump: hf://datasets/cvedb/spidder/data/CC-MAIN-2024-10
98
  # replace "data" with "sample/100BT" to use the 100BT sample
99
+ data_reader = ParquetReader("hf://datasets/cvedb/spidder/data", limit=1000)
100
  for document in data_reader():
101
  # do something with document
102
  print(document)
 
113
  pipeline_exec = LocalPipelineExecutor(
114
  pipeline=[
115
  # replace "data/CC-MAIN-2024-10" with "sample/100BT" to use the 100BT sample
116
+ ParquetReader("hf://datasets/cvedb/spidder/data/CC-MAIN-2024-10", limit=1000),
117
  LambdaFilter(lambda doc: "hugging" in doc.text),
118
  JsonlWriter("some-output-path")
119
  ],
 
127
  ```python
128
  from huggingface_hub import snapshot_download
129
  folder = snapshot_download(
130
+ "cvedb/spidder",
131
  repo_type="dataset",
132
+ local_dir="./spidder/",
133
  # replace "data/CC-MAIN-2023-50/*" with "sample/100BT/*" to use the 100BT sample
134
  allow_patterns="data/CC-MAIN-2023-50/*")
135
  ```
 
141
  ```python
142
  from datasets import load_dataset
143
  # use name="sample-10BT" to use the 10BT sample
144
+ fw = load_dataset("cvedb/spidder", name="CC-MAIN-2024-10", split="train", streaming=True)
145
  ```
146
 
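Since the dataset is loaded in streaming mode, records can be inspected lazily without downloading the whole dump; a quick sketch (the `text` and `url` field names are assumptions here, see the Data Instances section for the exact schema):

```python
from itertools import islice

# Peek at the first few streamed documents.
for doc in islice(fw, 3):
    print(doc["url"], "->", doc["text"][:120].replace("\n", " "), "...")
```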
147
  ## Dataset performance evaluation and ablations
148
 
149
+ We conducted our dataset performance ablations and evaluations by training a series of 1.8B parameter models on 27 billion tokens. To compare 🍷 Spidder with other datasets, we also trained one of these 1.8B models per target dataset, on 350 billion tokens sampled from it (or the entire dataset when its size was < 350 billion tokens).
150
 
151
  ### Hyper-parameters for ablation models
152
 
 
174
 
175
  To compare runs we consider an aggregate score, the average of the scores for these tasks.
176
 
177
+ The prompts for all these benchmarks are formatted in order to compute and compare the log-likelihood of the full answers for each multiple choice question. All the implementation details for the benchmarks are available in `lighteval` [here](https://huggingface.co/datasets/cvedb/spidder/blob/main/lighteval_tasks.py).
178
 
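To make that protocol concrete, here is a rough, self-contained sketch of log-likelihood multiple-choice scoring with `transformers`. It is an illustration only, not the `lighteval` implementation: the model name and the example question are placeholders, and it glosses over the prompt/answer tokenization-boundary handling that a real harness takes care of.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder causal LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def answer_loglikelihood(prompt: str, answer: str) -> float:
    """Sum of log-probs of the answer tokens, conditioned on the prompt."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)  # predictions for tokens 1..T-1
    targets = full_ids[:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # keep only the tokens belonging to the answer (everything after the prompt);
    # note: BPE can merge across the prompt/answer boundary, which we ignore here
    return token_lp[:, prompt_len - 1:].sum().item()

prompt = "Question: What is the capital of France?\nAnswer: "
choices = ["Paris", "London", "Berlin"]
scores = {c: answer_loglikelihood(prompt, c) for c in choices}
print(max(scores, key=scores.get))
```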
179
  ### Comparison with other datasets
180
 
181
+ We compared 🍷 Spidder with the following datasets:
182
 
183
  - [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb)
184
  - [C4](https://huggingface.co/datasets/allenai/c4)
 
187
  - [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B)
188
  - [RedPajama2](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2) (deduplicated)
189
 
190
+ You will find these models on [this collection](https://huggingface.co/collections/cvedb/ablation-models-662457b0d213e8c14fe47f32). We have uploaded checkpoints at every 1000 training steps. You will also find our full [evaluation results here](https://huggingface.co/datasets/cvedb/spidder/blob/main/eval_results.csv).
191
 
192
  <center>
193
+ <img src="https://huggingface.co/datasets/cvedb/admin/resolve/main/spidder-ablations.png" alt="ablations">
194
  </center>
195
 
196
  _Note:_ The plot is smoothed by averaging 5k steps in a rolling window.
197
 
198
+ # Dataset card for 🍷 Spidder
199
 
200
  ## Dataset Description
201
 
202
+ - **Homepage and Repository:** [https://huggingface.co/datasets/cvedb/spidder](https://huggingface.co/datasets/cvedb/spidder)
203
  - **Point of Contact:** please create a discussion on the Community tab
204
  - **License:** Open Data Commons Attribution License (ODC-By) v1.0
205
 
206
  ### Dataset Summary
207
 
208
+ This dataset was created by processing 96 [CommonCrawl](https://commoncrawl.org/) dumps comprising web data crawled from the summer of 2013 to April 2024. 🍷 Spidder includes a variety of domains and topics in English and is primarily intended to be used as a research artifact on public data in the context of pretraining datasets for large language models. The CommonCrawl data was carefully processed, filtered and deduplicated with the 🏭 [`datatrove`](https://github.com/huggingface/datatrove/) library, resulting in the largest publicly available clean LLM pretraining dataset, counting around 15 trillion tokens (gpt2 tokenizer).
209
 
210
  ## Dataset Structure
211
 
 
248
 
249
  ### Curation Rationale
250
 
251
+ While multiple open-weights models have regularly been released in recent months, these releases often do not include the model's training data. With 🍷 Spidder we aim to provide the open source community with a very large clean pretraining dataset that can be used to push the envelope on truly open source models (open source models where data is also released).
252
 
253
  ### Source Data
254
 
 
256
 
257
  We then extracted the main page text from the html of each webpage, carefully filtered each sample and deduplicated each individual CommonCrawl dump/crawl.
258
 
259
+ While we originally intended to deduplicate the dataset as a whole, our ablations showed that training on a sampling of individually deduplicated dumps/crawls outperformed training on a sampling of all the dumps/crawls deduplicated together. You will find more details on our [blogpost](https://huggingface.co/spaces/cvedb/blogpost-spidder-v1).
260
 
261
  ### Data processing steps
262
 
263
  We used the 🏭 `datatrove` library to process the data.
264
+ You can find a **working script** that launches the [entire processing pipeline here](https://github.com/huggingface/datatrove/blob/main/examples/spidder.py).
265
 
266
  The data processing pipeline consists of:
267
 
 
271
  4. Quality filtering
272
  1. [Gopher Repetition /](https://github.com/huggingface/datatrove/blob/9a88bebc86a554f8521faa70b12ad4fa0c227537/src/datatrove/pipeline/filters/gopher_repetition_filter.py) [Quality](https://github.com/huggingface/datatrove/blob/9a88bebc86a554f8521faa70b12ad4fa0c227537/src/datatrove/pipeline/filters/gopher_quality_filter.py)
273
  2. [C4 Quality filters](https://github.com/huggingface/datatrove/blob/9a88bebc86a554f8521faa70b12ad4fa0c227537/src/datatrove/pipeline/filters/c4_quality_filter.py) except `terminal_punct` rule
274
+ 3. [Spidder custom filters](https://github.com/huggingface/datatrove/blob/05194d3960741e7d5c0bd0d6dd69d44514622549/src/datatrove/pipeline/filters/spidder_quality_filter.py), consisting of heuristics for removing list-like documents, documents with repeated lines and documents with likely wrong line formatting.
275
  5. [MinHash deduplication](https://github.com/huggingface/datatrove/blob/6daa5e879e06b21e6886b37e2b1be4ae58a658b6/src/datatrove/pipeline/dedup/minhash.py) with each crawl deduplicated individually (5-grams, 14x8 hash functions); a toy sketch of this bucketing scheme follows this list
276
  6. [PII Formatting](https://github.com/huggingface/datatrove/blob/main/src/datatrove/pipeline/formatters/pii.py) to anonymize email and public IP addresses
277
 
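Step 5 above configures MinHash with 5-gram shingles and 14 buckets of 8 hash functions each; a toy, pure-Python sketch of that signature-and-bucket idea (didactic only, not datatrove's implementation) is:

```python
import hashlib
import random

N_BUCKETS, HASHES_PER_BUCKET, NGRAM = 14, 8, 5
rng = random.Random(0)
SEEDS = [rng.getrandbits(32) for _ in range(N_BUCKETS * HASHES_PER_BUCKET)]

def shingles(text: str) -> set[str]:
    words = text.lower().split()
    return {" ".join(words[i:i + NGRAM]) for i in range(len(words) - NGRAM + 1)}

def minhash_buckets(text: str) -> list[tuple[int, ...]]:
    grams = shingles(text)
    sig = [min(int.from_bytes(hashlib.sha1(f"{seed}:{g}".encode()).digest()[:8], "big")
               for g in grams)
           for seed in SEEDS]
    # group the 14x8 = 112 minima into 14 buckets of 8; two documents that share
    # all 8 values of any bucket are flagged as near-duplicates
    return [tuple(sig[b * HASHES_PER_BUCKET:(b + 1) * HASHES_PER_BUCKET])
            for b in range(N_BUCKETS)]

a = minhash_buckets("the quick brown fox jumps over the lazy dog near the river bank")
b = minhash_buckets("the quick brown fox jumps over the lazy dog near the river bend")
print(sum(x == y for x, y in zip(a, b)), "matching buckets out of", N_BUCKETS)
```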
 
285
 
286
 For emails, we apply a regex pattern and replace any occurrence of an email address with either `email@example.com` or `firstname.lastname@example.com`. For IP addresses, we also employ a regex pattern and then further filter to only anonymize IP addresses [allocated for public networks](https://www.iana.org/assignments/iana-ipv4-special-registry/iana-ipv4-special-registry.xhtml). Matched IP addresses are then replaced with one of the following randomly generated IP addresses, which at the time of dataset creation were not responding to ping requests: `22.214.171.124`, `126.96.36.199`, `188.8.131.52`, `184.108.40.206`, `220.127.116.11`, and `18.104.22.168`. We decided against applying regex patterns for phone numbers due to the high false positive rate.
287
 
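As a rough illustration of that formatting step (a simplified sketch, not the linked datatrove `pii.py` formatter; a real pass also checks that a matched address is actually a public IP before rewriting it):

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
# replacement IPs listed above (randomly generated, unresponsive at dataset creation time)
REPLACEMENT_IPS = ["22.214.171.124", "126.96.36.199", "188.8.131.52",
                   "184.108.40.206", "220.127.116.11", "18.104.22.168"]

def anonymize(text: str) -> str:
    text = EMAIL_RE.sub("email@example.com", text)
    # simplification: rewrites every IPv4-looking match instead of only public addresses
    return IPV4_RE.sub(lambda m: REPLACEMENT_IPS[hash(m.group(0)) % len(REPLACEMENT_IPS)], text)

print(anonymize("contact jane.doe@corp.org from 93.184.216.34"))
```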
288
+ Despite our efforts, given that 🍷 Spidder is sourced from the internet at large, it is very likely that some personally identifiable information (PII) will be present. If you find your own PII in 🍷 Spidder and would like it removed, please fill out our [PII removal form](https://forms.gle/VyNT3ZAUPZjPuWp39).
289
 
290
  ## Considerations for Using the Data
291
 
 
293
 
294
  With the release of this dataset we aim to make model training more accessible to the machine learning community at large.
295
 
296
+ While multiple open-weight models with strong performance have been publicly released in the past, these releases are often not accompanied by the corresponding training dataset. This is unfortunate, as dataset characteristics have been shown to have a very large impact on model performance. Since creating a high-quality training dataset is a fundamental requirement for training an LLM that excels at downstream tasks, with 🍷 Spidder we (a) make the dataset creation process more transparent by sharing our entire processing setup, including the codebase used, and (b) help alleviate the costs of dataset curation, in both time and compute, for model creators by publicly releasing our dataset to the community.
297
 
298
  ### Discussion of Biases
299
 
300
+ Efforts were made to minimize the amount of NSFW and toxic content present in the dataset by employing filtering at the URL level. However, a significant number of documents in the final dataset could still be considered toxic or contain harmful content. As 🍷 Spidder was sourced from the web as a whole, any harmful biases typically present there may be reproduced in our dataset.
301
 
302
  We deliberately avoided using machine learning filtering methods that define text quality based on the similarity to a β€œgold” source such as wikipedia or toxicity classifiers as these methods have been known to [disproportionately remove content in specific dialects](https://aclanthology.org/D16-1120/) and [overclassify as toxic text related to specific social identities](https://arxiv.org/pdf/2109.07445.pdf), respectively.
303
 
304
  ### Other Known Limitations
305
 
306
+ As a consequence of some of the filtering steps applied, it is likely that code content is not prevalent in our dataset. If you are training a model that should also perform code tasks, we recommend you use 🍷 Spidder together with a code dataset, such as [The Stack v2](https://huggingface.co/datasets/bigcode/the-stack-v2). You should also probably consider complementing 🍷 Spidder with specialized curated sources (such as Wikipedia), as they will likely have better formatting than the Wikipedia content included in 🍷 Spidder (we did not tailor the processing to individual websites).
307
 
308
  ## Additional Information
309
 
 
313
 
314
  ### Future work
315
 
316
+ We plan to not only continue but also expand our efforts to create open-source high quality training datasets and to improve 🍷 Spidder itself in future iterations.
317
 
318
  ```