Update README.md
README.md
CHANGED
@@ -14,47 +14,6 @@ configs:
 path: data/*/*
 - config_name: sample-10BT
 ---
-# π· Spidder
-<center>
-<img src="https://huggingface.co/datasets/cvedb/admin/resolve/main/spidder-logo.png" alt="Spidder: The finest collection of data the web has to offer">
-</center>
-
-> 15 trillion tokens of the finest data the π web has to offer
-
-# Table of Contents
-- [π· Spidder](#-spidder)
-* [What is it?](#what-is-it)
-* [What is being released?](#what-is-being-released)
-* [Changelog](#changelog)
-* [How to download and use π· Spidder](#how-to-download-and-use-π·-spidder)
-+ [Using π `datatrove`](#using-datatrove)
-+ [Using `huggingface_hub`](#using-huggingface_hub)
-+ [Using `datasets`](#using-datasets)
-* [Breakdown by dump/crawl](#breakdown-by-dumpcrawl)
-* [Dataset performance evaluation and ablations](#dataset-performance-evaluation-and-ablations)
-+ [Hyper-parameters for ablation models](#hyper-parameters-for-ablation-models)
-+ [Ablation evaluation benchmarks](#ablation-evaluation-benchmarks)
-+ [Comparison with other datasets](#comparison-with-other-datasets)
-- [Dataset card for π· Spidder](#dataset-card-for-π·-spidder)
-* [Dataset Summary](#dataset-summary)
-* [Dataset Structure](#dataset-structure)
-+ [Data Instances](#data-instances)
-+ [Data Fields](#data-fields)
-+ [Data Splits](#data-splits)
-* [Dataset Creation](#dataset-creation)
-+ [Curation Rationale](#curation-rationale)
-+ [Source Data](#source-data)
-+ [Data processing steps](#data-processing-steps)
-+ [Annotations](#annotations)
-+ [Personal and Sensitive Information](#personal-and-sensitive-information)
-* [Considerations for Using the Data](#considerations-for-using-the-data)
-+ [Social Impact of Dataset](#social-impact-of-dataset)
-+ [Discussion of Biases](#discussion-of-biases)
-+ [Other Known Limitations](#other-known-limitations)
-* [Additional Information](#additional-information)
-+ [Licensing Information](#licensing-information)
-+ [Future work](#future-work)
-+ [Citation Information](#citation-information)

 ## What is it?
