---
datasets:
- PatrickHaller/dsir-pile-100M-words
---

# Description

This dataset is a sampled subset of the [Pile](https://huggingface.co/datasets/EleutherAI/pile) dataset.
We used [DSIR](https://github.com/p-lambda/dsir), a data selection tool based on importance resampling, to subsample the Pile.
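The core idea behind importance resampling can be illustrated with a minimal, self-contained sketch (this illustrates the general technique only and is *not* the DSIR package's actual API; all names and the toy corpora below are made up): estimate token distributions for the raw and target corpora, weight each raw example by the likelihood ratio of its tokens under the two distributions, then resample raw examples in proportion to those weights.

```python
import random
from collections import Counter


def unigram_dist(texts, smoothing=1.0):
    """Add-one-smoothed unigram distribution over whitespace tokens.

    Returns the distribution plus the probability assigned to unseen tokens.
    """
    counts = Counter(tok for t in texts for tok in t.split())
    total = sum(counts.values()) + smoothing * len(counts)
    dist = {w: (c + smoothing) / total for w, c in counts.items()}
    return dist, smoothing / total


def importance_weight(text, p_target, unseen_t, p_raw, unseen_r):
    """Product of per-token likelihood ratios p_target(tok) / p_raw(tok)."""
    w = 1.0
    for tok in text.split():
        w *= p_target.get(tok, unseen_t) / p_raw.get(tok, unseen_r)
    return w


# Toy corpora: we want the subsample to look like the target.
raw = ["the cat sat", "stock prices fell", "the dog ran", "markets rallied today"]
target = ["the cat ran", "the dog sat"]

p_t, u_t = unigram_dist(target)
p_r, u_r = unigram_dist(raw)
weights = [importance_weight(t, p_t, u_t, p_r, u_r) for t in raw]

# Resample raw examples proportionally to their importance weights.
random.seed(0)
sampled = random.choices(raw, weights=weights, k=2)
print(sampled)
```

The pet-like sentences receive much higher weights than the finance-like ones, so the resampled set is biased toward the target distribution. (The real DSIR uses hashed n-gram features and samples without replacement, but the weighting principle is the same.)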

The subset sample distribution is:

```json
{
  "Pile-CC": 198245,
  "OpenWebText2": 122382,
  "FreeLaw": 37517,
  "USPTO Backgrounds": 10195,
  "Wikipedia (en)": 8072,
  "PubMed Central": 5849,
  "PubMed Abstracts": 4965,
  "Gutenberg (PG-19)": 2712,
  "BookCorpus2": 2550,
  "Books3": 2432,
  "StackExchange": 1753,
  "PhilPapers": 1560,
  "YoutubeSubtitles": 1187,
  "OpenSubtitles": 1015,
  "ArXiv": 610,
  "NIH ExPorter": 476,
  "Enron Emails": 439,
  "EuroParl": 419,
  "Github": 390,
  "HackerNews": 259
}
```
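As a quick sanity check, the per-subset counts can be totaled and turned into percentages with plain Python; the dictionary below simply repeats the distribution shown above:

```python
# Per-subset document counts, copied from the distribution above.
distribution = {
    "Pile-CC": 198245, "OpenWebText2": 122382, "FreeLaw": 37517,
    "USPTO Backgrounds": 10195, "Wikipedia (en)": 8072, "PubMed Central": 5849,
    "PubMed Abstracts": 4965, "Gutenberg (PG-19)": 2712, "BookCorpus2": 2550,
    "Books3": 2432, "StackExchange": 1753, "PhilPapers": 1560,
    "YoutubeSubtitles": 1187, "OpenSubtitles": 1015, "ArXiv": 610,
    "NIH ExPorter": 476, "Enron Emails": 439, "EuroParl": 419,
    "Github": 390, "HackerNews": 259,
}

total = sum(distribution.values())
print(total)  # → 403027 documents in the subset

# Share of each subset, largest first.
for name, n in sorted(distribution.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {n / total:.2%}")
```

Pile-CC and OpenWebText2 together account for roughly 80% of the documents.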

The dataset contains ~100M words of text. This can be checked with:

```python
from datasets import load_dataset

ds = load_dataset("PatrickHaller/dsir-pile-100M-words")

count = 0
for row in ds["train"]:
    count += len(row["text"].split(" "))

print(count)

# Out: 99999861
```