---
datasets:
- PatrickHaller/dsir-pile-100M-words
---

# Description

This dataset is a sampled subset of the [Pile](https://huggingface.co/datasets/EleutherAI/pile) dataset. We used [DSIR](https://github.com/p-lambda/dsir), a data selection tool based on importance resampling, to draw the subsample.

The sample distribution over the Pile subsets is:

```json
{
  "Pile-CC": 198245,
  "OpenWebText2": 122382,
  "FreeLaw": 37517,
  "USPTO Backgrounds": 10195,
  "Wikipedia (en)": 8072,
  "PubMed Central": 5849,
  "PubMed Abstracts": 4965,
  "Gutenberg (PG-19)": 2712,
  "BookCorpus2": 2550,
  "Books3": 2432,
  "StackExchange": 1753,
  "PhilPapers": 1560,
  "YoutubeSubtitles": 1187,
  "OpenSubtitles": 1015,
  "ArXiv": 610,
  "NIH ExPorter": 476,
  "Enron Emails": 439,
  "EuroParl": 419,
  "Github": 390,
  "HackerNews": 259
}
```

The dataset contains ~100M words of text. This can be verified with:

```python
from datasets import load_dataset

ds = load_dataset("PatrickHaller/dsir-pile-100M-words")

count = 0
for row in ds["train"]:
    count += len(row["text"].split(" "))

print(count)
# Out: 99999861
```
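As a quick sanity check, the per-source counts listed above can be summed to obtain the total number of documents in the subset (this only uses the distribution shown in this card, not the dataset itself):

```python
# Per-source document counts, copied from the distribution above.
distribution = {
    "Pile-CC": 198245,
    "OpenWebText2": 122382,
    "FreeLaw": 37517,
    "USPTO Backgrounds": 10195,
    "Wikipedia (en)": 8072,
    "PubMed Central": 5849,
    "PubMed Abstracts": 4965,
    "Gutenberg (PG-19)": 2712,
    "BookCorpus2": 2550,
    "Books3": 2432,
    "StackExchange": 1753,
    "PhilPapers": 1560,
    "YoutubeSubtitles": 1187,
    "OpenSubtitles": 1015,
    "ArXiv": 610,
    "NIH ExPorter": 476,
    "Enron Emails": 439,
    "EuroParl": 419,
    "Github": 390,
    "HackerNews": 259,
}

# Total number of documents across all sources.
total_docs = sum(distribution.values())
print(total_docs)  # Out: 403027
```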