---
datasets:
- PatrickHaller/dsir-pile-100M-words
---

# Description

This dataset is a sampled subset of the [Pile](https://huggingface.co/datasets/EleutherAI/pile) dataset. We used [DSIR](https://github.com/p-lambda/dsir), a data selection tool based on importance resampling, to draw the subsample.

The sample distribution over the Pile subsets is:

```json
{
  "Pile-CC": 198245,
  "OpenWebText2": 122382,
  "FreeLaw": 37517,
  "USPTO Backgrounds": 10195,
  "Wikipedia (en)": 8072,
  "PubMed Central": 5849,
  "PubMed Abstracts": 4965,
  "Gutenberg (PG-19)": 2712,
  "BookCorpus2": 2550,
  "Books3": 2432,
  "StackExchange": 1753,
  "PhilPapers": 1560,
  "YoutubeSubtitles": 1187,
  "OpenSubtitles": 1015,
  "ArXiv": 610,
  "NIH ExPorter": 476,
  "Enron Emails": 439,
  "EuroParl": 419,
  "Github": 390,
  "HackerNews": 259
}
```

The dataset contains ~100M words of text. This can be verified with:

```python
from datasets import load_dataset

ds = load_dataset("PatrickHaller/dsir-pile-100M-words")

count = 0
for row in ds["train"]:
    count += len(row["text"].split(" "))

print(count)
# Out: 99999861
```
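As a quick sanity check, the per-source counts listed above can be summed to obtain the total number of documents in the subset (this only uses the distribution shown in this card, not the dataset itself):

```python
# Per-source document counts, copied from the distribution above.
distribution = {
    "Pile-CC": 198245,
    "OpenWebText2": 122382,
    "FreeLaw": 37517,
    "USPTO Backgrounds": 10195,
    "Wikipedia (en)": 8072,
    "PubMed Central": 5849,
    "PubMed Abstracts": 4965,
    "Gutenberg (PG-19)": 2712,
    "BookCorpus2": 2550,
    "Books3": 2432,
    "StackExchange": 1753,
    "PhilPapers": 1560,
    "YoutubeSubtitles": 1187,
    "OpenSubtitles": 1015,
    "ArXiv": 610,
    "NIH ExPorter": 476,
    "Enron Emails": 439,
    "EuroParl": 419,
    "Github": 390,
    "HackerNews": 259,
}

# Total number of documents across all sources.
total_docs = sum(distribution.values())
print(total_docs)  # Out: 403027
```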