---
datasets:
- PatrickHaller/dsir-pile-100M-words
---

# Description

This dataset is a sampled subset of the [Pile](https://huggingface.co/datasets/EleutherAI/pile) dataset.
We used [DSIR](https://github.com/p-lambda/dsir), a data selection tool based on importance resampling, to subsample the Pile.
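The core idea behind importance resampling can be illustrated with a minimal, self-contained sketch (this illustrates the general technique only and is *not* the DSIR package's actual API; all names and the toy corpora below are made up): estimate token distributions for the raw and target corpora, weight each raw example by the likelihood ratio of its tokens under the two distributions, then resample raw examples in proportion to those weights.

```python
import random
from collections import Counter


def unigram_dist(texts, smoothing=1.0):
    """Add-one-smoothed unigram distribution over whitespace tokens.

    Returns the distribution plus the probability assigned to unseen tokens.
    """
    counts = Counter(tok for t in texts for tok in t.split())
    total = sum(counts.values()) + smoothing * len(counts)
    dist = {w: (c + smoothing) / total for w, c in counts.items()}
    return dist, smoothing / total


def importance_weight(text, p_target, unseen_t, p_raw, unseen_r):
    """Product of per-token likelihood ratios p_target(tok) / p_raw(tok)."""
    w = 1.0
    for tok in text.split():
        w *= p_target.get(tok, unseen_t) / p_raw.get(tok, unseen_r)
    return w


# Toy corpora: we want the subsample to look like the target.
raw = ["the cat sat", "stock prices fell", "the dog ran", "markets rallied today"]
target = ["the cat ran", "the dog sat"]

p_t, u_t = unigram_dist(target)
p_r, u_r = unigram_dist(raw)
weights = [importance_weight(t, p_t, u_t, p_r, u_r) for t in raw]

# Resample raw examples proportionally to their importance weights.
random.seed(0)
sampled = random.choices(raw, weights=weights, k=2)
print(sampled)
```

The pet-like sentences receive much higher weights than the finance-like ones, so the resampled set is biased toward the target distribution. (The real DSIR uses hashed n-gram features and samples without replacement, but the weighting principle is the same.)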

The subset sample distribution is:

```json
{
  "Pile-CC": 198245,
  "OpenWebText2": 122382,
  "FreeLaw": 37517,
  "USPTO Backgrounds": 10195,
  "Wikipedia (en)": 8072,
  "PubMed Central": 5849,
  "PubMed Abstracts": 4965,
  "Gutenberg (PG-19)": 2712,
  "BookCorpus2": 2550,
  "Books3": 2432,
  "StackExchange": 1753,
  "PhilPapers": 1560,
  "YoutubeSubtitles": 1187,
  "OpenSubtitles": 1015,
  "ArXiv": 610,
  "NIH ExPorter": 476,
  "Enron Emails": 439,
  "EuroParl": 419,
  "Github": 390,
  "HackerNews": 259
}
```
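As a quick sanity check, the per-subset counts can be totaled and turned into percentages with plain Python; the dictionary below simply repeats the distribution shown above:

```python
# Per-subset document counts, copied from the distribution above.
distribution = {
    "Pile-CC": 198245, "OpenWebText2": 122382, "FreeLaw": 37517,
    "USPTO Backgrounds": 10195, "Wikipedia (en)": 8072, "PubMed Central": 5849,
    "PubMed Abstracts": 4965, "Gutenberg (PG-19)": 2712, "BookCorpus2": 2550,
    "Books3": 2432, "StackExchange": 1753, "PhilPapers": 1560,
    "YoutubeSubtitles": 1187, "OpenSubtitles": 1015, "ArXiv": 610,
    "NIH ExPorter": 476, "Enron Emails": 439, "EuroParl": 419,
    "Github": 390, "HackerNews": 259,
}

total = sum(distribution.values())
print(total)  # → 403027 documents in the subset

# Share of each subset, largest first.
for name, n in sorted(distribution.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {n / total:.2%}")
```

Pile-CC and OpenWebText2 together account for roughly 80% of the documents.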

The dataset contains ~100M words of text. This can be checked with:

```python
from datasets import load_dataset

ds = load_dataset("PatrickHaller/dsir-pile-100M-words")

count = 0
for row in ds["train"]:
    count += len(row["text"].split(" "))

print(count)

# Out: 99999861
```