Spaces: awacke1/Bloom.Big.Science.Continual.Generator

awacke1 committed on Feb 19, 2023

Commit 7d02872

1 Parent(s): f0ea977

Update app.py

Files changed (1)

app.py +23 -2

@@ -39,9 +39,9 @@ French: Bonjour, je m'appelle Aaron. Je suis un scientifique en informatique et
 # Big Science on Papers with Code:
 https://paperswithcode.com/paper/bloom-a-176b-parameter-open-access
-# Outline of Exciting AI Developments! 🤖💻🔬
-Here is an outline of some of the most exciting recent developments in AI:
+# Exciting AI Developments! 🤖💻🔬
+## Here is an outline of exciting recent developments in AI:
 ## Language Models 🗣️
@@ -67,6 +67,27 @@ Here is an outline of some of the most exciting recent developments in AI:
 - Toronto Books Corpus
 - OpenWebText
+## ChatGPT Datasets - Details 📚
+- **WebText:** A dataset of web pages scraped from outbound links posted on Reddit that received at least 3 karma. This dataset was used to pretrain GPT-2.
+  - [Language Models are Unsupervised Multitask Learners](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) by Radford et al.
+- **Common Crawl:** A dataset of web pages from a variety of domains, which is updated regularly. This dataset was used to pretrain GPT-3.
+  - [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165) by Brown et al.
+- **BooksCorpus:** A dataset of over 11,000 books from a variety of genres.
+  - [Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books](https://arxiv.org/abs/1506.06724) by Zhu et al.
+- **English Wikipedia:** A dump of the English-language Wikipedia as of 2018, with articles from 2001-2017.
+  - [Improving Language Understanding by Generative Pre-Training](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf) by Radford et al.
+- **Toronto Books Corpus:** A dataset of over 7,000 books from a variety of genres, collected by the University of Toronto.
+  - [Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond](https://arxiv.org/abs/1812.10464) by Artetxe and Schwenk.
+- **OpenWebText:** An open-source reproduction of the WebText dataset, built from web pages linked on Reddit and then filtered and deduplicated. GPT-3 was pretrained on OpenAI's own expanded WebText2 corpus, not on OpenWebText itself.
+  - [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165) by Brown et al.
 ## Big Science Model 🚀
 - 📜 Papers: