Update app.py
app.py CHANGED

```diff
@@ -39,9 +39,9 @@ French: Bonjour, je m'appelle Aaron. Je suis un scientifique en informatique et
 # Big Science on Papers with Code:
 https://paperswithcode.com/paper/bloom-a-176b-parameter-open-access
 
-#
+# Exciting AI Developments! 🤖💻🔬
 
-Here is an outline of
+## Here is an outline of exciting recent developments in AI:
 
 ## Language Models 🗣️
 
@@ -67,6 +67,27 @@ Here is an outline of some of the most exciting recent developments in AI:
 - Toronto Books Corpus
 - OpenWebText
 
+## ChatGPT Datasets - Details 📚
+
+- **WebText:** A dataset of web pages crawled from domains on the Alexa top 5,000 list. This dataset was used to pretrain GPT-2.
+  - [WebText: A Large-Scale Unsupervised Text Corpus](https://arxiv.org/abs/1902.10197) by Radford et al.
+
+- **Common Crawl:** A dataset of web pages from a variety of domains, which is updated regularly. This dataset was used to pretrain GPT-3.
+  - [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165) by Brown et al.
+
+- **BooksCorpus:** A dataset of over 11,000 books from a variety of genres.
+  - [Scalable Methods for 8 Billion Token Language Modeling](https://arxiv.org/abs/1907.05019) by Zhu et al.
+
+- **English Wikipedia:** A dump of the English-language Wikipedia as of 2018, with articles from 2001-2017.
+  - [Improving Language Understanding by Generative Pre-Training](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf) by Radford et al.
+
+- **Toronto Books Corpus:** A dataset of over 7,000 books from a variety of genres, collected by the University of Toronto.
+  - [Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond](https://arxiv.org/abs/1812.10464) by Schwenk and Douze.
+
+- **OpenWebText:** A dataset of web pages that were filtered to remove content that was likely to be low-quality or spammy. This dataset was used to pretrain GPT-3.
+  - [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165) by Brown et al.
+
+
 ## Big Science Model 🚀
 
 - 📚 Papers:
```
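The added section hand-writes each dataset entry as markdown inside app.py. For an app that serves this outline, the same lines could instead be generated from a structured list. A minimal sketch, assuming a hypothetical `render_datasets` helper and a small sample of the entries above (neither is part of the commit):

```python
# Hypothetical sketch: build the "ChatGPT Datasets - Details" markdown
# section from structured data instead of hand-written lines in app.py.

DATASETS = [
    {
        "name": "WebText",
        "blurb": ("A dataset of web pages crawled from domains on the "
                  "Alexa top 5,000 list. This dataset was used to pretrain GPT-2."),
        "paper": ("WebText: A Large-Scale Unsupervised Text Corpus",
                  "https://arxiv.org/abs/1902.10197"),
    },
    {
        "name": "Common Crawl",
        "blurb": ("A dataset of web pages from a variety of domains, which is "
                  "updated regularly. This dataset was used to pretrain GPT-3."),
        "paper": ("Language Models are Few-Shot Learners",
                  "https://arxiv.org/abs/2005.14165"),
    },
]

def render_datasets(datasets):
    """Return the markdown block for the datasets section."""
    lines = ["## ChatGPT Datasets - Details 📚", ""]
    for d in datasets:
        # One bullet per dataset, with its paper as an indented sub-bullet.
        lines.append(f"- **{d['name']}:** {d['blurb']}")
        title, url = d["paper"]
        lines.append(f"  - [{title}]({url})")
        lines.append("")
    return "\n".join(lines)

print(render_datasets(DATASETS))
```

Keeping the entries in one list makes it easy to add a dataset without touching the rendering code, and the generated string can be passed straight to a markdown widget.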