rskuzma committed on
Commit
9e164f3
1 Parent(s): c798cb7

Update README.md

Files changed (1)
  1. README.md +29 -37
README.md CHANGED
@@ -11,7 +11,7 @@ pipeline_tag: text-generation
11
  ---
12
 
13
  # Cerebras-GPT 256M
14
- [TODO: arXiv paper](https://www.cerebras.net), [TODO: Blog Post](https://www.cerebras.net)
15
 
16
  ## Model Description
17
 
@@ -39,7 +39,7 @@ Cerebras systems for pre-training and fine tuning are available in the cloud via
39
  * Language: English
40
  * Learn more: Dense Scaling Laws Paper for training procedure, config files, and details on how to use.
41
 
42
- **Contact**: To ask questions about Cerebras-GPT models, join the Cerebras Discord, and post them in **#scaling-laws-release.**
43
 
44
  This is the standard parameterization version of Cerebras-GPT with **256M** parameters
45
 
@@ -49,13 +49,13 @@ Related models: [Cerebras-GPT Models](https://huggingface.co/models?sort=downloa
49
 
50
  | Model | Parameters | Layers | d_model | Heads | d_head | d_ffn | LR | BS (seq) | BS (tokens) |
51
  |---------------|------------|--------|---------|-------|--------|--------|----------|----------|----------------|
52
- | Cerebras-GPT | 111M | 10 | 768 | 12 | 64 | 3072 | 6.00E-04 | 120 | 246K |
53
- | Cerebras-GPT | 256M | 14 | 1088 | 17 | 64 | 4352 | 6.00E-04 | 264 | 541K |
54
- | Cerebras-GPT | 590M | 18 | 1536 | 12 | 128 | 6144 | 2.00E-04 | 264 | 541K |
55
- | Cerebras-GPT | 1.3B | 24 | 2048 | 16 | 128 | 8192 | 2.00E-04 | 528 | 1.08M |
56
- | Cerebras-GPT | 2.7B | 32 | 2560 | 20 | 128 | 10240 | 2.00E-04 | 528 | 1.08M |
57
- | Cerebras-GPT | 6.7B | 32 | 4096 | 32 | 128 | 16384 | 1.20E-04 | 1040 | 2.13M |
58
- | Cerebras-GPT | 13B | 40 | 5120 | 40 | 128 | 20480 | 1.20E-04 | 720/1080 | 1.47M/2.21M |
59
 
60
  <br><br>
61
 
@@ -65,8 +65,8 @@ This model can be easily loaded using the AutoModelForCausalLM functionality:
65
  ```python
66
  from transformers import AutoTokenizer, AutoModelForCausalLM
67
 
68
- tokenizer = AutoTokenizer.from_pretrained("Cerebras/Cerebras-GPT-256M")
69
- model = AutoModelForCausalLM.from_pretrained("Cerebras/Cerebras-GPT-256M")
70
 
71
  text = "Generative AI is "
72
  ```
@@ -95,17 +95,17 @@ print(text_output[0])
95
 
96
  ## Training data
97
 
98
- Cerebras-GPT is trained using [the Pile](https://pile.eleuther.ai) dataset from [EleutherAI](https://www.eleuther.ai). See the [Pile paper](https://arxiv.org/abs/2101.00027) for a more detailed breakdown of data sources and methodology.
99
 
100
- Recent works find significant duplicate data present in the Pile. Eleuther’s Pythia applies a deduplication process to reduce replicated data, decreasing the total token count by 33%. Our models are trained on the Pile **without deduplication**, which presents an opportunity for further improvement with the deduplicated data set.
101
 
102
- Our tokenized version of the Pile has 371B tokens. We used byte-pair encoding, a vocabulary size of 50257, and a maximum sequence length of 2048. We include more details about the training dataset preprocessing in Appendix B.1 of [TODO: our paper](https://www.cerebras.net).
103
 
104
  <br><br>
105
 
106
  ## Training procedure
107
 
108
- We use the GPT-3 style model architecture. All of our layers use full attention as opposed to the GPT-3 style sparse banded attention. The model shapes were selected to either follow aspect ratio 80 or match the shape of GPT-3 models. The learning rate was warmed up over 375M tokens (1500 steps for the 111M and 256M models) and then cosine decayed by 10x. No dropout was used and weight decay was set to 0.1.
109
 
110
  All models were trained to the Chinchilla point: 20x more tokens than model parameters. The number of steps was set by the fixed sequence length (2048) and the batch size (varied by model). See the Training Table, below, for details.
111
 
@@ -125,7 +125,10 @@ Model Params | Sequence Length | Batch Size | Number of Steps | Tokens | Tokens
125
 
126
  ## Evaluations
127
 
128
- We evaluate our models on the PILE validation set comprising 380M tokens. In our paper we also evaluate the public checkpoints of Pythia, Eleuther (2022); OPT, Zhang et al. (2022); GPT-NeoX 20B, Black et al. (2022); and GPT-J 6B, Wang & Komatsuzaki (2021). We trained models from smallest to largest and fit a power law as we went along. The power law was helpful for extrapolating the validation loss of the next largest model we trained and provided confidence about whether the training run was going well.
 
 
 
129
 
130
  #### 0-shot Evaluation
131
  | Model | Params | Training FLOPs | PILE test xent | Hella-Swag | PIQA | Wino-Grande | Lambada | ARC-e | ARC-c | OpenBookQA | Downstream Average |
@@ -135,7 +138,7 @@ We evaluate our models on the PILE validation set comprising 380M tokens. In our
135
  | Cerebras-GPT | 590M | 5.3E+19 | 2.184 | 0.291 | 0.627 | 0.498 | 0.366 | 0.464 | 0.190 | 0.158 | 0.370 |
136
  | Cerebras-GPT | 1.3B | 2.5E+20 | 1.996 | 0.325 | 0.664 | 0.521 | 0.462 | 0.508 | 0.224 | 0.166 | 0.410 |
137
  | Cerebras-GPT | 2.7B | 9.8E+20 | 1.834 | 0.386 | 0.701 | 0.559 | 0.567 | 0.571 | 0.246 | 0.206 | 0.462 |
138
- | Cerebras-GPT | 6.7B | 5.9E+21 | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO |
139
  | Cerebras-GPT | 13B | 2.1E+22 | 1.575 | 0.513 | 0.766 | 0.646 | 0.696 | 0.714 | 0.367 | 0.286 | 0.570 |
140
 
141
  #### 5-shot Evaluation
@@ -146,7 +149,7 @@ We evaluate our models on the PILE validation set comprising 380M tokens. In our
146
  | Cerebras-GPT | 590M | 0.291 | 0.634 | 0.479 | 0.281 | 0.475 | 0.206 | 0.152 |
147
  | Cerebras-GPT | 1.3B | 0.326 | 0.668 | 0.536 | 0.395 | 0.529 | 0.241 | 0.174 |
148
  | Cerebras-GPT | 2.7B | 0.382 | 0.697 | 0.543 | 0.487 | 0.590 | 0.267 | 0.224 |
149
- | Cerebras-GPT | 6.7B | TODO | TODO | TODO | TODO | TODO | TODO | TODO |
150
  | Cerebras-GPT | 13B | 0.514 | 0.768 | 0.674 | 0.655 | 0.743 | 0.398 | 0.318 |
151
 
152
 
@@ -155,36 +158,25 @@ We evaluate our models on the PILE validation set comprising 380M tokens. In our
155
  ## Uses and Limitations
156
 
157
  ### Intended Use
158
- The models we train are being open-sourced to further research into LLM scaling laws, but we release these models with a fully permissive Apache license for the community to use freely.
159
 
160
- You may fine-tune and adapt Cerebras-GPT models for deployment via either Cerebras [Model Studio](https://www.cerebras.net/product-cloud/) or the Hugging Face Transformers Library. We recommend assessing potential bias and harms prior to deployment of any LLM.
161
 
 
162
 
163
  ### Out of Scope Use
164
  Cerebras-GPT models are trained on the Pile, with English language only, and are not suitable for machine translation tasks.
165
 
166
  Cerebras-GPT models have not been tuned for human-facing dialog applications like chatbots and will not respond to prompts in a similar way to models that have received instruction tuning or reinforcement learning from human feedback (RLHF) like Flan-T5 or ChatGPT. Cerebras-GPT models can be tuned using those methods.
167
 
168
- ### Risk and Bias
169
- Like many large text corpora, the Pile contains offensive text. Cerebras-GPT models trained on this text may create offensive or undesirable text outputs regardless of whether the initial prompt is offensive. Human filtering of responses is recommended.
 
 
 
170
 
171
  <br><br>
172
 
173
- ## Citation and Related Information
174
-
175
- ### BibTeX entry
176
-
177
- To cite this model:
178
- ```bibtex
179
- @misc{Cerebras-GPT,
180
- author = {Nolan Dey and Gurpreet Gosal and Charles Chen and Hemant Khachane and Ribhu Pathria and William Marshall and Marvin Tom and Joel Hestness},
181
- title = {GPT-3 Scaling Laws for the PILE Dataset, Trained on the Cerebras Wafer-Scale Engine},
182
- year = {2023},
183
- month = {March},
184
- howpublished = {\url{https://www.cerebras.net/TODO/dense-scaling-laws/TODO}}
185
- }
186
- ```
187
-
188
  ## Acknowledgements
189
 
190
  We are thankful to all Cerebras engineers, past and present, that made this work possible.
 
11
  ---
12
 
13
  # Cerebras-GPT 256M
14
+ Check out our [Blog Post](https://www.cerebras.net/cerebras-gpt). Our arXiv paper is coming soon!
15
 
16
  ## Model Description
17
 
 
39
  * Language: English
40
  * Learn more: Dense Scaling Laws Paper for training procedure, config files, and details on how to use.
41
 
42
+ **Contact**: To ask questions about Cerebras-GPT models, join the [Cerebras Discord](https://discord.com/invite/ZZf3Q2wc), and post them in **#scaling-laws.**
43
 
44
  This is the standard parameterization version of Cerebras-GPT with **256M** parameters
45
 
 
49
 
50
  | Model | Parameters | Layers | d_model | Heads | d_head | d_ffn | LR | BS (seq) | BS (tokens) |
51
  |---------------|------------|--------|---------|-------|--------|--------|----------|----------|----------------|
52
+ | Cerebras-GPT | 111M | 10 | 768 | 12 | 64 | 3072 | 6.0E-04 | 120 | 246K |
53
+ | Cerebras-GPT | 256M | 14 | 1088 | 17 | 64 | 4352 | 6.0E-04 | 264 | 541K |
54
+ | Cerebras-GPT | 590M | 18 | 1536 | 12 | 128 | 6144 | 2.0E-04 | 264 | 541K |
55
+ | Cerebras-GPT | 1.3B | 24 | 2048 | 16 | 128 | 8192 | 2.0E-04 | 528 | 1.08M |
56
+ | Cerebras-GPT | 2.7B | 32 | 2560 | 20 | 128 | 10240 | 2.0E-04 | 528 | 1.08M |
57
+ | Cerebras-GPT | 6.7B | 32 | 4096 | 32 | 128 | 16384 | 1.2E-04 | 1040 | 2.13M |
58
+ | Cerebras-GPT | 13B | 40 | 5120 | 40 | 128 | 20480 | 1.2E-04 | 720 &rarr; 1080 | 1.47M &rarr; 2.21M |
59
 
60
  <br><br>
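
The shape columns in the table above can be cross-checked with a few lines of arithmetic. The sketch below assumes the usual GPT-style conventions (`d_model = heads * d_head` and `d_ffn = 4 * d_model`), which this README does not state explicitly, and assumes a sequence length of 2048 tokens when converting `BS (seq)` into `BS (tokens)`. `ROWS`, `SEQ_LEN`, and `check_row` are illustrative names, not part of any Cerebras tooling.

```python
# Rows copied from the architecture table:
# (model, layers, d_model, heads, d_head, d_ffn, batch size in sequences)
ROWS = [
    ("111M", 10, 768, 12, 64, 3072, 120),
    ("256M", 14, 1088, 17, 64, 4352, 264),
    ("590M", 18, 1536, 12, 128, 6144, 264),
    ("1.3B", 24, 2048, 16, 128, 8192, 528),
    ("2.7B", 32, 2560, 20, 128, 10240, 528),
    ("6.7B", 32, 4096, 32, 128, 16384, 1040),
    # The table lists 720 -> 1080 for 13B; only the initial value is used here.
    ("13B", 40, 5120, 40, 128, 20480, 720),
]

# Assumption: every sequence holds 2048 tokens.
SEQ_LEN = 2048


def check_row(row):
    """Return consistency flags and the tokens-per-batch value for one row."""
    name, _layers, d_model, heads, d_head, d_ffn, bs_seq = row
    return {
        "model": name,
        "heads_ok": heads * d_head == d_model,   # assumed convention
        "ffn_ok": d_ffn == 4 * d_model,          # assumed convention
        "tokens_per_batch": bs_seq * SEQ_LEN,
    }


results = [check_row(row) for row in ROWS]
for result in results:
    print(result)
```

Each `tokens_per_batch` value should round to the corresponding `BS (tokens)` entry; for the 256M model, 264 × 2048 = 540,672, which the table reports as 541K.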
61
 
 
65
  ```python
66
  from transformers import AutoTokenizer, AutoModelForCausalLM
67
 
68
+ tokenizer = AutoTokenizer.from_pretrained("cerebras/Cerebras-GPT-256M")
69
+ model = AutoModelForCausalLM.from_pretrained("cerebras/Cerebras-GPT-256M")
70
 
71
  text = "Generative AI is "
72
  ```
 
95
 
96
  ## Training data
97
 
98
+ Cerebras-GPT is trained using [the Pile](https://pile.eleuther.ai) dataset from [EleutherAI](https://www.eleuther.ai). See the [Pile paper](https://arxiv.org/abs/2101.00027) for a more detailed breakdown of data sources and methodology. The Pile was cleaned using the ftfy library to normalize the text, then filtered using scripts provided by Eleuther.
99
 
100
+ We tokenized the data with byte-pair encoding and the GPT-2 vocabulary. Our tokenized version of the Pile has 371B tokens. We include more details about the training dataset preprocessing in Appendix B.1 of our paper.
101
 
102
+ Recent works find significant duplicate data present in the Pile. Eleuther’s Pythia applies a deduplication process to reduce replicated data, decreasing the total token count by 33%. Our models are trained on the Pile **without deduplication**, which presents an opportunity for further improvement with the deduplicated data set.
103
 
104
  <br><br>
105
 
106
  ## Training procedure
107
 
108
+ We use the GPT-3 style model architecture. All of our layers use full attention as opposed to the GPT-3 style sparse banded attention. The model shapes were selected to either follow aspect ratio 80 or match the shape of GPT-3 models. The learning rate was warmed up over 375M tokens (1500 steps for the 111M and 256M models) and then cosine decayed by 10x. No dropout was used and weight decay was set to 0.1. All models were trained with a maximum sequence length (MSL) of 2048.
109
 
110
  All models were trained to the Chinchilla point: 20x more tokens than model parameters. The number of steps was set by the fixed sequence length (2048) and the batch size (varied by model). See the Training Table, below, for details.
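
As a rough illustration of the Chinchilla-point budget described above, the sketch below computes the token target and an approximate step count for the 256M model. It assumes the nominal parameter count (256M, which is a rounded figure), the `BS (seq)` value of 264 from the architecture table, and a sequence length of 2048. `chinchilla_budget` is a hypothetical helper, and the Training Table remains the authoritative source for the actual step counts.

```python
TOKENS_PER_PARAM = 20  # Chinchilla point: 20 tokens per model parameter
SEQ_LEN = 2048         # assumed tokens per sequence


def chinchilla_budget(n_params, batch_size_seqs, seq_len=SEQ_LEN):
    """Return (target tokens, tokens per step, steps needed to reach the target)."""
    target_tokens = TOKENS_PER_PARAM * n_params
    tokens_per_step = batch_size_seqs * seq_len
    # Ceiling division: the last partial step still counts as a step.
    steps = -(-target_tokens // tokens_per_step)
    return target_tokens, tokens_per_step, steps


# Nominal 256M parameters, batch size of 264 sequences.
tokens, per_step, steps = chinchilla_budget(256_000_000, 264)
print(f"target tokens: {tokens:,}")
print(f"tokens per step: {per_step:,}")
print(f"approximate steps: {steps:,}")
```

Because the nominal parameter count is rounded, the step count printed here is only an estimate and may differ from the value reported in the Training Table.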
111
 
 
125
 
126
  ## Evaluations
127
 
128
+ We trained models from smallest to largest and fit a power law as we went along. The power law was helpful for extrapolating the validation loss of the next largest model we trained and provided confidence about whether the training run was going well.
129
+
130
+ We performed upstream (pre-training) evaluations of text prediction cross-entropy using the Pile validation and test splits. We performed downstream evaluations of text generation accuracy on standardized tasks using the [Eleuther lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness). Results are compared against many publicly available large language models in Section 3 of the paper.
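
The power-law extrapolation mentioned above can be sketched with the `Training FLOPs` and `PILE test xent` columns of the 0-shot table in this model card. This is a minimal illustration, assuming a plain straight-line fit in log-log space with no irreducible-loss term; it is not necessarily the fitting procedure used in the paper. `POINTS`, `fit_power_law`, and `predict_loss` are illustrative names.

```python
import math

# (training FLOPs, PILE test cross-entropy) copied from the 0-shot table
POINTS = {
    "590M": (5.3e19, 2.184),
    "1.3B": (2.5e20, 1.996),
    "2.7B": (9.8e20, 1.834),
    "6.7B": (5.9e21, 1.704),
    "13B": (2.1e22, 1.575),
}


def fit_power_law(points):
    """Least-squares fit of log10(loss) = intercept + slope * log10(flops)."""
    xs = [math.log10(flops) for flops, _ in points]
    ys = [math.log10(loss) for _, loss in points]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def predict_loss(flops, slope, intercept):
    """Evaluate the fitted power law at a given compute budget."""
    return 10 ** (intercept + slope * math.log10(flops))


# Fit on the four smaller models, then extrapolate to the 13B compute budget.
smaller = [POINTS[name] for name in ("590M", "1.3B", "2.7B", "6.7B")]
slope, intercept = fit_power_law(smaller)
predicted_13b = predict_loss(POINTS["13B"][0], slope, intercept)

print(f"fitted exponent: {slope:.4f}")
print(f"predicted 13B loss: {predicted_13b:.3f} (reported: {POINTS['13B'][1]})")
```

Comparing the extrapolated value with the reported 13B cross-entropy gives a quick sanity check of the kind described above: a large gap would suggest that a run is not following the trend set by the smaller models.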
131
+
132
 
133
  #### 0-shot Evaluation
134
  | Model | Params | Training FLOPs | PILE test xent | Hella-Swag | PIQA | Wino-Grande | Lambada | ARC-e | ARC-c | OpenBookQA | Downstream Average |
 
138
  | Cerebras-GPT | 590M | 5.3E+19 | 2.184 | 0.291 | 0.627 | 0.498 | 0.366 | 0.464 | 0.190 | 0.158 | 0.370 |
139
  | Cerebras-GPT | 1.3B | 2.5E+20 | 1.996 | 0.325 | 0.664 | 0.521 | 0.462 | 0.508 | 0.224 | 0.166 | 0.410 |
140
  | Cerebras-GPT | 2.7B | 9.8E+20 | 1.834 | 0.386 | 0.701 | 0.559 | 0.567 | 0.571 | 0.246 | 0.206 | 0.462 |
141
+ | Cerebras-GPT | 6.7B | 5.9E+21 | 1.704 | 0.447 | 0.739 | 0.602 | 0.636 | 0.643 | 0.282 | 0.238 | 0.512 |
142
  | Cerebras-GPT | 13B | 2.1E+22 | 1.575 | 0.513 | 0.766 | 0.646 | 0.696 | 0.714 | 0.367 | 0.286 | 0.570 |
143
 
144
  #### 5-shot Evaluation
 
149
  | Cerebras-GPT | 590M | 0.291 | 0.634 | 0.479 | 0.281 | 0.475 | 0.206 | 0.152 |
150
  | Cerebras-GPT | 1.3B | 0.326 | 0.668 | 0.536 | 0.395 | 0.529 | 0.241 | 0.174 |
151
  | Cerebras-GPT | 2.7B | 0.382 | 0.697 | 0.543 | 0.487 | 0.590 | 0.267 | 0.224 |
152
+ | Cerebras-GPT | 6.7B | 0.444 | 0.736 | 0.590 | 0.591 | 0.667 | 0.314 | 0.270 |
153
  | Cerebras-GPT | 13B | 0.514 | 0.768 | 0.674 | 0.655 | 0.743 | 0.398 | 0.318 |
154
 
155
 
 
158
  ## Uses and Limitations
159
 
160
  ### Intended Use
161
+ The primary intended use is to further research into large language models. These models can be used as a foundation model for NLP, applications, ethics, and alignment research. Our primary intended users are researchers who are working to improve LLMs and practitioners seeking reference implementations, training setups, hyperparameters, or pre-trained models. We release these models with a fully permissive Apache license for the community to use freely.
162
 
163
+ You may fine-tune and adapt Cerebras-GPT models for deployment via either Cerebras [Model Studio](https://www.cerebras.net/product-cloud/) or third-party libraries. Further safety-related testing and mitigations should be applied before using the Cerebras-GPT model family in production downstream applications.
164
 
165
+ Due to financial and compute budgets, Cerebras-GPT models were only trained and evaluated following the approaches described in the paper.
166
 
167
  ### Out of Scope Use
168
  Cerebras-GPT models are trained on the Pile, with English language only, and are not suitable for machine translation tasks.
169
 
170
  Cerebras-GPT models have not been tuned for human-facing dialog applications like chatbots and will not respond to prompts in a similar way to models that have received instruction tuning or reinforcement learning from human feedback (RLHF) like Flan-T5 or ChatGPT. Cerebras-GPT models can be tuned using those methods.
171
 
172
+ ### Risk, Bias, Ethical Considerations
173
+ * **Data**: The Pile dataset has been thoroughly analyzed from various ethical standpoints such as toxicity analysis, gender bias, pejorative content, racially sensitive content etc. Please refer to Pile dataset references.
174
+ * **Human life**: The outputs from this model may or may not align with human values. The risk needs to be thoroughly investigated before deploying this model in a production environment where it can directly impact human life.
175
+ * **Risks and harms**: There can be distributional bias in the Pile dataset that can manifest in various forms in the downstream model deployment. There are other risks associated with large language models such as amplifying stereotypes, memorizing training data, or revealing private or secure information.
176
+ * **Mitigations**: Only mitigations in standard Pile dataset pre-processing were employed when pre-training Cerebras-GPT.
177
 
178
  <br><br>
179
 
180
  ## Acknowledgements
181
 
182
  We are thankful to all Cerebras engineers, past and present, that made this work possible.