cerebras/Cerebras-GPT-256M


rskuzma committed on Mar 30, 2023

Commit 92311c7 • 1 Parent(s): d92fd27

fixed typo

Files changed (1)

README.md +1 -1

@@ -107,7 +107,7 @@ Recent works find significant duplicate data present in the Pile. Eleuther’s P
 We use the GPT-3 style model architecture. All of our layers use full attention as opposed to the GPT-3 style sparse banded attention. The model shapes were selected to either follow aspect ratio 80 or are the same shape as GPT-3 models. Learning rate warmed up for 375M tokens (1500 steps for 111M and 256M models) and 10x cosine decayed. No dropout was used and weight decay was set to 0.1. All models are trained with MSL of 2048.
-All models were trained to Chinchilla point: 20x more tokens than model parameters. Number of steps changed based on fixed batch size (2048) and sequence length (varied by model). See Training Table, below, for detail.
+All models were trained to Chinchilla point: 20 tokens per model parameter. Number of steps was chosen based on optimal batch size (varied by model) and fixed sequence length (2048). See Training Table, below, for details.
 <br>
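
The corrected sentence implies a simple relationship between parameter count, token budget, and step count. A minimal sketch of that arithmetic follows, assuming the step count is the token budget divided by tokens per step and rounded up. The function names and the batch size of 256 are hypothetical and chosen only for illustration; the actual per-model batch sizes are in the README's Training Table, which is not part of this diff.

```python
import math

TOKENS_PER_PARAM = 20  # Chinchilla point as stated in the README
SEQ_LEN = 2048         # fixed sequence length (MSL) as stated in the README


def chinchilla_tokens(n_params: int) -> int:
    """Token budget at 20 tokens per model parameter."""
    return TOKENS_PER_PARAM * n_params


def training_steps(n_params: int, batch_size: int, seq_len: int = SEQ_LEN) -> int:
    """Steps needed to consume the token budget (assumed: rounded up)."""
    tokens_per_step = batch_size * seq_len
    return math.ceil(chinchilla_tokens(n_params) / tokens_per_step)


if __name__ == "__main__":
    n_params = 256_000_000  # nominal size of the 256M model
    batch_size = 256        # hypothetical value, not taken from the README
    print(chinchilla_tokens(n_params))           # 5120000000
    print(training_steps(n_params, batch_size))  # 9766
```

Because the sequence length is fixed, only the batch size changes the step count from one model to the next, which is the point the reworded line makes.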