grandiose-pizza committed
Commit 53c6e82 • 1 Parent(s): 9dd3950
Update README.md

README.md CHANGED
@@ -72,7 +72,7 @@ Below is sample code to use the model. Note that the model requires a custom mod
 import torch
 from transformers import AutoTokenizer, AutoModelForCausalLM
 
-model_path = "inceptionai/
+model_path = "inceptionai/Jais-family-256m"
 
 device = "cuda" if torch.cuda.is_available() else "cpu"
 
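For context, a minimal sketch of how the corrected `model_path` would be used together with the rest of the README's sample snippet. The prompt, generation settings, and `trust_remote_code=True` (suggested by the hunk header's note about a custom model class) are assumptions for illustration, not part of this diff.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "inceptionai/Jais-family-256m"

device = "cuda" if torch.cuda.is_available() else "cpu"

# trust_remote_code=True is assumed because the README notes the model
# requires a custom model class.
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).to(device)

# Hypothetical prompt purely for illustration.
inputs = tokenizer("What is the capital of UAE?", return_tensors="pt").to(device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```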
@@ -157,20 +157,6 @@ During the adapted pre-training of the (`jais-adapted-*`) models, we first initi
|
|
157 |
|
158 |
During instruction tuning, each training example consists of a single-turn or multi-turn prompt and it's response. Instead of one example per sequence, examples are packed together while the loss is masked on the prompt tokens. This approach speeds up training by allowing more examples to be processed per batch.
|
159 |
|
160 |
-
|
161 |
-
### Training Hyperparameters:
|
162 |
-
|
163 |
-
#### Jais-family-30b-16k
|
164 |
-
| Hyperparameter | Value |
|
165 |
-
|----------------|-------------------------------------------|
|
166 |
-
| Precision | fp32 |
|
167 |
-
| Optimizer | AdamW |
|
168 |
-
| Learning rate | 0 to 0.012(<=69 warmup steps)<br>0.012 to 0.00231(>69 and <=137273 steps)<br>0.00231 to 0.00048(>137273 and <= 260648 steps)<br>0.00048 to 0.000048(>260648 and <=287032 steps)|
|
169 |
-
| Weight decay | 0.1 |
|
170 |
-
| Batch size | 2664(<=137273 steps)<br>748(>137273 and <= 260648 steps)<br>384(>260648 and <=287032 steps)|
|
171 |
-
| Context Length | 2048(<=137273 steps)<br>8192(>137273 and <= 260648 steps)<br>16384(>260648 and <=287032 steps)|
|
172 |
-
| Steps | 287032 |
|
173 |
-
|
174 |
### Compute Infrastructure
|
175 |
|
176 |
The training process was performed on the Condor Galaxy (CG) supercomputer platform. A CG contains 64 Cerebras CS-2 Wafer-Scale Engines (WSE-2) with 40 GB of SRAM, and achieves a total of 960 PetaFLOP/s.
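The instruction-tuning paragraph kept unchanged by this hunk describes packing several examples into one training sequence while masking the loss on prompt tokens. A minimal sketch of that idea, assuming the common Hugging Face convention that label positions set to -100 are ignored by the cross-entropy loss; the example pairs and sequence length below are made up for illustration.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("inceptionai/Jais-family-256m")

# Hypothetical (prompt, response) pairs purely for illustration.
examples = [
    ("What is the capital of UAE?", "Abu Dhabi."),
    ("Say hello in Arabic.", "مرحبا"),
]

max_len = 2048  # assumed packing length for this sketch
input_ids, labels = [], []

for prompt, response in examples:
    prompt_ids = tokenizer(prompt, add_special_tokens=False).input_ids
    response_ids = tokenizer(response + tokenizer.eos_token, add_special_tokens=False).input_ids

    # Pack examples back to back instead of one example per sequence.
    input_ids += prompt_ids + response_ids
    # Mask the loss on prompt tokens: -100 labels are ignored by the loss.
    labels += [-100] * len(prompt_ids) + response_ids

input_ids = input_ids[:max_len]
labels = labels[:max_len]
```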