keeeeenw committed
Commit e10cf5d · verified · Parent(s): d5ed034

Update README.md

Files changed (1): README.md (+21 -1)
README.md CHANGED
@@ -108,6 +108,26 @@ Note: the config has 300M in the model name but it is actually 500M due to the v
  ```
  litgpt pretrain \
  --config microllama_v2.yaml \
- --resume <PATH_TO_CHECKPOINT>
+ --resume <LOCAL_PATH_TO_CHECKPOINT_FROM_THIS_REPO>
  ```
 
+ **IMPORTANT NOTE**
+ I have run into various issues when resuming training from checkpoints after moving from server to server, specifically when switching from
+ Lightning AI Studio to a private server. For example, Lightning AI Studio may look for your preprocessed data under ```/root/.lightning/chunks/``` if you
+ store the preprocessed data on S3 and let Lightning AI Studio stream it while training. When I moved to a private server, litgpt instead
+ looked for the same data under ```/cache/chunks/```.
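+
+ If the preprocessed data already exists on the new machine, one possible workaround (a sketch I have not tested on every setup; ```<ACTUAL_CHUNKS_DIR>``` is a placeholder for wherever your chunks actually live) is to symlink the path litgpt expects to the real data directory:
+ ```
+ # Point the cache path litgpt looks for at the actual preprocessed data.
+ mkdir -p /cache
+ ln -s <ACTUAL_CHUNKS_DIR> /cache/chunks
+ ```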
+
+ If you run into any issues with resuming training, you can convert the checkpoint to an inference checkpoint and then load it from there:
+ ```
+ litgpt convert_pretrained_checkpoint <LOCAL_PATH_TO_CHECKPOINT_FROM_THIS_REPO> \
+ --output_dir <LOCAL_PATH_TO_INFERENCE_CHECKPOINT>
+
+ litgpt pretrain \
+ --config microllama_v2.yaml \
+ --initial_checkpoint_dir <LOCAL_PATH_TO_INFERENCE_CHECKPOINT>
+ ```
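+
+ As a quick sanity check before relaunching a long run, you can try loading the converted checkpoint for generation (assuming a recent litgpt version, where ```litgpt generate``` takes a checkpoint directory):
+ ```
+ # Smoke test: load the converted checkpoint and generate a few tokens.
+ litgpt generate <LOCAL_PATH_TO_INFERENCE_CHECKPOINT> --prompt "Once upon a time" --max_new_tokens 50
+ ```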
+
+ You will lose the index into the training dataset as well as other hyperparameters such as the learning rate, but this allows you to restart your pre-training quickly.
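+
+ Since those values are not restored, you can set them explicitly in ```microllama_v2.yaml``` before relaunching. A sketch, assuming the optimizer layout used by recent litgpt pretrain configs (check the config schema of your litgpt version for the exact keys):
+ ```
+ # Hypothetical config excerpt: resume roughly where the old run left off.
+ optimizer:
+   class_path: torch.optim.AdamW
+   init_args:
+     lr: 4.0e-4  # set to the learning rate the interrupted run had reached
+ ```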