Update README.md
README.md CHANGED
@@ -108,6 +108,26 @@ Note: the config has 300M in the model name but it is actually 500M due to the v
```
litgpt pretrain \
  --config microllama_v2.yaml \
  --resume <LOCAL_PATH_TO_CHECKPOINT_FROM_THIS_REPO>
```
**IMPORTANT NOTE**

I have had various issues resuming training from checkpoints when moving from server to server, specifically when I switched from Lightning AI Studio to a private server. For example, if you store the preprocessed data on S3 and let Lightning AI Studio stream it while training, the Studio may look for the preprocessed chunks under `/root/.lightning/chunks/`, but when I moved to a private server, litgpt tried to look for the same data under `/cache/chunks/`.
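One workaround I would try for that path mismatch (a sketch, not something litgpt documents; both paths below are placeholders for your own setup) is to make the cache path litgpt expects point at wherever the preprocessed chunks actually live:

```
# Sketch of a workaround for the cache-path mismatch: symlink the path litgpt
# searched on the private server (/cache/chunks/) to the directory that really
# holds the preprocessed chunks. Both paths are placeholders -- adjust them.
sudo mkdir -p /cache
sudo ln -s /data/pretrain_chunks /cache/chunks
```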
If you run into issues resuming training, you can convert the checkpoint to an inference checkpoint and then load it with `--initial_checkpoint_dir` instead:

```
litgpt convert_pretrained_checkpoint <LOCAL_PATH_TO_CHECKPOINT_FROM_THIS_REPO> \
  --output_dir <LOCAL_PATH_TO_INFERENCE_CHECKPOINT>

litgpt pretrain \
  --config microllama_v2.yaml \
  --initial_checkpoint_dir <LOCAL_PATH_TO_INFERENCE_CHECKPOINT>
```
You will lose the index into the training dataset as well as other training state such as the learning rate schedule, but this lets you get your pre-training running again quickly.
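Before kicking off a long run from the converted checkpoint, a quick sanity check I would suggest (the path is a placeholder; `litgpt generate` also expects the tokenizer files next to the weights, so copy them into the directory first if they are missing) is to make sure the weights load and produce text:

```
# Hypothetical sanity check -- replace the placeholder path with your converted
# checkpoint directory before running.
litgpt generate <LOCAL_PATH_TO_INFERENCE_CHECKPOINT> \
  --prompt "Once upon a time" \
  --max_new_tokens 50
```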