Update README.md
README.md CHANGED
@@ -108,6 +108,26 @@ Note: the config has 300M in the model name but it is actually 500M due to the v
```
litgpt pretrain \
  --config microllama_v2.yaml \
  --resume <LOCAL_PATH_TO_CHECKPOINT_FROM_THIS_REPO>
```
**IMPORTANT NOTE**

I have had various issues resuming training from checkpoints when moving from server to server, specifically when I switched from Lightning AI Studio to a private server. For example, if you store the preprocessed data on S3 and let Lightning AI Studio stream it while training, the Studio may look for the preprocessed chunks under `/root/.lightning/chunks/`, but when I moved to a private server, litgpt tried to look for the same data under `/cache/chunks/`.
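One workaround I would try for that path mismatch (a sketch, not something litgpt documents; both paths below are placeholders for your own setup) is to make the cache path litgpt expects point at wherever the preprocessed chunks actually live:

```
# Sketch of a workaround for the cache-path mismatch: symlink the path litgpt
# searched on the private server (/cache/chunks/) to the directory that really
# holds the preprocessed chunks. Both paths are placeholders -- adjust them.
sudo mkdir -p /cache
sudo ln -s /data/pretrain_chunks /cache/chunks
```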
If you run into issues resuming training, you can convert the checkpoint to an inference checkpoint and then load it with `--initial_checkpoint_dir` instead:

```
litgpt convert_pretrained_checkpoint <LOCAL_PATH_TO_CHECKPOINT_FROM_THIS_REPO> \
  --output_dir <LOCAL_PATH_TO_INFERENCE_CHECKPOINT>

litgpt pretrain \
  --config microllama_v2.yaml \
  --initial_checkpoint_dir <LOCAL_PATH_TO_INFERENCE_CHECKPOINT>
```
You will lose the index into the training dataset as well as other training state such as the learning rate schedule, but this lets you get your pre-training running again quickly.
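Before kicking off a long run from the converted checkpoint, a quick sanity check I would suggest (the path is a placeholder; `litgpt generate` also expects the tokenizer files next to the weights, so copy them into the directory first if they are missing) is to make sure the weights load and produce text:

```
# Hypothetical sanity check -- replace the placeholder path with your converted
# checkpoint directory before running.
litgpt generate <LOCAL_PATH_TO_INFERENCE_CHECKPOINT> \
  --prompt "Once upon a time" \
  --max_new_tokens 50
```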