Update README.md
README.md
CHANGED
---
license: apache-2.0
datasets:
- cerebras/SlimPajama-627B
language:
- en
---

# Overview

This is the repo for intermediate checkpoints of my upcoming **MicroLlama V2** model, a 500-million-parameter model based on **Llama3.2**.
The checkpoints are pretrained from scratch on **SlimPajama-627B**.
This project is still a work in progress: I have only trained on 5B tokens so far, and I will keep the training running until I run out of funds.

Some reasons for using these checkpoints:

- You can use them as a starting point to train your own small language model.
- More interestingly, you can probe the learning process of these models to understand how an LLM learns to mimic human language.

# How to use these checkpoints

These checkpoints are compatible with [litgpt](https://github.com/Lightning-AI/litgpt) with slight modifications (see section below).

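To grab a checkpoint locally first, something like the following should work with `huggingface_hub`. The repo id is taken from the config link further down; the step folder name is a made-up example, so check this repo's file listing for the actual checkpoint names:

```python
# Sketch: download one intermediate checkpoint folder from this repo.
# "step-00050000" is a hypothetical folder name; check the repo's file
# listing for the actual checkpoint names.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="keeeeenw/MicroLlama2-checkpoints",
    allow_patterns=["step-00050000/*"],
    local_dir="checkpoints/microllama-v2",
)
# The checkpoint folder inside this directory is what the conversion command
# below refers to as <LOCAL_PATH_TO_CHECKPOINT_FROM_THIS_REPO>.
print(local_dir)
```
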
In order to load them into Hugging Face transformers models, you will need to convert the litgpt pretraining checkpoint into a litgpt inference-only checkpoint first (no code modification is required):

```
# Install litgpt
pip install 'litgpt[all]'

# litgpt pretrain checkpoint to inference checkpoint
litgpt convert_pretrained_checkpoint <LOCAL_PATH_TO_CHECKPOINT_FROM_THIS_REPO> \
    --output_dir <LOCAL_PATH_TO_INFERENCE_CHECKPOINT>

# litgpt inference checkpoint to HF checkpoint
litgpt convert_from_litgpt <LOCAL_PATH_TO_INFERENCE_CHECKPOINT> <LOCAL_OUTPUT_PATH_TO_CONVERTED_HF_CHECKPOINT>
```

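The converted output is a `model.pth` state dict with Hugging Face-style weight names. Below is a rough sketch (not a verified recipe) of loading it into a transformers Llama model; the architecture fields mirror the litgpt config in the "Advanced usage" section further down, and the path placeholder matches the command above:

```python
# Rough sketch: load the converted state dict into a transformers Llama model.
# The architecture values below are copied from the litgpt config in the
# "Advanced usage" section; adjust them if the checkpoints differ.
import torch
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=128256,            # padded_vocab_size in the litgpt config
    hidden_size=1024,             # n_embd
    num_hidden_layers=12,         # n_layer
    num_attention_heads=16,       # n_head
    num_key_value_heads=4,        # n_query_groups
    intermediate_size=5632,
    rope_theta=500000,
    max_position_embeddings=131072,
)
model = LlamaForCausalLM(config)

# convert_from_litgpt writes the weights as model.pth in the output directory
state_dict = torch.load(
    "<LOCAL_OUTPUT_PATH_TO_CONVERTED_HF_CHECKPOINT>/model.pth",
    weights_only=True,
)
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print(missing, unexpected)  # both should be (close to) empty if the config matches
```
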
References:

1. litgpt pretrain checkpoint to inference checkpoint: https://github.com/Lightning-AI/litgpt/blob/main/tutorials/pretrain_tinyllama.md#export-checkpoints
2. litgpt inference checkpoint to HF checkpoint: https://github.com/Lightning-AI/litgpt/blob/main/tutorials/convert_lit_models.md

# Advanced usage - pretraining with litgpt

For folks who are familiar with [litgpt](https://github.com/Lightning-AI/litgpt), you can add the following config to litgpt's config.py to continue training the model from these checkpoints.

```python
# based on Llama-3.2-1B
dict(
    name="micro-llama-300M-v2",
    hf_config=dict(org="keeeeenw", name="MicroLlamaV2"),
    block_size=131072,  # Stable choice for Llama model training
    # The larger vocab contributes to the 300M to 500M parameter increase.
    # Note that we cannot change this number because the llama3
    # tokenizer is hardcoded to support this vocab size.
    vocab_size=128000,
    padded_vocab_size=128256,
    n_layer=12,
    n_embd=1024,
    n_head=16,
    n_query_groups=4,
    rotary_percentage=1.0,
    parallel_residual=False,
    bias=False,
    norm_class_name="RMSNorm",
    mlp_class_name="LLaMAMLP",
    intermediate_size=5632,
    rope_base=500000,  # Scaling for long sequence support
    # RoPE adjustments for block size of 131072
    rope_adjustments=dict(
        factor=16.0,  # Matches block_size=131072
        low_freq_factor=1.0,
        high_freq_factor=4.0,
        original_max_seq_len=8192,  # Max seq length for 128K token block
    ),
),
```

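If you want to see where the `vocab_size` and `padded_vocab_size` values come from, you can inspect the Llama 3 tokenizer directly. This is just a sanity check and assumes you have access to the gated meta-llama/Llama-3.2-1B repo:

```python
# Sanity check for the vocab fields above (assumes access to the gated
# meta-llama/Llama-3.2-1B repo on Hugging Face).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
print(tok.vocab_size)  # base BPE vocabulary -> should match vocab_size (128000)
print(len(tok))        # incl. special/reserved tokens -> padded_vocab_size (128256)
```
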
You will need to preprocess your data with the **meta-llama/Llama-3.2-1B** tokenizer, similar to [prepare-the-tinyllama-1t-token-dataset](https://github.com/Lightning-AI/litgpt/blob/main/tutorials/pretrain_tinyllama.md#download-datasets), which uses the Llama2 tokenizer.

Assuming you have litgpt installed already:

```
git clone https://huggingface.co/datasets/cerebras/SlimPajama-627B data/slimpajama-raw

litgpt download meta-llama/Llama-3.2-1B \
    --access_token your_hf_token \
    --tokenizer_only true

python litgpt/data/prepare_slimpajama.py \
    --input_dir data/slimpajama-raw/train \
    --output_dir data/slimpajama/train \
    --tokenizer_path checkpoints/meta-llama/Llama-3.2-1B

python litgpt/data/prepare_slimpajama.py \
    --input_dir data/slimpajama-raw/validation \
    --output_dir data/slimpajama/val \
    --tokenizer_path checkpoints/meta-llama/Llama-3.2-1B
```

Please note that this data processing step runs on CPU only and will take a long time if you don't have a CPU with 96+ cores.
I tried to share the converted data as a HF dataset, but HF does not support having too many files within the same directory. I will figure out how to distribute the converted dataset later.

Finally, you can use my config to start training: https://huggingface.co/keeeeenw/MicroLlama2-checkpoints/blob/main/microllama_v2.yaml

Note: the config has 300M in the model name, but the model is actually about 500M parameters due to the vocab size increase from Llama2 to Llama3:

```
litgpt pretrain \
    --config microllama_v2.yaml \
    --resume <PATH_TO_CHECKPOINT>
```

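As a rough sanity check on the 500M figure, here is a back-of-the-envelope parameter count derived from the litgpt config above. It assumes an untied output head (my assumption, not confirmed by the checkpoints); with tied input/output embeddings the total would be closer to 370M:

```python
# Back-of-the-envelope parameter count for the litgpt config above.
# Assumption: a separate (untied) output head, which is what lands near 500M.
n_embd, n_layer, n_head, n_query_groups = 1024, 12, 16, 4
intermediate, padded_vocab = 5632, 128256
head_dim = n_embd // n_head                        # 64

embedding = padded_vocab * n_embd                  # token embeddings
lm_head = padded_vocab * n_embd                    # untied output projection
attn = 2 * n_embd * n_embd                         # q_proj + o_proj
attn += 2 * n_embd * (n_query_groups * head_dim)   # k_proj + v_proj (GQA)
mlp = 3 * n_embd * intermediate                    # gate, up, down (SwiGLU)
per_layer = attn + mlp + 2 * n_embd                # + two RMSNorm weights

total = embedding + lm_head + n_layer * per_layer + n_embd  # + final norm
print(f"{total / 1e6:.0f}M parameters")            # ~502M
```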