keeeeenw committed · commit d5ed034 · verified · 1 parent: 63ae844

Update README.md

Files changed (1): README.md (+113, -3)
---
license: apache-2.0
datasets:
- cerebras/SlimPajama-627B
language:
- en
---

# Overview

This is the repo for intermediate checkpoints of my upcoming **MicroLlama V2** model, a 500-million-parameter model based on **Llama3.2**.
The checkpoints are pretrained from scratch on **SlimPajama-627B**.
This project is still a work in progress; I have only trained on 5B tokens so far, and I will keep the training running until I run out of funds.

Some reasons for using these checkpoints:

- You can use them as a starting point to train your own small language model.
- More interestingly, you can probe the learning process of these models to understand how an LLM learns to mimic human language.

# How to use these checkpoints

These checkpoints are compatible with [litgpt](https://github.com/Lightning-AI/litgpt) with slight modifications (see the section below).

In order to load them with Hugging Face transformers, you first need to convert the litgpt pretraining checkpoint into a litgpt inference-only checkpoint (no code modification is required):

```
# Install litgpt
pip install 'litgpt[all]'

# litgpt pretraining checkpoint to inference checkpoint
litgpt convert_pretrained_checkpoint <LOCAL_PATH_TO_CHECKPOINT_FROM_THIS_REPO> \
    --output_dir <LOCAL_PATH_TO_INFERENCE_CHECKPOINT>

# litgpt inference checkpoint to HF checkpoint
litgpt convert_from_litgpt <LOCAL_PATH_TO_INFERENCE_CHECKPOINT> <LOCAL_OUTPUT_PATH_TO_CONVERTED_HF_CHECKPOINT>
```

Reference:

1. litgpt pretraining checkpoint to inference checkpoint: https://github.com/Lightning-AI/litgpt/blob/main/tutorials/pretrain_tinyllama.md#export-checkpoints
2. litgpt inference checkpoint to HF checkpoint: https://github.com/Lightning-AI/litgpt/blob/main/tutorials/convert_lit_models.md
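
Once converted, the weights can be loaded with Hugging Face transformers. Below is a minimal sketch: it assumes `convert_from_litgpt` wrote a `model.pth` state dict into the output directory, and that the `LlamaConfig` fields mirror the litgpt config shown in the next section. Treat the file name, placeholder path, and hyperparameter mapping as assumptions to adjust for your setup.

```python
# Sketch: load the converted weights into a transformers LlamaForCausalLM.
# The "model.pth" file name and the config values below are assumptions based
# on the litgpt config entry in the "Advanced usage" section.
import torch
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=128256,               # padded_vocab_size in the litgpt config
    hidden_size=1024,                # n_embd
    intermediate_size=5632,
    num_hidden_layers=12,            # n_layer
    num_attention_heads=16,          # n_head
    num_key_value_heads=4,           # n_query_groups
    rope_theta=500000,               # rope_base
    max_position_embeddings=131072,  # block_size
    tie_word_embeddings=False,
)

model = LlamaForCausalLM(config)
state_dict = torch.load(
    "<LOCAL_OUTPUT_PATH_TO_CONVERTED_HF_CHECKPOINT>/model.pth", map_location="cpu"
)
model.load_state_dict(state_dict)
model.eval()
```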

# Advanced usage - pretraining with litgpt

For folks who are familiar with [litgpt](https://github.com/Lightning-AI/litgpt), you can add the following entry to litgpt's config.py and use these checkpoints to continue training the model.

```python
# based on Llama-3.2-1B
dict(
    name="micro-llama-300M-v2",
    hf_config=dict(org="keeeeenw", name="MicroLlamaV2"),
    block_size=131072,  # stable choice for Llama model training
    # The larger vocabulary contributes to the 300M -> 500M parameter increase.
    # Note that we cannot change this number because the Llama3
    # tokenizer is hardcoded to support this vocab size.
    vocab_size=128000,
    padded_vocab_size=128256,
    n_layer=12,
    n_embd=1024,
    n_head=16,
    n_query_groups=4,
    rotary_percentage=1.0,
    parallel_residual=False,
    bias=False,
    norm_class_name="RMSNorm",
    mlp_class_name="LLaMAMLP",
    intermediate_size=5632,
    rope_base=500000,  # scaling for long-sequence support
    # RoPE adjustments for a block size of 131072
    rope_adjustments=dict(
        factor=16.0,  # matches block_size=131072
        low_freq_factor=1.0,
        high_freq_factor=4.0,
        original_max_seq_len=8192,  # max seq length for the 128K token block
    ),
),
```
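
As a quick sanity check on the "300M to 500M" comment in the config above, here is a rough parameter count. It is a back-of-the-envelope sketch that assumes an untied LM head and ignores norm parameters, so treat the exact figure as approximate.

```python
# Rough parameter count for the config above (sketch; untied LM head assumed,
# norm parameters ignored).
n_layer, n_embd, n_head, n_query_groups = 12, 1024, 16, 4
intermediate_size, padded_vocab_size = 5632, 128256

head_size = n_embd // n_head                # 64
kv_dim = n_query_groups * head_size         # 256
attn = n_embd * (n_embd + 2 * kv_dim)       # q/k/v projections
attn += n_embd * n_embd                     # output projection
mlp = 3 * n_embd * intermediate_size        # gate/up/down projections
per_layer = attn + mlp

embedding = padded_vocab_size * n_embd
lm_head = padded_vocab_size * n_embd        # untied head assumption

total = n_layer * per_layer + embedding + lm_head
print(f"~{total / 1e6:.0f}M parameters")    # prints ~502M, i.e. roughly 500M
```

Most of the jump over the 300M-style body comes from the two 128256 x 1024 vocabulary matrices, which is why the Llama3 tokenizer pushes the model to roughly 500M parameters.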

You will need to preprocess your data with the **meta-llama/Llama-3.2-1B** tokenizer, similar to [prepare-the-tinyllama-1t-token-dataset](https://github.com/Lightning-AI/litgpt/blob/main/tutorials/pretrain_tinyllama.md#download-datasets), which uses the Llama2 tokenizer.

Assuming you have litgpt installed already:

```
git clone https://huggingface.co/datasets/cerebras/SlimPajama-627B data/slimpajama-raw

litgpt download meta-llama/Llama-3.2-1B \
    --access_token your_hf_token \
    --tokenizer_only true

python litgpt/data/prepare_slimpajama.py \
    --input_dir data/slimpajama-raw/train \
    --output_dir data/slimpajama/train \
    --tokenizer_path checkpoints/meta-llama/Llama-3.2-1B

python litgpt/data/prepare_slimpajama.py \
    --input_dir data/slimpajama-raw/validation \
    --output_dir data/slimpajama/val \
    --tokenizer_path checkpoints/meta-llama/Llama-3.2-1B
```
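
Before kicking off the long preprocessing run, it can be worth confirming that the downloaded tokenizer has the vocabulary size the config expects. A small sketch, assuming `litgpt download` placed the files under `checkpoints/meta-llama/Llama-3.2-1B` as in the commands above:

```python
# Sanity-check sketch: confirm the tokenizer vocabulary matches
# padded_vocab_size (128256) from the litgpt config.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("checkpoints/meta-llama/Llama-3.2-1B")
print(len(tokenizer))                                   # expected: 128256
print(tokenizer.encode("MicroLlama V2 sanity check"))   # sample token ids
```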

Please note that this data preparation step runs on CPU only and will take a long time unless you have a CPU with 96+ cores.
I have tried to share the converted data as an HF dataset,
but HF does not support having too many files within the same directory. I will figure out how to distribute the converted dataset later.

Finally, you can use my config to start training: https://huggingface.co/keeeeenw/MicroLlama2-checkpoints/blob/main/microllama_v2.yaml

Note: the config has 300M in the model name, but the model is actually 500M due to the vocab size increase from Llama2 to Llama3:

```
litgpt pretrain \
    --config microllama_v2.yaml \
    --resume <PATH_TO_CHECKPOINT>
```
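
To resume from one of the intermediate checkpoints in this repo, you can fetch it with huggingface_hub first. This is a sketch; the `step-*` folder pattern is a hypothetical placeholder, so check the repo's file listing for the actual checkpoint names.

```python
# Sketch: download the config and an intermediate checkpoint from this repo
# to use with --resume. The "step-*" pattern is hypothetical; adjust it to
# the folder names actually present in the repo.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="keeeeenw/MicroLlama2-checkpoints",
    allow_patterns=["microllama_v2.yaml", "step-*/*"],
)
print(local_dir)  # pass the checkpoint folder inside this path as <PATH_TO_CHECKPOINT>
```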