Update README.md
README.md
CHANGED
---
license: apache-2.0
datasets:
- cerebras/SlimPajama-627B
language:
- en
---

# Overview

This is the repo for intermediate checkpoints of my upcoming **MicroLlama V2** model, a 500-million-parameter model based on **Llama3.2**.
The checkpoints are pretrained from scratch on **SlimPajama-627B**.
This project is still a work in progress: I have only trained on 5B tokens so far, and I will keep the training running until I run out of funds.

Some reasons for using these checkpoints:

- You can use them as a starting point to train your own small language model.
- More interestingly, you can probe the learning process of these models to understand how an LLM learns to mimic human language.

# How to use these checkpoints

These checkpoints are compatible with [litgpt](https://github.com/Lightning-AI/litgpt) with slight modifications (see section below).

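To grab a checkpoint locally first, something like the following should work with `huggingface_hub`. The repo id is taken from the config link further down; the step folder name is a made-up example, so check this repo's file listing for the actual checkpoint names:

```python
# Sketch: download one intermediate checkpoint folder from this repo.
# "step-00050000" is a hypothetical folder name; check the repo's file
# listing for the actual checkpoint names.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="keeeeenw/MicroLlama2-checkpoints",
    allow_patterns=["step-00050000/*"],
    local_dir="checkpoints/microllama-v2",
)
# The checkpoint folder inside this directory is what the conversion command
# below refers to as <LOCAL_PATH_TO_CHECKPOINT_FROM_THIS_REPO>.
print(local_dir)
```
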
In order to load them into Hugging Face transformers models, you will need to convert the litgpt pretraining checkpoint into a litgpt inference-only checkpoint first (no code modification is required):

```
# Install litgpt
pip install 'litgpt[all]'

# litgpt pretrain checkpoint to inference checkpoint
litgpt convert_pretrained_checkpoint <LOCAL_PATH_TO_CHECKPOINT_FROM_THIS_REPO> \
    --output_dir <LOCAL_PATH_TO_INFERENCE_CHECKPOINT>

# litgpt inference checkpoint to HF checkpoint
litgpt convert_from_litgpt <LOCAL_PATH_TO_INFERENCE_CHECKPOINT> <LOCAL_OUTPUT_PATH_TO_CONVERTED_HF_CHECKPOINT>
```

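The converted output is a `model.pth` state dict with Hugging Face-style weight names. Below is a rough sketch (not a verified recipe) of loading it into a transformers Llama model; the architecture fields mirror the litgpt config in the "Advanced usage" section further down, and the path placeholder matches the command above:

```python
# Rough sketch: load the converted state dict into a transformers Llama model.
# The architecture values below are copied from the litgpt config in the
# "Advanced usage" section; adjust them if the checkpoints differ.
import torch
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=128256,            # padded_vocab_size in the litgpt config
    hidden_size=1024,             # n_embd
    num_hidden_layers=12,         # n_layer
    num_attention_heads=16,       # n_head
    num_key_value_heads=4,        # n_query_groups
    intermediate_size=5632,
    rope_theta=500000,
    max_position_embeddings=131072,
)
model = LlamaForCausalLM(config)

# convert_from_litgpt writes the weights as model.pth in the output directory
state_dict = torch.load(
    "<LOCAL_OUTPUT_PATH_TO_CONVERTED_HF_CHECKPOINT>/model.pth",
    weights_only=True,
)
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print(missing, unexpected)  # both should be (close to) empty if the config matches
```
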
References:

1. litgpt pretrain checkpoint to inference checkpoint: https://github.com/Lightning-AI/litgpt/blob/main/tutorials/pretrain_tinyllama.md#export-checkpoints
2. litgpt inference checkpoint to HF checkpoint: https://github.com/Lightning-AI/litgpt/blob/main/tutorials/convert_lit_models.md

# Advanced usage - pretraining with litgpt

For folks who are familiar with [litgpt](https://github.com/Lightning-AI/litgpt), you can add the following config to litgpt's config.py to continue training the model from these checkpoints.

```python
# based on Llama-3.2-1B
dict(
    name="micro-llama-300M-v2",
    hf_config=dict(org="keeeeenw", name="MicroLlamaV2"),
    block_size=131072,  # Stable choice for Llama model training
    # The larger vocab contributes to the 300M to 500M parameter increase.
    # Note that we cannot change this number because the llama3
    # tokenizer is hardcoded to support this vocab size.
    vocab_size=128000,
    padded_vocab_size=128256,
    n_layer=12,
    n_embd=1024,
    n_head=16,
    n_query_groups=4,
    rotary_percentage=1.0,
    parallel_residual=False,
    bias=False,
    norm_class_name="RMSNorm",
    mlp_class_name="LLaMAMLP",
    intermediate_size=5632,
    rope_base=500000,  # Scaling for long sequence support
    # RoPE adjustments for block size of 131072
    rope_adjustments=dict(
        factor=16.0,  # Matches block_size=131072
        low_freq_factor=1.0,
        high_freq_factor=4.0,
        original_max_seq_len=8192,  # Max seq length for 128K token block
    ),
),
```

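If you want to see where the `vocab_size` and `padded_vocab_size` values come from, you can inspect the Llama 3 tokenizer directly. This is just a sanity check and assumes you have access to the gated meta-llama/Llama-3.2-1B repo:

```python
# Sanity check for the vocab fields above (assumes access to the gated
# meta-llama/Llama-3.2-1B repo on Hugging Face).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
print(tok.vocab_size)  # base BPE vocabulary -> should match vocab_size (128000)
print(len(tok))        # incl. special/reserved tokens -> padded_vocab_size (128256)
```
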
You will need to preprocess your data with the **meta-llama/Llama-3.2-1B** tokenizer, similar to [prepare-the-tinyllama-1t-token-dataset](https://github.com/Lightning-AI/litgpt/blob/main/tutorials/pretrain_tinyllama.md#download-datasets), which uses the Llama2 tokenizer.

Assuming you have litgpt installed already:

```
git clone https://huggingface.co/datasets/cerebras/SlimPajama-627B data/slimpajama-raw

litgpt download meta-llama/Llama-3.2-1B \
    --access_token your_hf_token \
    --tokenizer_only true

python litgpt/data/prepare_slimpajama.py \
    --input_dir data/slimpajama-raw/train \
    --output_dir data/slimpajama/train \
    --tokenizer_path checkpoints/meta-llama/Llama-3.2-1B

python litgpt/data/prepare_slimpajama.py \
    --input_dir data/slimpajama-raw/validation \
    --output_dir data/slimpajama/val \
    --tokenizer_path checkpoints/meta-llama/Llama-3.2-1B
```

Please note that this data processing step runs on CPU only and will take a long time if you don't have a CPU with 96+ cores.
I tried to share the converted data as a HF dataset, but HF does not support having too many files within the same directory. I will figure out how to distribute the converted dataset later.

Finally, you can use my config to start training: https://huggingface.co/keeeeenw/MicroLlama2-checkpoints/blob/main/microllama_v2.yaml

Note: the config has 300M in the model name, but the model is actually about 500M parameters due to the vocab size increase from Llama2 to Llama3:

```
litgpt pretrain \
    --config microllama_v2.yaml \
    --resume <PATH_TO_CHECKPOINT>
```

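As a rough sanity check on the 500M figure, here is a back-of-the-envelope parameter count derived from the litgpt config above. It assumes an untied output head (my assumption, not confirmed by the checkpoints); with tied input/output embeddings the total would be closer to 370M:

```python
# Back-of-the-envelope parameter count for the litgpt config above.
# Assumption: a separate (untied) output head, which is what lands near 500M.
n_embd, n_layer, n_head, n_query_groups = 1024, 12, 16, 4
intermediate, padded_vocab = 5632, 128256
head_dim = n_embd // n_head                        # 64

embedding = padded_vocab * n_embd                  # token embeddings
lm_head = padded_vocab * n_embd                    # untied output projection
attn = 2 * n_embd * n_embd                         # q_proj + o_proj
attn += 2 * n_embd * (n_query_groups * head_dim)   # k_proj + v_proj (GQA)
mlp = 3 * n_embd * intermediate                    # gate, up, down (SwiGLU)
per_layer = attn + mlp + 2 * n_embd                # + two RMSNorm weights

total = embedding + lm_head + n_layer * per_layer + n_embd  # + final norm
print(f"{total / 1e6:.0f}M parameters")            # ~502M
```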