---
license: apache-2.0
datasets:
  - cerebras/SlimPajama-627B
language:
  - en
---

Overview

This is the repo for intermediate checkpoints of my upcoming MicroLlama V2 model, a 500-million-parameter model based on Llama 3.2. The checkpoints are pretrained from scratch on SlimPajama-627B. This project is still a work in progress: I have only trained on 5B tokens so far, and I will keep the training running until I run out of funds.

Some reasons for using these checkpoints:

  • You can use them as a starting point to train your own small language model.
  • More interestingly, you can probe the learning process of these models to understand how LLMs learn to mimic human language.

How to use these checkpoints

These checkpoints are compatible with litgpt with slight modifications (see the Advanced usage section below).

To load them into transformers models, you will need to convert the litgpt pretraining checkpoint into a litgpt inference-only checkpoint (no code modification is required):

# Install litgpt
pip install 'litgpt[all]'

# litgpt pretrain checkpoint to inference checkpoint 
litgpt convert_pretrained_checkpoint <LOCAL_PATH_TO_CHECKPOINT_FROM_THIS_REPO> \
  --output_dir <LOCAL_PATH_TO_INFERENCE_CHECKPOINT>

# litgpt inference checkpoint to HF checkpoints
litgpt convert_from_litgpt <LOCAL_PATH_TO_INFERENCE_CHECKPOINT> <LOCAL_OUTPUT_PATH_TO_CONVERTED_HF_CHECKPOINT>
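
Once converted, the weights can be loaded with transformers. The following is a minimal sketch, not a verified recipe: it assumes convert_from_litgpt wrote a model.pth state dict into the output directory and that you rebuild a matching LlamaConfig by hand from the litgpt config shown in the Advanced usage section below.

    # Minimal loading sketch; all paths and settings below are illustrative.
    import torch
    from transformers import LlamaConfig, LlamaForCausalLM

    # Mirror the litgpt model config from the "Advanced usage" section.
    config = LlamaConfig(
        vocab_size=128256,            # padded_vocab_size in the litgpt config
        hidden_size=1024,             # n_embd
        num_hidden_layers=12,         # n_layer
        num_attention_heads=16,       # n_head
        num_key_value_heads=4,        # n_query_groups
        intermediate_size=5632,
        rope_theta=500000,
        max_position_embeddings=131072,
        rope_scaling=dict(
            rope_type="llama3",
            factor=16.0,
            low_freq_factor=1.0,
            high_freq_factor=4.0,
            original_max_position_embeddings=8192,
        ),
    )
    model = LlamaForCausalLM(config)

    # convert_from_litgpt writes an HF-style state dict into the output directory.
    state_dict = torch.load("<LOCAL_OUTPUT_PATH_TO_CONVERTED_HF_CHECKPOINT>/model.pth")
    # If this reports missing/unexpected keys (e.g. lm_head.weight), adjust
    # tie_word_embeddings in the config to match how the checkpoint was trained.
    model.load_state_dict(state_dict)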

Reference:

  1. litgpt pretrain checkpoint to inference checkpoint https://github.com/Lightning-AI/litgpt/blob/main/tutorials/pretrain_tinyllama.md#export-checkpoints
  2. litgpt inference checkpoint to HF checkpoints https://github.com/Lightning-AI/litgpt/blob/main/tutorials/convert_lit_models.md

Advanced usage - pretraining with litgpt

If you are familiar with litgpt, you can add the following code to litgpt's config.py to continue training the model from these checkpoints.

    # based on Llama-3.2-1B
    dict(
        name="micro-llama-300M-v2",
        hf_config=dict(org="keeeeenw", name="MicroLlamaV2"),
        block_size=131072,  # Stable choice for Llama model training
        # This is the main contributor to the parameter increase from ~300M to ~500M.
        # Note that we cannot change this number because the Llama 3
        # tokenizer requires this vocab size.
        vocab_size=128000,
        padded_vocab_size=128256,
        n_layer=12,
        n_embd=1024,
        n_head=16,
        n_query_groups=4,
        rotary_percentage=1.0,
        parallel_residual=False,
        bias=False,
        norm_class_name="RMSNorm",
        mlp_class_name="LLaMAMLP",
        intermediate_size=5632,
        rope_base=500000,  # Scaling for long sequence support
        # RoPE adjustments for block size of 131072
        rope_adjustments=dict(
            factor=16.0,  # Matches block_size=131072
            low_freq_factor=1.0,
            high_freq_factor=4.0,
            original_max_seq_len=8192  # Max sequence length before RoPE scaling
        )
    ),
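
After adding the entry, a quick way to confirm litgpt picks up the new name (optional, just a sanity check):

    # Sanity check that the new config entry resolves by name.
    from litgpt.config import Config

    cfg = Config.from_name("micro-llama-300M-v2")
    print(cfg.n_layer, cfg.n_embd, cfg.padded_vocab_size)  # expect 12 1024 128256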

You will need to preprocess your data with the meta-llama/Llama-3.2-1B tokenizer, similar to prepare-the-tinyllama-1t-token-dataset, which uses the Llama 2 tokenizer.

Assuming you have litgpt installed already:

git clone https://huggingface.co/datasets/cerebras/SlimPajama-627B data

litgpt download meta-llama/Llama-3.2-1B \
   --access_token your_hf_token \
   --tokenizer_only true

python litgpt/data/prepare_slimpajama.py \
  --input_dir data/slimpajama-raw/train \
  --output_dir data/slimpajama/train \
  --tokenizer_path checkpoints/meta-llama/Llama-3.2-1B

python litgpt/data/prepare_slimpajama.py \
  --input_dir data/slimpajama-raw/validation \
  --output_dir data/slimpajama/val \
  --tokenizer_path checkpoints/meta-llama/Llama-3.2-1B

Please note that this data processing step runs on CPU only and will take a long time if you don't have a CPU with 96+ cores. I tried to share the converted data as an HF dataset, but HF does not support having too many files in the same directory. I will figure out how to distribute the converted dataset later.
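
You can also do a quick check that the downloaded tokenizer loads and roughly matches the vocab size used in the config above. A small sketch using litgpt's Tokenizer wrapper:

    # Sanity-check the downloaded Llama 3.2 tokenizer.
    from pathlib import Path
    from litgpt.tokenizer import Tokenizer

    tok = Tokenizer(Path("checkpoints/meta-llama/Llama-3.2-1B"))
    print(tok.vocab_size)                     # should be ~128k (padded to 128256 in the model config)
    print(tok.encode("Hello MicroLlama V2"))  # simple smoke test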

Finally, you can use my config to start training: https://huggingface.co/keeeeenw/MicroLlama2-checkpoints/blob/main/microllama_v2.yaml

Note: the config has 300M in the model name, but the model is actually ~500M parameters due to the vocab size increase from Llama 2 to Llama 3:

litgpt pretrain \
  --config microllama_v2.yaml \
  --resume <PATH_TO_CHECKPOINT>
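
Once you have a converted inference checkpoint (see "How to use these checkpoints" above), you can sample from it through litgpt's Python API. A minimal sketch; the prompt and placeholder path are illustrative:

    # Load a converted inference checkpoint and generate a short sample.
    from litgpt import LLM

    llm = LLM.load("<LOCAL_PATH_TO_INFERENCE_CHECKPOINT>")
    print(llm.generate("Once upon a time", max_new_tokens=50))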