What to put in CHECKPOINT_PATH?
When specifying my checkpoint as meditron, what should I put in the associated directory, as defined and substituted here:
CHECKPOINTS = {
("pmc", 7): "/pure-mlo-scratch/alhernan/megatron-data/checkpoints/llamaPMC-7b-tp4-pp1",
("baseline", 7): "/pure-mlo-scratch/alhernan/megatron-data/checkpoints/llama2-7b-tp4-pp1",
("baseline", 70): "/pure-mlo-scratch/alhernan/megatron-data/checkpoints/llama2-70b-tp8-pp8",
("meditron", 7): "/pure-mlo-scratch/trial-runs/meditron-7b/checkpoints/llama2-7b-tp4-pp1",
("meditron", 70): "/pure-mlo-scratch/trial-runs/meditron-70b/checkpoints/llama2-70b-tp8-pp8"
}
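My run below passes --checkpoint=meditron --size=7, so I believe the path that ends up being used is the ("meditron", 7) entry. A hypothetical sketch of the lookup (the exact variable names inside sft.py are my guess, not taken from the script):

# Hypothetical sketch: resolve the checkpoint directory from the table above.
checkpoint_path = CHECKPOINTS[("meditron", 7)]
# -> "/pure-mlo-scratch/trial-runs/meditron-7b/checkpoints/llama2-7b-tp4-pp1"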
For context, here are the errors I'm running into, which I'm guessing are related to my checkpoint format. Everything appears to run correctly, without errors, up until this point:
Done! Now finalizing.
Status path: /workspace/summarization/data/scratch/alhernan/megatron-data/checkpoints/instructed/llama-2-7b-tp4-pp1-meditron-summarization-seq4096/.status.txt
Trying to infer the number of documents in the dataset
Settings:
RANK=0
ADDR=localhost
N_NODES=1
DATA_ARGS=--train_data_path /workspace/data/scratch/zechen/meditron/benchmarks/ft_preprocessed/tokenized/summarization/summarization --valid_data_path /workspace/data/scratch/zechen/meditron/benchmarks/ft_preprocessed/tokenized/summarization-val/summarization-val
CHECKPOINT_PATH=/workspace/data/scratch/trial-runs/meditron-7b/checkpoints/llama2-7b-tp4-pp1
TRAINED_PATH=/workspace/data/scratch/alhernan/megatron-data/checkpoints/instructed/llama-2-7b-tp4-pp1-meditron-summarization-seq4096
MODEL=llama2
TP=4
PP=1
MICRO_BATCH=1
GLOBAL_BATCH=64
INSTRUCT=1
COMMON_ARGS=--use_flash_attn --no_bias_gelu_fusion --seq_length 4096 --max_position_embeddings 4096 --log_interval 1 --save_interval 200 --eval_interval 100 --eval_iters 10 --hidden_dropout 0.0 --position_embedding_type rotary --no_bias_dropout_fusion --use_checkpoint_args --attention_dropout 0.0 --adam_beta1 0.9 --adam_beta2 0.95 --adam_eps 1e-5 --lr_decay_style cosine --lr_warmup_fraction 0.1 --lr 2e-5 --min_lr 2e-6 --weight_decay 0.1 --sequence_parallel --recompute_granularity selective --log_timers_to_tensorboard --scalar_loss_mask=0.0 --rope_scaling_factor 1.0 --variable_seq_lengths --data_type instruction --metrics all --finetune --train_iters 1000
EXTRA_ARGS=--vocab_file=/workspace/data/scratch/llama/tokenizer.model --use_rms_norm --glu_activation swiglu --no_tie_embed_logits --vocab_extra_ids_list [bib_ref],[/bib_ref],[fig_ref],[/fig_ref],[bib],[/bib],[fig],[/fig],[table],[/table],[formula],[/formula],<|im_start|>,<|im_end|> --vocab_file=/workspace/data/scratch/llama2/Llama-2-7b-hf/tokenizer.model --layernorm_epsilon 1e-5
Checkpoint not found to provide arguments, using provided arguments.
Traceback (most recent call last):
File "/workspace/Megatron-LLM/finetune.py", line 259, in <module>
initialize_megatron(extra_args, args_defaults)
File "/workspace/data/Megatron-LLM/megatron/initialize.py", line 45, in initialize_megatron
megatron.arguments.validate_args(args, args_defaults)
File "/workspace/data/Megatron-LLM/megatron/arguments.py", line 221, in validate_args
assert args.encoder_num_layers is not None, \
AssertionError: either num_layers or encoder_num_layers should be specified
My initial run parameters:
!python `pwd`/data/meditron/finetuning/sft.py \
--checkpoint=meditron \
--size=7 \
--run_name=summarization \
--data data/train.jsonl \
--val data/test.jsonl \
--micro_batch=1 \
--nodes=1 \
--save_interval=200 \
--pp=1 \
--seq 4096
I was able to track down my issue. I was pointing CHECKPOINT_PATH at HuggingFace-format weights, and it sounds like I need the Megatron-specific version of the weights instead. I think that also explains the assertion: since COMMON_ARGS passes --use_checkpoint_args, Megatron tries to read the architecture (num_layers, etc.) from the checkpoint metadata; the "Checkpoint not found to provide arguments" line shows it fell back to the command-line arguments, which never set --num_layers, hence the failed assertion. Documentation is here: https://epfllm.github.io/Megatron-LLM/guide/weights_conversion.html. I haven't gone through these steps yet, but this is almost certainly my issue.
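In case it helps anyone else, the directory contents give it away: a Megatron-style checkpoint directory has a latest_checkpointed_iteration.txt pointer file plus per-rank iter_*/mp_rank_*/ shards, while an HF directory has config.json plus .bin/.safetensors weight files. A rough sanity check I ran (the exact Megatron layout can differ between versions, so treat this as an approximation):

import glob
import os

ckpt = "/workspace/data/scratch/trial-runs/meditron-7b/checkpoints/llama2-7b-tp4-pp1"

# Megatron-style checkpoints keep a pointer file plus per-rank shard directories.
looks_megatron = os.path.isfile(os.path.join(ckpt, "latest_checkpointed_iteration.txt"))

# HuggingFace checkpoints instead ship config.json plus .bin / .safetensors weight shards.
looks_hf = os.path.isfile(os.path.join(ckpt, "config.json")) and bool(
    glob.glob(os.path.join(ckpt, "*.bin")) + glob.glob(os.path.join(ckpt, "*.safetensors"))
)

print("looks like a Megatron checkpoint:", looks_megatron)
print("looks like a HuggingFace checkpoint:", looks_hf)

In my case the directory only contained the HF-style files, which matches the "Checkpoint not found to provide arguments" message.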
Hi, checking in to see if anyone has an update on this issue. @zero1zero did you manage to solve it? I managed to convert the weights, but it errored out without any descriptive error message...
I did eventually solve it with the weights conversion I linked above.
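For anyone who lands here later: the conversion is done with weights_conversion/hf_to_megatron.py from the Megatron-LLM repo, per the guide linked above. The invocation is roughly along these lines (I'm writing the flags from memory of the guide, so double-check them against the docs for your version):

python /workspace/Megatron-LLM/weights_conversion/hf_to_megatron.py llama2 --size=7 \
    --out=<directory for the converted Megatron weights> \
    --cache-dir=<directory with the HF Llama-2-7b-hf weights>

The guide then walks through sharding the converted weights to the tensor/pipeline parallel layout you train with (TP=4, PP=1 in my case), and that sharded directory is what CHECKPOINT_PATH should point at.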