Discrepancy in LLaMA-3.2-3B-Instruct Benchmark Results
Hi,
I am observing discrepancies between the results I obtain when evaluating the LLaMA-3.2-3B-Instruct model and the results published on the leaderboard for certain benchmarks.
I used the following command as specified in the Reproducibility section:
accelerate launch -m lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-3.2-3B-Instruct,tokenizer=meta-llama/Llama-3.2-3B-Instruct,dtype=float16 \
--tasks leaderboard \
--batch_size auto \
--output_path results \
--apply_chat_template \
--fewshot_as_multiturn
For GPQA (main), for example, I am getting 27% accuracy, while the reported score is 3.80%. I am attaching a screenshot of the results as reported in WANDB.
Could you please clarify if there are any additional configurations or steps required to replicate the leaderboard results accurately?
Thank you!
Hi!
You should compare your output with the raw scores (you can change the display in the table options); the 3.8% figure is the normalised score.
You can also check our FAQ on normalisation, it could help :)
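For reference, here is a minimal sketch of how the normalisation works, assuming GPQA is scored as a 4-choice task with a 0.25 random-guess baseline (the FAQ lists the exact baselines per task, and normalise below is just an illustrative helper, not part of lm_eval):

# Leaderboard normalisation (sketch): subtract the random-guess baseline,
# then rescale so that 0 = random guessing and 100 = a perfect score.
def normalise(raw_acc: float, random_baseline: float) -> float:
    return max(raw_acc - random_baseline, 0.0) / (1.0 - random_baseline) * 100

print(normalise(0.27, 0.25))  # ~2.7: a raw accuracy near 27% maps to a low single-digit normalised score

So a raw GPQA accuracy close to the random baseline ends up as a small normalised number like the 3.8% shown on the leaderboard.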