Discrepancy in LLaMA-3.2-3B-Instruct Benchmark Results

#1045
by bkhmsi - opened

Hi,

I am observing discrepancies between the results I obtain when evaluating the LLaMA-3.2-3B-Instruct model and the results published on the leaderboard for certain benchmarks.

I used the following command as specified in the Reproducibility section:

accelerate launch -m lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-3.2-3B-Instruct,tokenizer=meta-llama/Llama-3.2-3B-Instruct,dtype=float16 \
    --tasks leaderboard \
    --batch_size auto \
    --output_path results \
    --apply_chat_template \
    --fewshot_as_multiturn

For GPQA (main), for example, I am getting 27% accuracy, while the reported score is 3.80%. I am attaching screenshots of the results as reported in WANDB.

Could you please clarify if there are any additional configurations or steps required to replicate the leaderboard results accurately?

Thank you!

Open LLM Leaderboard org

Hi!
You should compare your output with the raw scores (you can change the display in the table options); 3.8% is the normalised score.

You can also check our FAQ on normalisation, it could help :)
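For illustration, here is a minimal Python sketch of that normalisation, assuming the rescaling described in the FAQ (raw accuracy mapped between the task's random baseline and a perfect score, clamped at zero). The 0.25 baseline for GPQA's four-option format and the normalise function name are assumptions for this example, not leaderboard code:

def normalise(raw_score: float, random_baseline: float) -> float:
    # Rescale a raw accuracy (0-1) to a 0-100 normalised score,
    # clamping anything at or below the random baseline to 0.
    if raw_score <= random_baseline:
        return 0.0
    return 100 * (raw_score - random_baseline) / (1.0 - random_baseline)

# A raw GPQA accuracy of about 27.85% maps to roughly 3.8 normalised points.
print(normalise(0.2785, 0.25))  # ~3.8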

clefourrier changed discussion status to closed