Discrepancy in LLaMA-3.2-3B-Instruct Benchmark Results
Hi,
I am observing discrepancies between the results I obtain when evaluating the LLaMA-3.2-3B-Instruct model and the results published on the leaderboard for certain benchmarks.
I used the following command as specified in the Reproducibility section:
accelerate launch -m lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-3.2-3B-Instruct,tokenizer=meta-llama/Llama-3.2-3B-Instruct,dtype=float16 \
--tasks leaderboard \
--batch_size auto \
--output_path results \
--apply_chat_template \
--fewshot_as_multiturn
For GPQA (main), for example, I am getting 27% accuracy, while the reported score is 3.80%. I am attaching a screenshot of the results as reported in WANDB.
Could you please clarify if there are any additional configurations or steps required to replicate the leaderboard results accurately?
Thank you!
Hi!
You should compare your output with the raw scores (you can change the display in the table options); the 3.8% figure is the normalised score.
You can also check our FAQ on normalisation, it could help :)
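For reference, here is a minimal sketch of how the normalisation works, assuming GPQA is scored as a 4-choice task with a 0.25 random-guess baseline (the FAQ lists the exact baselines per task, and normalise below is just an illustrative helper, not part of lm_eval):

# Leaderboard normalisation (sketch): subtract the random-guess baseline,
# then rescale so that 0 = random guessing and 100 = a perfect score.
def normalise(raw_acc: float, random_baseline: float) -> float:
    return max(raw_acc - random_baseline, 0.0) / (1.0 - random_baseline) * 100

print(normalise(0.27, 0.25))  # ~2.7: a raw accuracy near 27% maps to a low single-digit normalised score

So a raw GPQA accuracy close to the random baseline ends up as a small normalised number like the 3.8% shown on the leaderboard.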