Replicating results
Hi there - I'm trying to replicate the results you report for a handful of open-source models with 10B parameters or fewer (e.g., Llama 3.2, Mistral-7B, others). My hardware setup is a single A100.
A full 5-shot evaluation currently takes ~45 minutes. I'm using vLLM and mostly running your script as-is, with a couple of modifications.
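For reference, here's a minimal sketch of how I'm loading the model with vLLM. The model id, generation settings, and the prompt-building helper are placeholders rather than my exact script; the commented-out quantization line is the kind of change I'm asking about in (A) below.

```python
# Minimal sketch of my vLLM setup (placeholders, not my exact script).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.2-3B-Instruct",  # placeholder; any <=10B model from the list
    dtype="bfloat16",
    gpu_memory_utilization=0.90,  # single A100
    max_model_len=4096,
    # quantization="awq",  # the kind of speed-up I'm asking about in (A)
)

# Greedy decoding for the 5-shot eval prompts.
sampling = SamplingParams(temperature=0.0, max_tokens=256)

# build_five_shot_prompts() is a hypothetical helper standing in for the
# prompt construction in my modified copy of your script.
prompts = build_five_shot_prompts()
outputs = llm.generate(prompts, sampling)
for out in outputs:
    print(out.outputs[0].text)
```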
A couple of questions:
(A) Your paper reports that evaluating a 7B-parameter model takes 20-30 minutes on a single A100. Are there specific modifications or tricks to speed up the evals beyond using vLLM, i.e., changes to your script as it currently exists? Quantization is the obvious candidate; is there anything else?
(B) When you run your own eval scripts for larger models on the leaderboard (e.g., Mistral-Large-Instruct, at 123B parameters), what hardware do you use? More generally, I'd love to understand the hardware setup you use for generating your leaderboard.
Thanks very much!