Replicating results

#4
by rohansampath - opened

Hi there - I'm trying to replicate the results you report for a handful of open-source models with 10B params or fewer (e.g., Llama 3.2, Mistral-7B). My hardware setup is a single A100.

It currently takes ~45 minutes to run a full 5-shot evaluation. I'm using vLLM, and mostly running your script with a couple of modifications.
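For reference, my setup is roughly along these lines (the model name and flag values here are illustrative, not my exact script):

```shell
# Illustrative vLLM launch for a ~7B model on a single A100.
# --enable-prefix-caching reuses the KV cache for the shared 5-shot prefix;
# --gpu-memory-utilization and --max-num-seqs trade memory for batch throughput.
vllm serve mistralai/Mistral-7B-Instruct-v0.3 \
  --dtype bfloat16 \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 256
```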

A couple of questions:
(A) You report in your paper that a 7B-param model takes 20-30 minutes on a single A100. Any ideas on specific modifications or tricks to make the evals faster, apart from using vLLM - i.e., any changes to your script as it currently exists? Quantization, I assume; anything else?
(B) When you run your own eval scripts for larger models on your leaderboard (e.g., Mistral-Large-Instruct, which is 123B params), what hardware do you use? In general, I'd love to understand the hardware setup behind generating the leaderboard.

Thanks very much!
