Aligning MMLU questions across models

#1142
by jerome-white

I'm interested in looking at MMLU results across models at the question level. That is, instead of looking at aggregate MMLU scores, I want to know how models do on particular questions. To facilitate this exploration I have downloaded raw MMLU output from the Hub for several of the models on the leaderboard. I'm now unsure how to align questions across these model submission files.
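
For reference, this is roughly how I'm fetching a model's detail files. The repo id below is a placeholder, and I may be glossing over the exact file layout inside the dataset:

```python
import glob
import json

from huggingface_hub import snapshot_download

# Placeholder repo id -- substitute the details dataset for the model of interest.
local_dir = snapshot_download(
    repo_id="open-llm-leaderboard/details_ORG__MODEL",
    repo_type="dataset",
)

# Collect every JSON detail file in the snapshot (layout assumed, not verified).
records = []
for path in glob.glob(f"{local_dir}/**/*.json", recursive=True):
    with open(path) as fp:
        records.append(json.load(fp))
```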

I've noticed that MMLU output follows a particular format, seemingly defined here in the Eleuther evaluation harness code. Based on that, I've made certain assumptions about what constitutes question similarity, notably that doc_hash would be consistent -- corresponding to the same question -- across submission files:

Q1: Is that true?
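
To make that assumption concrete, this is the kind of check I'd like to be valid (mmlu_records here is a hypothetical helper standing in for however the detail files above get parsed into per-question dicts, each with a doc_hash field):

```python
def doc_hashes(records):
    """Set of doc_hash values across a model's MMLU detail records."""
    return {record["doc_hash"] for record in records}

# mmlu_records() is hypothetical -- it stands in for however the detail
# files get parsed into a list of per-question dicts.
hashes_a = doc_hashes(mmlu_records("model-a"))
hashes_b = doc_hashes(mmlu_records("model-b"))

# If doc_hash is consistent, the overlap should cover (nearly) all questions.
print(len(hashes_a & hashes_b), "shared of", len(hashes_a | hashes_b), "total")
```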

Part of what makes me wonder is that I notice a lot of missing data when "pivoting" models against questions:

Q2: Why might that be?

(By a pivot I mean putting the data into a table where models are rows, questions are columns, and question scores are cell values.)
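
Concretely, the pivot looks something like this, assuming a long-format DataFrame with one row per (model, question) and columns model, doc_hash, and acc for the per-question score:

```python
import pandas as pd

# df: one row per (model, question), with columns "model", "doc_hash", "acc".
pivot = df.pivot_table(index="model", columns="doc_hash", values="acc")

# The NaNs in this table are the missing data I'm asking about.
print(pivot.isna().sum().sum(), "missing cells out of", pivot.size)
```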

I have the feeling it has something to do with either models not completing the benchmark, or with a mix of different benchmark versions across the Hub datasets. Assuming it's the latter:

Q3: What's the best way to combine data such that I'm looking at consistent MMLU results?
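
My current workaround, which may well be naive, is to restrict the pivot above to the questions every model has a score for:

```python
# Keep only the question columns with no missing model scores.
consistent = pivot.dropna(axis="columns", how="any")
```

That obviously throws data away, hence the question about the right way to combine things.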
