Is the evaluation setting consistent with the setup described in the paper?

#2 · opened by liujin99

I noticed that the scores on the leaderboard are slightly different from those in the paper, so I have a few questions:

  1. Is the evaluation on the leaderboard still aligned with the settings described in the paper? For example, does the mmlu_robustness benchmark use a subset of 500 samples, as mentioned in the paper? (A sketch of what I mean by that is below the list.)
  2. Are some of the scores on the leaderboard still derived from official release evaluations, similar to those marked with ∗ in the paper?
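For reference, this is a minimal sketch of what I assume "a subset of 500 samples" means in point 1, i.e. a deterministic draw with a fixed seed. It is purely illustrative and not taken from the leaderboard or paper code; the function name, seed value, and dummy data are my own assumptions.

```python
import random

def subsample(examples, n=500, seed=42):
    """Return a reproducible n-example subset of a benchmark split."""
    rng = random.Random(seed)  # fixed seed -> same subset on every run
    if len(examples) <= n:
        return list(examples)
    return rng.sample(list(examples), n)

# Example usage with dummy data standing in for the mmlu_robustness split:
full_split = [{"question": f"q{i}"} for i in range(14042)]
subset = subsample(full_split)
print(len(subset))  # 500
```

If the leaderboard instead evaluates on the full split (or draws the subset differently), that alone could explain small differences from the paper's numbers.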

Thanks in advance for your clarification!

liujin99 changed discussion title from "Is the evaluation of certain benchmarks conducted in debug mode?" to "Is the evaluation setting consistent with the setup described in the paper?"
