Is the evaluation setting consistent with the setup described in the paper?

#2 · opened by liujin99

I noticed that the scores on the leaderboard are slightly different from those in the paper, so I have a few questions:

  1. Is the evaluation on the leaderboard still aligned with the settings described in the paper? For example, does the mmlu_robustness benchmark use a subset of 500 samples, as mentioned in the paper? (A sketch of what I mean by that is below the list.)
  2. Are some of the scores on the leaderboard still derived from official release evaluations, similar to those marked with ∗ in the paper?
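For reference, this is a minimal sketch of what I assume "a subset of 500 samples" means in point 1, i.e. a deterministic draw with a fixed seed. It is purely illustrative and not taken from the leaderboard or paper code; the function name, seed value, and dummy data are my own assumptions.

```python
import random

def subsample(examples, n=500, seed=42):
    """Return a reproducible n-example subset of a benchmark split."""
    rng = random.Random(seed)  # fixed seed -> same subset on every run
    if len(examples) <= n:
        return list(examples)
    return rng.sample(list(examples), n)

# Example usage with dummy data standing in for the mmlu_robustness split:
full_split = [{"question": f"q{i}"} for i in range(14042)]
subset = subsample(full_split)
print(len(subset))  # 500
```

If the leaderboard instead evaluates on the full split (or draws the subset differently), that alone could explain small differences from the paper's numbers.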

Thanks in advance for your clarification!

liujin99 changed discussion title from "Is the evaluation of certain benchmarks conducted in debug mode?" to "Is the evaluation setting consistent with the setup described in the paper?"
