Are benchmark scores normalised to a baseline?

#2
by Abulaphia - opened

In the documentation, I see reference to a baseline model for GSM8k. Are the scores for models on the archived leaderboard raw scores, or are they normalised in some way / compared to a standard benchmark? If the latter, is there somewhere I can find details on the methodology?

Open LLM Leaderboard Archive org

Hi! The scores here are all raw; we only added normalisation in v2 :)
The baseline scores (the row named "baseline") were taken, for each benchmark, from the paper that introduced it.
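
For anyone wondering what the difference looks like in practice, here is a minimal sketch of one common way to normalise a raw score against a baseline. This thread does not describe the exact v2 method, so treat it as an assumption rather than the leaderboard's implementation: the function name `normalize_score`, its parameters, and the choice to clip below-baseline scores to 0 are all hypothetical.

```python
def normalize_score(raw: float, baseline: float, max_score: float = 100.0) -> float:
    """Rescale a raw score so that `baseline` maps to 0 and `max_score` to 100.

    Hypothetical helper for illustration only; not taken from the leaderboard code.
    """
    if max_score <= baseline:
        raise ValueError("max_score must be greater than baseline")
    # Assumption: scores below the baseline are clipped to 0 instead of going negative.
    if raw < baseline:
        return 0.0
    return (raw - baseline) / (max_score - baseline) * 100.0


# Example: a 4-choice multiple-choice task, where random guessing gives 25%.
print(normalize_score(62.5, baseline=25.0))  # 50.0
```

On the archived leaderboard no such rescaling is applied, so a raw 62.5 stays 62.5.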

clefourrier changed discussion status to closed