NEW! Open LLM Leaderboard 2023 fall update
We spent A YEAR of GPU time for the biggest update of the Open LLM Leaderboard yet! 🤯
With @SaylorTwift, we added 3 new benchmark metrics from the great EleutherAI harness 💥 and re-ran 2000+ models on them! 🚀
🤔 Why?
Our initial evaluations were multiple-choice Q/A datasets:
- 📚 MMLU, knowledge across many domains
- 📚 👩🔬 ARC, high-school science knowledge
- HellaSwag, choosing the most plausible continuation of a list of actions.
- 📚👻 TruthfulQA, logical fallacies and knowledge bias
So... mostly knowledge and some reasoning.
But we wanted
🔭 model creators to get more information on their models' capabilities
🔎 model users to select models on metrics relevant for them
⚖ leaderboard rankings to be fairer
🤔 How?
We added 3 harder evaluations, on new capabilities!
DROP 💧
Questions on Wikipedia paragraphs. It requires both 1) reading comprehension to extract relevant information and 2) reasoning steps (subtractions, additions, comparisons, counting or sorting, ...) to solve the questions. Many models struggle with it!
Contrary to previous evals, it is generative: the model is not just looking at suggested choices, but must actually generate its own answers. That makes it more relevant for studying the actual reasoning capabilities of models in unconstrained setups.
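To give an intuition for how such generative answers can be scored, here is a minimal sketch of a token-level F1 comparison between a generated answer and a gold answer. It is only an illustration: the actual DROP metric in the harness also handles numbers, multi-span answers and answer normalization.

```python
# Minimal sketch of token-level F1 between a generated answer and a gold answer.
# Simplified for illustration; the real DROP metric also handles numbers,
# multi-span answers and answer normalization.
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# The model generates its answer freely; we compare it against the gold answer(s).
generated = "around 10 yards"
gold_answers = ["10 yards", "10"]
print(max(token_f1(generated, g) for g in gold_answers))  # best match over the golds
```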
GSM8K 🧮
Diverse grade-school math problems. Math was a highly expected and requested new capability to study, and with reason: current models have a lot of room to improve on math, and it's a very exciting research direction!
WinoGrande 🍷
Multiple-choice adversarial Winograd completion dataset.
Each example contains a blank that must be filled with one of two words; the model must select the most relevant one, since the other word drastically changes the meaning of the sentence.
It's a development of the historically significant Winograd Schema Challenge, for a long time one of the most difficult benchmarks ever!
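For intuition, here is a minimal sketch of how a two-choice completion task like WinoGrande can be scored by comparing the likelihood a causal language model assigns to the sentence with each candidate word filled in. The model name, example sentence and scoring details are illustrative only; the harness's own setup differs.

```python
# Sketch: score a WinoGrande-style item by comparing the (approximate) log-likelihood
# the model assigns to the sentence with each candidate word filled into the blank.
# "gpt2" and the example sentence are placeholders for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def sentence_logprob(text: str) -> float:
    """Approximate total log-likelihood of a sentence under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean negative log-likelihood per predicted token
    return -out.loss.item() * ids.shape[1]

sentence = "The trophy didn't fit in the suitcase because the _ was too big."
options = ["trophy", "suitcase"]
scores = [sentence_logprob(sentence.replace("_", option)) for option in options]
prediction = options[scores.index(max(scores))]
print(prediction)  # the intended answer is "trophy"
```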
🤔 What about the rankings?
- 💪 Pretrained model rankings barely changed! → Good models stay good, no matter the evals 🏅
- 🌀 Fine-tuned models saw many rerankings above 13B, while IFT/RL models did not change that much, apart from Hermes models (↓) & Beluga/Trurl models (↑) → We hope this will help show which fine-tuning strategies are best across tasks!
🤔 Diving very deep into these benchmarks 👀
We've found interesting implementation questions (reminiscent of our blog post on MMLU: https://huggingface.co/blog/evaluating-mmlu-leaderboard).
Feel free to read more on it and join the discussion at https://github.com/EleutherAI/lm-evaluation-harness/issues/978 or here!
🤗 That's it!
We hope you find these new evals interesting, and learn more about your (favorite) models along the way!
Thank you very much for following the leaderboard. We'll keep on upgrading it so it stays a useful resource for the community, & further help model progress 🚀
Many special thanks to:
Awesome ❤️
Thank you all for doing this, and keeping us clued in on model performance 👍
Is there at least one evaluation that checks the model's language level across various languages?
I think it's very important, especially for us in different countries.
Thanks for this improved leaderboard!
@Ostixe360
Hi!
We are planning to work on multilingual leaderboards with some partners in the coming months, but this is only at a very early stage.
Being French myself, I 100% agree that we need to evaluate models on more than "just English" 😅
However, in the meantime, you can look at the Upstage leaderboard for Korean capabilities, and the Open Compass one for Chinese capabilities.
Regardless of evaluation results, it is currently pretty hard to find models in your language or based on specific criteria like maximum VRAM size. I personally also tried the full-text model search on Hugging Face, but it seems quite inefficient unfortunately. While the original Hugging Face leaderboard does not allow you to filter by language, you can filter by it on this website: https://llm.extractum.io/list. Just left-click on the language column. It also queries the Hugging Face leaderboard average model score for most models. Of course, those scores might be skewed by the English-only evaluations.
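If you prefer to stay in Python, a rough sketch along these lines can also narrow models down by language, assuming the language code is exposed as a model tag (which it is for models that declare a language in their card metadata):

```python
# Rough sketch: list Hub models tagged with a given language code.
# Language metadata declared in a model card is exposed as a tag (e.g. "fr"),
# so the generic `filter` argument of list_models can be used to match it.
from huggingface_hub import HfApi

api = HfApi()
french_models = api.list_models(
    filter="fr",        # language tag to match
    sort="downloads",   # most downloaded first
    direction=-1,
    limit=20,
)
for model in french_models:
    print(model.modelId)
```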
Thank you both for your responses.
Does the eval use the custom instruction format best suited to each model? For some models, such as Yi, using their custom instruction format usually produces way better results.
I notice some TruthfulQA scores are missing.
Just sort by worst scores to show them.
@HenkPoley Thank you for reporting! It was a display problem, should be fixed! :)
Mistral Dolphin 7B seems to be missing (the 2.0, 2.1, etc. versions), as if it was deleted or made private or something. Who makes those kinds of decisions? I sometimes notice such models reappear again soon afterwards. Are they just being re-tested, or was there a possible flaw in the benchmark? Thanks.
@Goldenblood56 they should still be appearing when you select the "Show gated/deleted/..." checkbox - I'm investigating why. If they are not, please open a dedicated issue so we can keep track.
If the ability of a model to be used as an agent can be judged through a metric, then please add that metric as well: how good models are at choosing the correct tool for the task, parsing their own output, and adjusting their output to be sent as input to the next step.
Great idea!
@clefourrier
I've noticed an issue in the DROP implementation by EleutherAI at commit hash b281b0921b636bc36ad05c0b0b0763bd6dd43463. By default, all models continue generating text until the first "." (see this line), so without any filtering, the F1 metrics are computed on overly lengthy generated texts. For example, for the first dataset example, Mistral 7B generates `10\n\nPassage: The 2006-07 season was the 10th season for the New Orleans Hornets in the National Basketball Association`. Considering typical LLM behaviors, we should filter the answer at `Passage:` and calculate the scores using `10` (which is the correct answer, by the way) instead of the entire generated text. Please see a similar filter used for GSM8K in EleutherAI's harness here.
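For concreteness, the kind of post-processing I have in mind looks roughly like this (a simplified illustration, not the harness's actual filter code):

```python
# Sketch of the suggested answer filter: cut the generation at the point where the
# model starts a new "Passage:" block, and score only the remaining answer text.
import re

def filter_drop_answer(generation: str) -> str:
    """Keep only the text before a continuation of the few-shot 'Passage:' format."""
    answer = re.split(r"\n\s*Passage:", generation)[0]
    return answer.strip()

raw = "10\n\nPassage: The 2006-07 season was the 10th season for the New Orleans Hornets"
print(filter_drop_answer(raw))  # -> "10"
```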
@binhtang
(and cc @Phil337 since you've been concerned about this too)
Thank you for your report!
We've spent the last weeks investigating the DROP scores in more detail, and found concerning things in the scoring - what you just highlighted is not the only problem in the DROP metrics.
We'll publish a blog post about it very soon, and update the leaderboard accordingly.
If you visit https://llm.extractum.io/list/?lbonly, you'll find a comprehensive list of our top models, along with a whole host of other parameters and metrics. Clicking on a specific model allows you to delve deeper into its internals and parameters, including its performance on other benchmarks.
@gregzem
Nice visualization! What do the icons next to the model names correspond to?