It's been a wild ride, folks :) (end of the Open LLM Leaderboard)
Dear community,
For the last 2 years, we've evaluated over 13K models with the Open LLM Leaderboard, using our research cluster to provide open, fair and reproducible evaluations to all.
However, all good things come to an end: the leaderboard is officially retiring!
Why?
As model capabilities change (hello reasoning and LM assistants), benchmarks need to follow!
The leaderboard is slowly becoming obsolete; we feel it could encourage people to hill-climb in directions that are no longer relevant to the field, so we'd like to stop it before that happens :)
Looking back
We're proud to have played a pivotal role in driving a revolution in model evaluation!
From the very beginning, we witnessed firsthand how the leaderboard put evaluation at the center of the debates in the community, sparking fascinating discussions on Reddit and on the hub, empowering both model creators and evaluators to engage with the community in unprecedented ways.
Now, we're happy to see that this legacy is assured: over 200 community-led leaderboards are available on Hugging Face, and evaluation is a more active field than ever! We hope some of these leaderboards will overtake this one and become, in turn, the most liked spaces on the hub.
Thanks to you all!
We'd like to thank all the people who followed us in this adventure and contributed (by interacting, giving a hand and of course submitting models).
Through this project, we learned a lot about evaluations, and we're happy it's been so impactful for all.
Cheers :)
PS: You'll be able to find out about the team's new work in our Open Evals hub org. Stay tuned!
Thank you!
Thanks to everyone who contributed! You can be super proud of all the impact the leaderboard has had on the field & on Hugging Face; it will live on thanks to the hundreds of community leaderboards on the hub!
Thanks for everything! It's been an amazing journey.
This was an incredible resource when model finetunes and merges were first ramping up; I used to check back here daily. It's sad to see it go, but it's understandable with the direction the space is moving in. These metrics have definitely lost their purpose in the face of new modalities and long-CoT models.
Thanks for your effort!!
Amidst the constant release of new models every day, the Open LLM Leaderboard served as a reliable compass. Great work to everyone involved!
Have you considered starting up a leaderboard for just edge models? Maybe capped at 5B parameters? Users throttled to one model per week? Finding appropriate benchmarks might be a challenge, but it would be another way to keep HF open benchmarks relevant.
Nowadays any LLM claiming to be the SOTA will present results with moderate to insane benchmark / sampling method / result cherry-picking.
I've been using this leaderboard for about a year now as a non-manipulable independent third party evaluation / cross-validation / sanity check tool.
The leaderboard is not perfect as shown in https://github.com/huggingface/Math-Verify (see the snippet below), but it really helps.
Sad to see it come to an end. I hope it'll come back in another form.
Thanks for your work.
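For context, Math-Verify is the answer-equivalence checker referenced above. Here is a minimal sketch of how it compares a gold answer with a model answer, assuming the published math_verify package and its parse/verify API:

```python
# Minimal sketch of Math-Verify usage (assumes the math_verify package
# from https://github.com/huggingface/Math-Verify exposes parse/verify).
from math_verify import parse, verify

# Parse a gold answer and a model answer, both expressed in LaTeX.
gold = parse("$\\frac{1}{2}$")
answer = parse("$0.5$")

# verify() returns True when the two expressions are mathematically
# equivalent, even if they are written differently.
print(verify(gold, answer))
```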
Thank you.
A really big thank you for all you have done.
I have a nice graph that back-reconstructs the best average result among all tested HF models published up to that time.
I had hoped I'd get to watch this line climb to 60, 70, 80... in the years to come, but sadly that's not going to happen. It is true, nevertheless, that a single figure - useful as it may be - cannot represent LLM capabilities in a meaningful way.
Great job everyone!
What will be the replacement?
Your work with the leaderboard has been amazing, thanks for all of your efforts! 👏
For those seeking a replacement, EuroEval has been going strong for 3 years now and currently features 10 languages with more underway, with mostly gold standard datasets. Check out https://euroeval.com 🙂
oh, I'm missing it already. lol
Many cheers for your amazing work with the board! ✌️🖖
i will miss open llm leaderboard 😢
> Have you considered starting up a leaderboard for just edge models? Maybe capped at 5B parameters? Users throttled to one model per week? Finding appropriate benchmarks might be a challenge, but it would be another way to keep HF open benchmarks relevant.
Well, there's this one:
https://huggingface.co/spaces/nyunai/edge-llm-leaderboard
Although the cap is higher than your suggestion, and it doesn't seem to have as many benchmarks as this one did.
And I agree: with the launch of so many smaller models, I too used to check the Open LLM Leaderboard for them.
Just in case, UGI has a new methodology and they evaluate smaller models
https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard
And while it's not exactly equivalent (they evaluate performance on exams in Portuguese), I've found many interesting models here too:
https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard
I really appreciate the service this has provided to the community. I'd love to see more specialized successors, ones that relate to roles within teams!
@grimjim
@Blazgo
@sometimesanotion
You can check out the follow up community leaderboards through our search space :)
https://huggingface.co/spaces/OpenEvals/find-a-leaderboard
For edge devices, the relevant category would be performance; for roles within teams, you can look at the domain categories. As for a drop-in replacement, there isn't one per se, but depending on the specific capability you want to study, you'll find leaderboards for math, code, different aspects of language, etc.!
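If you'd rather browse programmatically than through the search space, here is a minimal sketch using huggingface_hub's list_spaces. The free-text "leaderboard" query and client-side ranking are illustrative choices; fields such as likes may not always be populated in the API response:

```python
# Minimal sketch: find community leaderboard Spaces on the Hub.
# Assumes huggingface_hub's list_spaces() with a free-text `search` argument;
# `likes` may be None depending on what the API returns, so default it to 0.
from huggingface_hub import list_spaces

spaces = list(list_spaces(search="leaderboard", limit=100))

# Rank the results by likes, most liked first, and show the top ten.
for space in sorted(spaces, key=lambda s: s.likes or 0, reverse=True)[:10]:
    print(f"{space.id}: {space.likes or 0} likes")
```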