It's been a wild ride, folks :) (end of the Open LLM Leaderboard)
Dear community,
For the last 2 years, we've evaluated over 13K models with the Open LLM Leaderboard, using our research cluster to provide open, fair and reproducible evaluations to all.
However, all good things come to an end: the leaderboard is officially retiring!
Why?
As model capabilities change (hello reasoning and LM assistants), benchmarks need to follow!
The leaderboard is slowly becoming obsolete; we feel it could encourage people to hill-climb in directions that are no longer relevant to the field, so we'd like to stop it before that happens :)
Looking back
We're proud to have played a pivotal role in driving a revolution in model evaluation!
From the very beginning, we witnessed firsthand how the leaderboard put evaluation at the center of the debates in the community, sparking fascinating discussions on Reddit and on the hub, empowering both model creators and evaluators to engage with the community in unprecedented ways.
Now, we're happy to see that this legacy is assured: over 200 community-led leaderboards are available on Hugging Face, and evaluation is a more active field than ever! We hope some of these leaderboards will overtake this one and become, in turn, the most liked spaces on the hub.
Thanks to you all!
We'd like to thank all the people who followed us in this adventure and contributed (by interacting, giving a hand and of course submitting models).
Through this project, we learned a lot about evaluations, and we're happy it's been so impactful for all.
Cheers :)
PS: You'll be able to find out about the team's new work in our Open Evals hub org. Stay tuned!
Thank you!
Thanks to everyone who contributed! You can be super proud of all the impact the leaderboard has had on the field & on Hugging Face; it will live on thanks to the hundreds of community leaderboards on the hub!
Thanks for everything! It's been an amazing journey.
This was an incredible resource when model finetunes and merges were first ramping up; I used to check back here daily. It's sad to see it go, but it's understandable with the direction the space is moving in. These metrics have definitely lost their purpose in the face of new modalities and long-CoT models.
Thanks for your effort!!
Amidst the constant release of new models every day, the Open LLM Leaderboard served as a reliable compass. Great work to everyone involved!
Have you considered starting up a leaderboard for just edge models? Maybe capped at 5B parameters? Users throttled to one model per week? Finding appropriate benchmarks might be a challenge, but it would be another way to keep HF open benchmarks relevant.
Nowadays any LLM claiming to be the SOTA will present results with moderate to insane benchmark / sampling method / result cherry-picking.
I've been using this leaderboard for about a year now as a non-manipulable independent third party evaluation / cross-validation / sanity check tool.
The leaderboard is not perfect as shown in https://github.com/huggingface/Math-Verify (see the snippet below), but it really helps.
Sad to see it come to an end. I hope it'll come back in another form.
Thanks for your work.
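For context, Math-Verify is the answer-equivalence checker referenced above. Here is a minimal sketch of how it compares a gold answer with a model answer, assuming the published math_verify package and its parse/verify API:

```python
# Minimal sketch of Math-Verify usage (assumes the math_verify package
# from https://github.com/huggingface/Math-Verify exposes parse/verify).
from math_verify import parse, verify

# Parse a gold answer and a model answer, both expressed in LaTeX.
gold = parse("$\\frac{1}{2}$")
answer = parse("$0.5$")

# verify() returns True when the two expressions are mathematically
# equivalent, even if they are written differently.
print(verify(gold, answer))
```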
Thank you.
A really big thank you for all you have done.
I have a nice graph that back-reconstructs the best average result among all tested HF models published up to that time.
I had hoped I'd get to watch this line climb to 60, 70, 80... in the years to come, but sadly that's not going to happen. It is true, nevertheless, that a single figure - useful as it may be - cannot represent LLM capabilities in a meaningful way.
Great job everyone!
What will be the replacement?
Your work with the leaderboard has been amazing, thanks for all of your efforts! 👏
For those seeking a replacement, EuroEval has been going strong for 3 years now and currently features 10 languages with more underway, with mostly gold standard datasets. Check out https://euroeval.com 🙂
oh, I'm missing it already. lol
Many cheers for your amazing work with the board! ✌️🖖
i will miss open llm leaderboard 😢
> Have you considered starting up a leaderboard for just edge models? Maybe capped at 5B parameters? Users throttled to one model per week? Finding appropriate benchmarks might be a challenge, but it would be another way to keep HF open benchmarks relevant.
Well, there's this one:
https://huggingface.co/spaces/nyunai/edge-llm-leaderboard
Although the cap is higher than your suggestion, and it doesn't seem to have as many benchmarks as this one did.
And I agree: with the launch of so many smaller models, I too used to check the Open LLM Leaderboard for them.
Just in case, UGI has a new methodology and they evaluate smaller models
https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard
And while it's not exactly equivalent (they evaluate performance on exams in Portuguese), I've found many interesting models here too:
https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard
I really appreciate the service this has provided to the community. I'd love to see more specialized successors, ones that relate to roles within teams!
@grimjim
@Blazgo
@sometimesanotion
You can check out the follow up community leaderboards through our search space :)
https://huggingface.co/spaces/OpenEvals/find-a-leaderboard
For edge devices, the relevant category would be performance; for roles within teams, you can look at the domain categories. As for a drop-in replacement, there isn't one per se, but depending on the specific capability you want to study, you'll find leaderboards for math, code, different aspects of language, etc.!
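If you'd rather browse programmatically than through the search space, here is a minimal sketch using huggingface_hub's list_spaces. The free-text "leaderboard" query and client-side ranking are illustrative choices; fields such as likes may not always be populated in the API response:

```python
# Minimal sketch: find community leaderboard Spaces on the Hub.
# Assumes huggingface_hub's list_spaces() with a free-text `search` argument;
# `likes` may be None depending on what the API returns, so default it to 0.
from huggingface_hub import list_spaces

spaces = list(list_spaces(search="leaderboard", limit=100))

# Rank the results by likes, most liked first, and show the top ten.
for space in sorted(spaces, key=lambda s: s.likes or 0, reverse=True)[:10]:
    print(f"{space.id}: {space.likes or 0} likes")
```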