If you've ever wondered which LLM is best for powering agents, we've just made a leaderboard that ranks them! Built with @albertvillanova, it evaluates LLMs powering a smolagents CodeAgent on subsets of various benchmarks. ✅
🏆 GPT-4.5 comes out on top, even beating reasoning models like DeepSeek-R1 or o1. And Claude-3.7-Sonnet is a close second!
The leaderboard also lets you display the scores of vanilla LLMs (without any agentic setup) on the same benchmarks, which shows the huge improvements brought by agentic setups. 💪
(Note that results are added manually, so the leaderboard might not always include the latest LLMs.)