AWorld: Dynamic Multi-Agent System with Stable Maneuvering for Robust GAIA Problem Solving Paper • 2508.09889 • Published 6 days ago • 29
Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL Paper • 2508.07976 • Published 8 days ago • 44
A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence Paper • 2507.21046 • Published 21 days ago • 79
SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories? Paper • 2507.12415 • Published Jul 16 • 41
view article Article BigCodeBench: Benchmarking Large Language Models on Solving Practical and Challenging Programming Tasks By terryyz and 8 others • Jun 18, 2024 • 52
NuminaMath Collection Datasets and models for training SOTA math LLMs. See our GitHub for training & inference code: https://github.com/project-numina/aimo-progress-prize • 7 items • Updated Feb 10 • 78
SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines Paper • 2502.14739 • Published Feb 20 • 105
Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities? Paper • 2502.12215 • Published Feb 17 • 16
CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction Paper • 2502.07316 • Published Feb 11 • 50
Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models Paper • 2501.13629 • Published Jan 23 • 49
ProcessBench: Identifying Process Errors in Mathematical Reasoning Paper • 2412.06559 • Published Dec 9, 2024 • 85
AgentInstruct: Toward Generative Teaching with Agentic Flows Paper • 2407.03502 • Published Jul 3, 2024 • 51