""" Task description components for the leaderboard application. """ import streamlit as st from src.utils.config import tasks_info from src.utils.task_mapping import get_display_name, get_original_name def render_task_descriptions(): """ Render the benchmark details section """ # Display the MLRC-BENCH image st.image("Assests/MLRC_Bench_overview.png", use_column_width=True) # Display the MLRC-BENCH information st.markdown(""" # Can Language Agents Solve Machine Learning Research Challenges? 🚀 Introducing [MLRC-BENCH](https://huggingface.co/spaces/launch/MLRC_Bench), a new benchmark suite designed to test the scientific chops of LLM-based agents on real-world machine learning (ML) research problems. --- ## 🤖 What's the Problem? While recent language model (LLM) agents have made impressive strides in reasoning, coding, and even paper writing, current benchmarks fall short in evaluating their ability to generate **novel and effective research ideas**. Most existing efforts either: - Ask agents to write entire research papers, but use **subjective evaluation** (e.g., LLMs or humans judging ideas). - Or evaluate agents on **Kaggle-style tasks**, which rarely require real innovation. Both setups miss the mark when it comes to assessing whether LLM agents can truly **advance the ML research frontier**. --- ## 🧪 Enter MLRC-BENCH **MLRC-BENCH** fills this gap by evaluating agents on **real ML research competitions** hosted at NeurIPS, ECCV, and other top venues. These tasks represent cutting-edge challenges in: - LLM safety - Multimodal perception - Few-shot learning - Machine unlearning - Meta learning - And more! Each task demands novel method design—not just re-implementing existing solutions. ### ✅ What Makes MLRC-BENCH Unique? - **Objective Evaluation**: Agents are scored on real metrics (accuracy, ROUGE, MRR, etc.)—no LLM-as-a-judge handwaving. - **Compute-Constrained**: Tasks come with GPU and runtime limits, simulating real-world resource constraints. - **Tamper-Proof Setup**: Agents can only modify specific parts of the starter code; test data remains hidden. - **Continually Updated**: New competition tasks will be added as ML research progresses. --- ## 📉 What Did We Find? Despite access to top-tier LLMs like GPT-4o, Claude 3.5, and Gemini, **agents struggle**: - The best-performing agent (Gemini under MLAB scaffolding) closes only **9.3% of the performance gap** between a baseline and top human solution. - Providing additional ideas from humans or other agents doesn't consistently help. - LLMs often rate their own ideas as “innovative,” but objective metrics show they underperform. 📊 **Key Insight**: There’s a clear **misalignment between subjective novelty and actual effectiveness**. --- ## 🔬 Under the Hood MLRC-BENCH comes with: - **7 fully prepared tasks** with unified code structure. - **Development & test splits** for fair comparison. - **Metrics for effectiveness, efficiency (runtime), and simplicity (lines of code)**. - A leaderboard showcasing normalized improvements over baselines. > Normalized scores range from 0 (baseline) to 100 (top human performance). Scores < 0 mean agents underperform the baseline! --- ## 🧠Why This Matters MLRC-BENCH is a **stress test for research agents**. It doesn’t just ask “Can LLMs code?”—it asks: > Can LLMs **propose and implement** solutions that outperform known baselines on hard problems? If we want to build autonomous research agents that assist or even collaborate with human scientists, **benchmarks like MLRC-BENCH are essential**. 
---

## 📍 Try It Yourself

Check out the tasks and submit your own agent:

👉 We will open the submission link in the near future. Stay tuned!

Let’s see if your agent can beat the benchmark!
    """)

    st.markdown("""
Click on any task to learn more.