""" Task description components for the leaderboard application. """ import streamlit as st from src.utils.config import tasks_info from src.utils.task_mapping import get_display_name, get_original_name def render_task_descriptions(): """ Render the benchmark details section """ # Display the MLRC-BENCH image st.image("Assests/MLRC_Bench_overview.png", use_column_width=True) # Display the MLRC-BENCH information st.markdown(""" # Can Language Agents Solve Machine Learning Research Challenges? 🚀 Introducing [MLRC-BENCH](https://huggingface.co/spaces/launch/MLRC_Bench), a new benchmark suite designed to test the scientific chops of LLM-based agents on real-world machine learning (ML) research problems. --- ## 🤖 What's the Problem? While recent language model (LLM) agents have made impressive strides in reasoning, coding, and even paper writing, current benchmarks fall short in evaluating their ability to generate **novel and effective research ideas**. Most existing efforts either: - Ask agents to write entire research papers, but use **subjective evaluation** (e.g., LLMs or humans judging ideas). - Or evaluate agents on **Kaggle-style tasks**, which rarely require real innovation. Both setups miss the mark when it comes to assessing whether LLM agents can truly **advance the ML research frontier**. --- ## 🧪 Enter MLRC-BENCH **MLRC-BENCH** fills this gap by evaluating agents on **real ML research competitions** hosted at NeurIPS, ECCV, and other top venues. These tasks represent cutting-edge challenges in: - LLM safety - Multimodal perception - Few-shot learning - Machine unlearning - Meta learning - And more! Each task demands novel method design—not just re-implementing existing solutions. ### ✅ What Makes MLRC-BENCH Unique? - **Objective Evaluation**: Agents are scored on real metrics (accuracy, ROUGE, MRR, etc.)—no LLM-as-a-judge handwaving. - **Compute-Constrained**: Tasks come with GPU and runtime limits, simulating real-world resource constraints. - **Tamper-Proof Setup**: Agents can only modify specific parts of the starter code; test data remains hidden. - **Continually Updated**: New competition tasks will be added as ML research progresses. --- ## 📉 What Did We Find? Despite access to top-tier LLMs like GPT-4o, Claude 3.5, and Gemini, **agents struggle**: - The best-performing agent (Gemini under MLAB scaffolding) closes only **9.3% of the performance gap** between a baseline and top human solution. - Providing additional ideas from humans or other agents doesn't consistently help. - LLMs often rate their own ideas as “innovative,” but objective metrics show they underperform. 📊 **Key Insight**: There’s a clear **misalignment between subjective novelty and actual effectiveness**. --- ## 🔬 Under the Hood MLRC-BENCH comes with: - **7 fully prepared tasks** with unified code structure. - **Development & test splits** for fair comparison. - **Metrics for effectiveness, efficiency (runtime), and simplicity (lines of code)**. - A leaderboard showcasing normalized improvements over baselines. > Normalized scores range from 0 (baseline) to 100 (top human performance). Scores < 0 mean agents underperform the baseline! --- ## 🧠 Why This Matters MLRC-BENCH is a **stress test for research agents**. It doesn’t just ask “Can LLMs code?”—it asks: > Can LLMs **propose and implement** solutions that outperform known baselines on hard problems? If we want to build autonomous research agents that assist or even collaborate with human scientists, **benchmarks like MLRC-BENCH are essential**. 

---

## 📍 Try It Yourself

Check out the tasks and submit your own agent:

👉 We will open the link for submission in the near future. Stay tuned!

Let’s see if your agent can beat the benchmark!
    """)

    st.markdown("""
<h2>🔍 Tasks in the Benchmark</h2>
<p>Click on any task to learn more.</p>
""", unsafe_allow_html=True) # Task links mapping - using original task names original_task_links = { "Backdoor Trigger Recovery": "https://www.llmagentsafetycomp24.com/tracks/#backdoor_model", "Machine Unlearning": "https://unlearning-challenge.github.io/", "Perception Temporal Action Loc": "https://ptchallenge-workshop.github.io", "Product Recommendation": "https://www.aicrowd.com/challenges/amazon-kdd-cup-23-multilingual-recommendation-challenge", "Meta Learning": "https://metalearning.chalearn.org/", "Llm Merging": "https://llm-merging.github.io", "Rainfall Prediction": "https://weather4cast.net/neurips-2023/" } # Update links mapping to use display names as keys task_links = {get_display_name(task): link for task, link in original_task_links.items()} # Create two columns col1, col2 = st.columns(2) # Split tasks between the two columns with better styling task_items = list(tasks_info.items()) mid_point = len(task_items) // 2 with col1: for task, description in task_items[:mid_point]: link = task_links.get(task, "#") st.markdown(f"""
<div style="border: 1px solid #ddd; border-radius: 8px; padding: 12px; margin-bottom: 12px;">
    <a href="{link}" target="_blank"><strong>{task}</strong> 🔗</a>
    <p>{description}</p>
</div>
""", unsafe_allow_html=True) with col2: for task, description in task_items[mid_point:]: link = task_links.get(task, "#") st.markdown(f"""
<div style="border: 1px solid #ddd; border-radius: 8px; padding: 12px; margin-bottom: 12px;">
    <a href="{link}" target="_blank"><strong>{task}</strong> 🔗</a>
    <p>{description}</p>
</div>
""", unsafe_allow_html=True)