"""
Task description components for the leaderboard application.
"""
import streamlit as st
from src.utils.config import tasks_info
from src.utils.task_mapping import get_display_name

def render_task_descriptions():
    """
    Render the benchmark details section
    """
    # Display the MLRC-BENCH image
    st.image("Assests/MLRC_Bench_overview.png", use_column_width=True)
    
    # Display the MLRC-BENCH information
    st.markdown("""

# Can Language Agents Solve Machine Learning Research Challenges?

🚀 Introducing [MLRC-BENCH](https://huggingface.co/spaces/launch/MLRC_Bench), a new benchmark suite designed to test the scientific chops of LLM-based agents on real-world machine learning (ML) research problems.

---

## 🤖 What's the Problem?

While recent large language model (LLM) agents have made impressive strides in reasoning, coding, and even paper writing, current benchmarks fall short in evaluating their ability to generate **novel and effective research ideas**.

Most existing efforts either:
- Ask agents to write entire research papers, but use **subjective evaluation** (e.g., LLMs or humans judging ideas).
- Or evaluate agents on **Kaggle-style tasks**, which rarely require real innovation.

Both setups miss the mark when it comes to assessing whether LLM agents can truly **advance the ML research frontier**.

---

## 🧪 Enter MLRC-BENCH

**MLRC-BENCH** fills this gap by evaluating agents on **real ML research competitions** hosted at NeurIPS, ECCV, and other top venues. These tasks represent cutting-edge challenges in:
- LLM safety
- Multimodal perception
- Few-shot learning
- Machine unlearning
- Meta learning
- And more!

Each task demands novel method design—not just re-implementing existing solutions.

### ✅ What Makes MLRC-BENCH Unique?

- **Objective Evaluation**: Agents are scored on real metrics (accuracy, ROUGE, MRR, etc.)—no LLM-as-a-judge handwaving.
- **Compute-Constrained**: Tasks come with GPU and runtime limits, simulating real-world resource constraints.
- **Tamper-Proof Setup**: Agents can only modify specific parts of the starter code; test data remains hidden.
- **Continually Updated**: New competition tasks will be added as ML research progresses.

---

## 📉 What Did We Find?

Despite access to top-tier LLMs like GPT-4o, Claude 3.5, and Gemini, **agents struggle**:

- The best-performing agent (Gemini under MLAB scaffolding) closes only **9.3% of the performance gap** between the baseline and the top human solution.
- Providing additional ideas from humans or other agents doesn't consistently help.
- LLMs often rate their own ideas as “innovative,” but objective metrics show they underperform.

📊 **Key Insight**: There’s a clear **misalignment between subjective novelty and actual effectiveness**.

---

## 🔬 Under the Hood

MLRC-BENCH comes with:
- **7 fully prepared tasks** with a unified code structure.
- **Development & test splits** for fair comparison.
- **Metrics for effectiveness, efficiency (runtime), and simplicity (lines of code)**.
- A leaderboard showcasing normalized improvements over baselines.

> Normalized scores range from 0 (baseline) to 100 (top human performance). Scores < 0 mean agents underperform the baseline!
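
For intuition, the normalization can be read as the sketch below (an illustrative linear formula consistent with the description above; the benchmark's exact definition may differ in detail, and the variable names are placeholders):

```python
def normalized_score(agent, baseline, top_human):
    # 0 at the baseline, 100 at the top human solution, negative below the baseline
    return 100 * (agent - baseline) / (top_human - baseline)
```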

---

## 🧠 Why This Matters

MLRC-BENCH is a **stress test for research agents**. It doesn’t just ask “Can LLMs code?”—it asks:
> Can LLMs **propose and implement** solutions that outperform known baselines on hard problems?

If we want to build autonomous research agents that assist or even collaborate with human scientists, **benchmarks like MLRC-BENCH are essential**.

---

## 📍 Try It Yourself

Check out the tasks and submit your own agent:

👉 We will open the submission link in the near future. Stay tuned!

Let’s see if your agent can beat the benchmark!

    """)
    
    st.markdown("""
    <div class="card">
        <div class="card-title"><span class="card-title-icon">🔍</span> Tasks in the Benchmark</div>
        <p style="margin-bottom: 20px;">
            Click on any task to learn more.
        </p>
    </div>
    """, unsafe_allow_html=True)

    # Task links mapping - using original task names
    original_task_links = {
        "Backdoor Trigger Recovery": "https://www.llmagentsafetycomp24.com/tracks/#backdoor_model",
        "Machine Unlearning": "https://unlearning-challenge.github.io/",
        "Perception Temporal Action Loc": "https://ptchallenge-workshop.github.io",
        "Product Recommendation": "https://www.aicrowd.com/challenges/amazon-kdd-cup-23-multilingual-recommendation-challenge",
        "Meta Learning": "https://metalearning.chalearn.org/",
        "Llm Merging": "https://llm-merging.github.io",
        "Rainfall Prediction": "https://weather4cast.net/neurips-2023/"
    }
    
    # Update links mapping to use display names as keys
    task_links = {get_display_name(task): link for task, link in original_task_links.items()}
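    # Assumes the display names returned by get_display_name match the keys of tasks_info;
    # otherwise the lookups below fall back to "#" (no link).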

    # Create two columns
    col1, col2 = st.columns(2)
    
    # Split tasks between the two columns with better styling
    task_items = list(tasks_info.items())
    mid_point = len(task_items) // 2
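    # Integer division: with an odd number of tasks, the second column receives the extra card.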
    
    with col1:
        for task, description in task_items[:mid_point]:
            link = task_links.get(task, "#")
            st.markdown(f"""
            <a href="{link}" target="_blank" style="text-decoration: none; color: inherit;">
                <div class="task-card" style="cursor: pointer; transition: transform 0.2s, box-shadow 0.2s; padding: 12px; margin-bottom: 15px; height: auto;" onmouseover="this.style.transform='translateY(-5px)'; this.style.boxShadow='0 8px 15px rgba(0, 0, 0, 0.2)';" onmouseout="this.style.transform='translateY(0)'; this.style.boxShadow='0 4px 6px rgba(0, 0, 0, 0.15)';">
                    <div class="task-title" style="text-align: center;">{task} <span style="font-size: 14px; opacity: 0.7;">🔗</span></div>
                </div>
            </a>
            """, unsafe_allow_html=True)
    
    with col2:
        for task, description in task_items[mid_point:]:
            link = task_links.get(task, "#")
            st.markdown(f"""
            <a href="{link}" target="_blank" style="text-decoration: none; color: inherit;">
                <div class="task-card" style="cursor: pointer; transition: transform 0.2s, box-shadow 0.2s; padding: 12px; margin-bottom: 15px; height: auto;" onmouseover="this.style.transform='translateY(-5px)'; this.style.boxShadow='0 8px 15px rgba(0, 0, 0, 0.2)';" onmouseout="this.style.transform='translateY(0)'; this.style.boxShadow='0 4px 6px rgba(0, 0, 0, 0.15)';">
                    <div class="task-title" style="text-align: center;">{task} <span style="font-size: 14px; opacity: 0.7;">🔗</span></div>
                </div>
            </a>
            """, unsafe_allow_html=True)