"""
Task description components for the leaderboard application.
"""
import streamlit as st
from src.utils.config import tasks_info
from src.utils.task_mapping import get_display_name, get_original_name
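
# Illustrative sketch only (not used by this component or the leaderboard pipeline):
# the benchmark text rendered below describes normalized scores where 0 corresponds to
# the baseline solution and 100 to the top human solution, and negative values mean the
# agent underperforms the baseline. Under that stated convention, a normalized
# improvement could be computed along these lines; the function name and signature are
# assumptions made for illustration.
def normalized_improvement(agent_score: float, baseline_score: float, top_human_score: float) -> float:
    """Map a raw metric value onto the 0 (baseline) to 100 (top human) scale described on this page."""
    gap = top_human_score - baseline_score
    if gap == 0:
        # Degenerate case: the baseline already matches the top human solution.
        return 0.0
    return 100.0 * (agent_score - baseline_score) / gap

# Hypothetical numbers: normalized_improvement(0.52, 0.50, 0.715) is roughly 9.3,
# i.e. the agent closes about 9.3% of the baseline-to-human gap.
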
def render_task_descriptions():
    """
    Render the benchmark details section.
    """
    # Display the MLRC-BENCH overview figure
    st.image("Assests/MLRC_Bench_overview.png", use_column_width=True)

    # Display the MLRC-BENCH information
st.markdown("""
# Can Language Agents Solve Machine Learning Research Challenges?
🚀 Introducing [MLRC-BENCH](https://huggingface.co/spaces/launch/MLRC_Bench), a new benchmark suite designed to test the scientific chops of LLM-based agents on real-world machine learning (ML) research problems.
---
## 🤖 What's the Problem?
While recent large language model (LLM) agents have made impressive strides in reasoning, coding, and even paper writing, current benchmarks fall short in evaluating their ability to generate **novel and effective research ideas**.
Most existing efforts either:
- Ask agents to write entire research papers, but use **subjective evaluation** (e.g., LLMs or humans judging ideas).
- Or evaluate agents on **Kaggle-style tasks**, which rarely require real innovation.
Both setups miss the mark when it comes to assessing whether LLM agents can truly **advance the ML research frontier**.
---
## 🧪 Enter MLRC-BENCH
**MLRC-BENCH** fills this gap by evaluating agents on **real ML research competitions** hosted at NeurIPS, ECCV, and other top venues. These tasks represent cutting-edge challenges in:
- LLM safety
- Multimodal perception
- Few-shot learning
- Machine unlearning
- Meta learning
- And more!
Each task demands novel method design—not just re-implementing existing solutions.
### ✅ What Makes MLRC-BENCH Unique?
- **Objective Evaluation**: Agents are scored on real metrics (accuracy, ROUGE, MRR, etc.)—no LLM-as-a-judge handwaving.
- **Compute-Constrained**: Tasks come with GPU and runtime limits, simulating real-world resource constraints.
- **Tamper-Proof Setup**: Agents can only modify specific parts of the starter code; test data remains hidden.
- **Continually Updated**: New competition tasks will be added as ML research progresses.
---
## 📉 What Did We Find?
Despite access to top-tier LLMs like GPT-4o, Claude 3.5, and Gemini, **agents struggle**:
- The best-performing agent (Gemini under the MLAB scaffolding) closes only **9.3% of the performance gap** between the baseline and the top human solution.
- Providing additional ideas from humans or other agents doesn't consistently help.
- LLMs often rate their own ideas as “innovative,” but objective metrics show they underperform.
📊 **Key Insight**: There’s a clear **misalignment between subjective novelty and actual effectiveness**.
---
## 🔬 Under the Hood
MLRC-BENCH comes with:
- **7 fully prepared tasks** with unified code structure.
- **Development & test splits** for fair comparison.
- **Metrics for effectiveness, efficiency (runtime), and simplicity (lines of code)**.
- A leaderboard showcasing normalized improvements over baselines.
> Normalized scores range from 0 (baseline) to 100 (top human performance). Scores < 0 mean agents underperform the baseline!
---
## 🧠 Why This Matters
MLRC-BENCH is a **stress test for research agents**. It doesn’t just ask “Can LLMs code?”—it asks:
> Can LLMs **propose and implement** solutions that outperform known baselines on hard problems?
If we want to build autonomous research agents that assist or even collaborate with human scientists, **benchmarks like MLRC-BENCH are essential**.
---
## 📍 Try It Yourself
Check out the tasks and submit your own agent:
👉 The submission link will open soon. Stay tuned!
Let’s see if your agent can beat the benchmark!
""")
st.markdown("""
<div class="card">
<div class="card-title"><span class="card-title-icon">🔍</span> Tasks in the Benchmark</div>
<p style="margin-bottom: 20px;">
Click on any task to learn more.
</p>
</div>
""", unsafe_allow_html=True)
    # Task links mapping, keyed by the original task names
    original_task_links = {
        "Backdoor Trigger Recovery": "https://www.llmagentsafetycomp24.com/tracks/#backdoor_model",
        "Machine Unlearning": "https://unlearning-challenge.github.io/",
        "Perception Temporal Action Loc": "https://ptchallenge-workshop.github.io",
        "Product Recommendation": "https://www.aicrowd.com/challenges/amazon-kdd-cup-23-multilingual-recommendation-challenge",
        "Meta Learning": "https://metalearning.chalearn.org/",
        "Llm Merging": "https://llm-merging.github.io",
        "Rainfall Prediction": "https://weather4cast.net/neurips-2023/",
    }

    # Re-key the links by display name so they match the task names used in tasks_info
    task_links = {get_display_name(task): link for task, link in original_task_links.items()}
    # Create two columns for the task cards
    col1, col2 = st.columns(2)

    # Split the tasks evenly between the two columns
    task_items = list(tasks_info.items())
    mid_point = len(task_items) // 2
    # Render each task as a clickable card linking to the original competition page
    with col1:
        for task, description in task_items[:mid_point]:
            link = task_links.get(task, "#")
            st.markdown(f"""
            <a href="{link}" target="_blank" style="text-decoration: none; color: inherit;">
                <div class="task-card" style="cursor: pointer; transition: transform 0.2s, box-shadow 0.2s; padding: 12px; margin-bottom: 15px; height: auto;" onmouseover="this.style.transform='translateY(-5px)'; this.style.boxShadow='0 8px 15px rgba(0, 0, 0, 0.2)';" onmouseout="this.style.transform='translateY(0)'; this.style.boxShadow='0 4px 6px rgba(0, 0, 0, 0.15)';">
                    <div class="task-title" style="text-align: center;">{task} <span style="font-size: 14px; opacity: 0.7;">🔗</span></div>
                </div>
            </a>
            """, unsafe_allow_html=True)
    with col2:
        for task, description in task_items[mid_point:]:
            link = task_links.get(task, "#")
            st.markdown(f"""
            <a href="{link}" target="_blank" style="text-decoration: none; color: inherit;">
                <div class="task-card" style="cursor: pointer; transition: transform 0.2s, box-shadow 0.2s; padding: 12px; margin-bottom: 15px; height: auto;" onmouseover="this.style.transform='translateY(-5px)'; this.style.boxShadow='0 8px 15px rgba(0, 0, 0, 0.2)';" onmouseout="this.style.transform='translateY(0)'; this.style.boxShadow='0 4px 6px rgba(0, 0, 0, 0.15)';">
                    <div class="task-title" style="text-align: center;">{task} <span style="font-size: 14px; opacity: 0.7;">🔗</span></div>
                </div>
            </a>
            """, unsafe_allow_html=True)