MLRC_Bench

MLRC_Bench / src /components /tasks.py

yunx-z

Update src/components/tasks.py

8d50b1e verified 7 days ago

	"""
	Task description components for the leaderboard application.
	"""
	import streamlit as st
	from src.utils.config import tasks_info
	from src.utils.task_mapping import get_display_name, get_original_name

	def render_task_descriptions():
	    """
	    Render the benchmark details section
	    """
	    # Display the MLRC-BENCH image
	    st.image("Assests/MLRC_Bench_overview.png", use_column_width=True)

	    # Display the MLRC-BENCH information
	    st.markdown("""

	# Can Language Agents Solve Machine Learning Research Challenges?

	🚀 Introducing [MLRC-BENCH](https://huggingface.co/spaces/launch/MLRC_Bench), a new benchmark suite designed to test the scientific chops of LLM-based agents on real-world machine learning (ML) research problems.

	---

	## 🤖 What's the Problem?

	While recent large language model (LLM) agents have made impressive strides in reasoning, coding, and even paper writing, current benchmarks fall short in evaluating their ability to generate novel and effective research ideas.

	Most existing efforts do one of two things:
	- Ask agents to write entire research papers, but rely on subjective evaluation (e.g., LLMs or humans judging ideas).
	- Evaluate agents on Kaggle-style tasks, which rarely require real innovation.

	Both setups miss the mark when it comes to assessing whether LLM agents can truly advance the ML research frontier.

	---

	## 🧪 Enter MLRC-BENCH

	MLRC-BENCH fills this gap by evaluating agents on real ML research competitions hosted at NeurIPS, ECCV, and other top venues. These tasks represent cutting-edge challenges in:
	- LLM safety
	- Multimodal perception
	- Few-shot learning
	- Machine unlearning
	- Meta learning
	- And more!

	Each task demands novel method design—not just re-implementing existing solutions.

	### ✅ What Makes MLRC-BENCH Unique?

	- Objective Evaluation: Agents are scored on real metrics (accuracy, ROUGE, MRR, etc.)—no LLM-as-a-judge handwaving.
	- Compute-Constrained: Tasks come with GPU and runtime limits, simulating real-world resource constraints.
	- Tamper-Proof Setup: Agents can only modify specific parts of the starter code; test data remains hidden.
	- Continually Updated: New competition tasks will be added as ML research progresses.

	---

	## 📉 What Did We Find?

	Despite access to top-tier LLMs like GPT-4o, Claude 3.5, and Gemini, agents struggle:

	- The best-performing agent (Gemini under MLAB scaffolding) closes only 9.3% of the performance gap between the baseline and the top human solution.
	- Providing additional ideas from humans or other agents doesn't consistently help.
	- LLMs often rate their own ideas as “innovative,” but objective metrics show they underperform.

	📊 Key Insight: There’s a clear misalignment between subjective novelty and actual effectiveness.

	---

	## 🔬 Under the Hood

	MLRC-BENCH comes with:
	- 7 fully prepared tasks with a unified code structure.
	- Development & test splits for fair comparison.
	- Metrics for effectiveness, efficiency (runtime), and simplicity (lines of code).
	- A leaderboard showcasing normalized improvements over baselines.

	> Normalized scores range from 0 (baseline) to 100 (top human performance). Scores < 0 mean agents underperform the baseline!

	---

	## 🧠 Why This Matters

	MLRC-BENCH is a stress test for research agents. It doesn’t just ask “Can LLMs code?”—it asks:
	> Can LLMs propose and implement solutions that outperform known baselines on hard problems?

	If we want to build autonomous research agents that assist or even collaborate with human scientists, benchmarks like MLRC-BENCH are essential.

	---

	## 📍 Try It Yourself

	Check out the tasks and submit your own agent:

	👉 We will open the link for submission in the near future. Stay tuned!

	Let’s see if your agent can beat the benchmark!

	""")

	    st.markdown("""
	<div class="card">
	<div class="card-title"><span class="card-title-icon">🔍</span> Tasks in the Benchmark</div>
	<p style="margin-bottom: 20px;">
	Click on any task to learn more.
	</p>
	</div>
	""", unsafe_allow_html=True)

	    # Task links mapping - using original task names
	    original_task_links = {
	        "Backdoor Trigger Recovery": "https://www.llmagentsafetycomp24.com/tracks/#backdoor_model",
	        "Machine Unlearning": "https://unlearning-challenge.github.io/",
	        "Perception Temporal Action Loc": "https://ptchallenge-workshop.github.io",
	        "Product Recommendation": "https://www.aicrowd.com/challenges/amazon-kdd-cup-23-multilingual-recommendation-challenge",
	        "Meta Learning": "https://metalearning.chalearn.org/",
	        "Llm Merging": "https://llm-merging.github.io",
	        "Rainfall Prediction": "https://weather4cast.net/neurips-2023/"
	    }

	    # Update links mapping to use display names as keys
	    task_links = {get_display_name(task): link for task, link in original_task_links.items()}

	    # Create two columns
	    col1, col2 = st.columns(2)

	    # Split tasks between the two columns with better styling
	    task_items = list(tasks_info.items())
	    mid_point = len(task_items) // 2

	    with col1:
	        for task, description in task_items[:mid_point]:
	            link = task_links.get(task, "#")
	            st.markdown(f"""
	<a href="{link}" target="_blank" style="text-decoration: none; color: inherit;">
	<div class="task-card" style="cursor: pointer; transition: transform 0.2s, box-shadow 0.2s; padding: 12px; margin-bottom: 15px; height: auto;" onmouseover="this.style.transform='translateY(-5px)'; this.style.boxShadow='0 8px 15px rgba(0, 0, 0, 0.2)';" onmouseout="this.style.transform='translateY(0)'; this.style.boxShadow='0 4px 6px rgba(0, 0, 0, 0.15)';">
	<div class="task-title" style="text-align: center;">{task} <span style="font-size: 14px; opacity: 0.7;">🔗</span></div>
	</div>
	</a>
	""", unsafe_allow_html=True)

	    with col2:
	        for task, description in task_items[mid_point:]:
	            link = task_links.get(task, "#")
	            st.markdown(f"""
	<a href="{link}" target="_blank" style="text-decoration: none; color: inherit;">
	<div class="task-card" style="cursor: pointer; transition: transform 0.2s, box-shadow 0.2s; padding: 12px; margin-bottom: 15px; height: auto;" onmouseover="this.style.transform='translateY(-5px)'; this.style.boxShadow='0 8px 15px rgba(0, 0, 0, 0.2)';" onmouseout="this.style.transform='translateY(0)'; this.style.boxShadow='0 4px 6px rgba(0, 0, 0, 0.15)';">
	<div class="task-title" style="text-align: center;">{task} <span style="font-size: 14px; opacity: 0.7;">🔗</span></div>
	</div>
	</a>
	""", unsafe_allow_html=True)