""" Task description components for the leaderboard application. """ import streamlit as st from src.utils.config import tasks_info from src.utils.task_mapping import get_display_name, get_original_name def render_task_descriptions(): """ Render the benchmark details section """ # Display the MLRC-BENCH image st.image("Assests/MLRC_Bench_overview.png", use_column_width=True) # Display the MLRC-BENCH information st.markdown(""" ## MLRC-BENCH: Can Language Agents Solve ML Research Challenges? Recent advances in large language models (LLMs) have motivated a critical question in the machine learning community: can AI agents not only propose novel research ideas but also translate them into effective implementations? **MLRC-BENCH** is introduced as a new benchmark to investigate this question by rigorously evaluating the capacity of LLM-based research agents to address contemporary ML competition tasks. --- ### Benchmark Overview MLRC-BENCH seeks to assess AI-driven research workflows in two primary dimensions: - **Idea Proposal**: Generating plausible and potentially innovative methods for addressing current ML research problems. - **Code Implementation**: Translating these ideas into executable solutions that measurably improve performance over a baseline. This design contrasts with prior benchmarks that emphasize either (1) full end-to-end paper generation assessed by subjective human or LLM reviews, or (2) isolated code-generation tasks that focus on engineering challenges. By dividing the problem into idea proposal and implementation, MLRC-BENCH provides a clearer measure of how well agents can form and operationalize research insights. --- ### Evaluation Criteria For each agent on a given task, MLRC-BENCH measures performance relative to a **baseline** method and a **top human** benchmark. 
We report two primary metrics, each taken from the maximum result across all experimental runs for a task-model pair: - **Relative Improvement to Human** How effectively the agent closes the gap between the baseline and the best human solution. - **Absolute Improvement to Baseline** How much better the agent performs compared to the baseline, expressed as a percentage gain. --- ### Significance MLRC-BENCH emphasizes rigorous and reproducible evaluations, focusing on tasks drawn from recent machine learning conferences and workshops to ensure that tested methods are both **meaningful** and **nontrivial**. This dynamic approach allows the benchmark to grow as new competition tasks arise, enabling continuous monitoring of progress in agent-driven research. Through its emphasis on objective success criteria, MLRC-BENCH fosters the development of AI agents that more effectively balance conceptual innovation with practical impact. --- ### Future Directions While current results suggest that LLM-based research agents still fall short of human capabilities in creativity and code implementation, MLRC-BENCH provides a **scalable mechanism** to track and accelerate progress. As AI methods advance—and potentially branch into high-stakes domains such as healthcare and climate modeling—this benchmark could serve as a critical resource for aligning agent innovation with **reliability** and **safety**. """) st.markdown("""
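The two metrics above can be sketched as simple ratios. This is a minimal illustration only: the function names and exact formulas below are assumptions for clarity, not the benchmark's official implementation.

```python
def relative_improvement_to_human(agent_score, baseline_score, human_score):
    # Fraction of the baseline-to-top-human gap closed by the agent.
    return (agent_score - baseline_score) / (human_score - baseline_score)


def absolute_improvement_to_baseline(agent_score, baseline_score):
    # Percentage gain of the agent over the baseline.
    return 100.0 * (agent_score - baseline_score) / baseline_score


# Each metric is computed from the best (maximum) score
# across all experimental runs for a task-model pair.
run_scores = [0.55, 0.62, 0.58]
agent_score = max(run_scores)
print(relative_improvement_to_human(agent_score, 0.50, 0.80))  # fraction of gap closed
print(absolute_improvement_to_baseline(agent_score, 0.50))     # percent over baseline
```
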
Click on any task to learn more.