""" Task description components for the leaderboard application. """ import streamlit as st from src.utils.config import tasks_info def render_task_descriptions(): """ Render the benchmark details section """ # Display the MLRC-BENCH image st.image("Assests/MLRC_Bench_overview.png", use_column_width=True) # Display the MLRC-BENCH information st.markdown(""" # MLRC-BENCH: Can Language Agents Crack ML Research Challenges? Recent advances in large language models (LLMs) have raised an intriguing question for the machine learning community: Can AI agents not only generate novel research ideas but also implement them effectively? A new benchmark, **MLRC-BENCH**, steps into the spotlight to answer this very question. ## What Is MLRC-BENCH? MLRC-BENCH is a dynamic benchmark designed to objectively evaluate whether LLM-based research agents can tackle cutting-edge ML competition tasks. Unlike previous evaluations that either focused on end-to-end paper generation or narrow engineering challenges, this benchmark splits the research workflow into two core steps: - **Idea Proposal:** Generating innovative research ideas. - **Code Implementation:** Translating those ideas into working, performance-improving code. The benchmark uses tasks sourced from recent ML conferences and workshops, ensuring the problems are both impactful and non-trivial. ## How Does It Work? MLRC-BENCH emphasizes **objective metrics**: - **Success Rate:** An agent is deemed successful if its solution improves upon a baseline by at least 5% of the margin by which the top human solution surpasses that baseline. - **Performance, Efficiency & Simplicity:** Each solution is measured not only by how well it performs but also by how efficient and simple the code is. For example, an ideal solution should achieve higher performance with minimal runtime and code complexity. Additionally, the benchmark integrates **LLM-as-a-judge evaluations** to compare subjective assessments of idea novelty with the objective performance gains. Interestingly, the study reveals a weak correlation between perceived novelty and actual performance improvements. ## Why It Matters The ability for AI agents to contribute to scientific discovery is both exciting and cautionary. While MLRC-BENCH demonstrates that current agents are not yet ready to match human ingenuity, it also provides a scalable framework to track progress and encourage future innovations. The insights gained from this benchmark could guide the development of safer, more effective AI research tools, particularly in high-stakes fields like healthcare, climate science, and AI safety. ## Looking Ahead MLRC-BENCH is built to evolve: as new ML competitions emerge, the benchmark can be updated to reflect the latest challenges. This dynamic nature ensures that it remains a relevant tool for pushing the boundaries of AI-assisted scientific research. """) st.markdown("""

🔍 Tasks in the Benchmark

Click on any task to learn more about the original benchmark.

""", unsafe_allow_html=True) # Task links mapping task_links = { "Backdoor Trigger Recovery": "https://www.llmagentsafetycomp24.com/tracks/#backdoor_model", "Machine Unlearning": "https://unlearning-challenge.github.io/", "Perception Temporal Action Loc": "https://ptchallenge-workshop.github.io", "Product Recommendation": "https://www.aicrowd.com/challenges/amazon-kdd-cup-23-multilingual-recommendation-challenge", "Meta Learning": "https://metalearning.chalearn.org/", "Llm Merging": "https://llm-merging.github.io" } # Create two columns col1, col2 = st.columns(2) # Split tasks between the two columns with better styling task_items = list(tasks_info.items()) mid_point = len(task_items) // 2 with col1: for task, description in task_items[:mid_point]: link = task_links.get(task, "#") st.markdown(f"""

{task} 🔗

{description}

""", unsafe_allow_html=True) with col2: for task, description in task_items[mid_point:]: link = task_links.get(task, "#") st.markdown(f"""

{task} 🔗

{description}

""", unsafe_allow_html=True)