arxiv:2502.16069

Curie: Toward Rigorous and Automated Scientific Experimentation with AI Agents

Published on Feb 22
· Submitted by AmberLJC on Feb 26

Abstract

Scientific experimentation, a cornerstone of human progress, demands rigor in reliability, methodical control, and interpretability to yield meaningful results. Despite the growing capabilities of large language models (LLMs) in automating different aspects of the scientific process, automating rigorous experimentation remains a significant challenge. To address this gap, we propose Curie, an AI agent framework designed to embed rigor into the experimentation process through three key components: an intra-agent rigor module to enhance reliability, an inter-agent rigor module to maintain methodical control, and an experiment knowledge module to enhance interpretability. To evaluate Curie, we design a novel experimental benchmark composed of 46 questions across four computer science domains, derived from influential research papers and widely adopted open-source projects. Compared to the strongest baseline tested, we achieve a 3.4× improvement in correctly answering experimental questions. Curie is open-sourced at https://github.com/Just-Curieous/Curie.

Community

Paper submitter

Move Scientific Research at the Speed of Thought. This paper introduces Curie, an AI agent framework designed to automate scientific research experimentation. By integrating modules that enhance reliability, enforce methodical control, and improve interpretability, Curie addresses the critical challenges of automating rigorous experimentation. Curie is able to reproduce several AI research papers through experimentation.
Evaluated against an experimentation benchmark spanning multiple computer science domains, Curie demonstrated a 3.4× improvement in accurately answering experimental questions compared to existing baselines.

So cool that you guys used Madame's name! I had one issue with it: while the abstract really gives the idea that we would get rigorous validation, I felt the questions sent to the model are a bit too specific, in the sense that they provide the model the precise location of the issue. It's relevant because my understanding of validation is more: given sentence x, which may or may not have an issue, is it valid? The experiment was more: given this sentence, which has a problem that needs solving, what's the solution? Appreciate it!

Paper submitter

Thank you for the thoughtful feedback! 😊

You bring up an excellent point regarding the input questions. We aimed to strike a balance between open-ended validation (e.g., "Is this valid?", "What is the relationship between A and B?", "What is the best configuration choice?") and targeted problem-solving (e.g., "What is the solution to this identified issue?").

For this initial evaluation, we focused on a more directed approach to assess Curie’s ability to provide precise, actionable insights, which is why the questions may seem specific. However, we absolutely see the value in broader, more open-ended validation tasks and agree that this is a natural and important next step. Happy to discuss more!

Thanks for this, much appreciated, and looking forward to the next version!
