Curie: Toward Rigorous and Automated Scientific Experimentation with AI Agents
Abstract
Scientific experimentation, a cornerstone of human progress, demands rigor in reliability, methodical control, and interpretability to yield meaningful results. Despite the growing capabilities of large language models (LLMs) in automating different aspects of the scientific process, automating rigorous experimentation remains a significant challenge. To address this gap, we propose Curie, an AI agent framework designed to embed rigor into the experimentation process through three key components: an intra-agent rigor module to enhance reliability, an inter-agent rigor module to maintain methodical control, and an experiment knowledge module to enhance interpretability. To evaluate Curie, we design a novel experimental benchmark composed of 46 questions across four computer science domains, derived from influential research papers and widely adopted open-source projects. Compared to the strongest baseline tested, we achieve a 3.4× improvement in correctly answering experimental questions. Curie is open-sourced at https://github.com/Just-Curieous/Curie.
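To make the three modules concrete, here is a minimal, hypothetical Python sketch of how such a rigor pipeline could be wired together. This is not Curie's actual API or implementation (see the GitHub repository for that); the class names, the required plan fields, and the validate-then-dispatch loop are all illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class ExperimentRecord:
    """Experiment knowledge module (illustrative): keeps an interpretable
    trail of decisions and results across the experiment lifecycle."""
    entries: list = field(default_factory=list)

    def log(self, stage: str, detail: str) -> None:
        self.entries.append((stage, detail))

    def report(self) -> str:
        return "\n".join(f"[{stage}] {detail}" for stage, detail in self.entries)


class IntraAgentRigor:
    """Intra-agent rigor (illustrative): reliability checks on a single
    agent's output, e.g. that an experiment plan names its variables."""

    def validate_plan(self, plan: dict) -> bool:
        required = {"hypothesis", "independent_vars", "metrics"}
        return required.issubset(plan)


class InterAgentRigor:
    """Inter-agent rigor (illustrative): methodical control between agents,
    dispatching work to an executor only after the plan passes validation."""

    def __init__(self, validator: IntraAgentRigor, record: ExperimentRecord):
        self.validator = validator
        self.record = record

    def run(self, plan: dict) -> str:
        if not self.validator.validate_plan(plan):
            self.record.log("control", "plan rejected: missing required fields")
            return "rejected"
        self.record.log("control", "plan accepted; dispatching to executor")
        # A real system would launch the experiment here; we only simulate it.
        self.record.log("execute", f"ran experiment for: {plan['hypothesis']}")
        return "completed"


if __name__ == "__main__":
    record = ExperimentRecord()
    controller = InterAgentRigor(IntraAgentRigor(), record)
    status = controller.run({
        "hypothesis": "Batching improves LLM serving throughput",
        "independent_vars": ["batch_size"],
        "metrics": ["requests_per_second"],
    })
    print(status)
    print(record.report())
```

The design point this sketch tries to capture is that methodical control sits between agents: the controller refuses to dispatch any plan the reliability checks reject, and every decision is logged so the resulting trail stays interpretable.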
Community
Move Scientific Research at the Speed of Thought. This paper introduces Curie, an AI agent framework designed to automate scientific research experimentation. By integrating modules that enhance reliability, enforce methodical control, and improve interpretability, Curie addresses the critical challenges of automating rigorous experimentation. Curie is able to reproduce several AI research papers through experimentation.
Evaluated against an experimentation benchmark spanning multiple computer science domains, Curie demonstrated a 3.4× improvement in accurately answering experimental questions compared to the strongest existing baseline.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems (2025)
- Aviary: training language agents on challenging scientific tasks (2024)
- JustLogic: A Comprehensive Benchmark for Evaluating Deductive Reasoning in Large Language Models (2025)
- Patterns Over Principles: The Fragility of Inductive Reasoning in LLMs under Noisy Observations (2025)
- Transforming Science with Large Language Models: A Survey on AI-assisted Scientific Discovery, Experimentation, Content Generation, and Evaluation (2025)
- Evaluating Sakana's AI Scientist for Autonomous Research: Wishful Thinking or an Emerging Reality Towards 'Artificial Research Intelligence' (ARI)? (2025)
- SOP-Agent: Empower General Purpose AI Agent with Domain-Specific SOPs (2025)
So cool that you guys used Madame's name! I had one issue with it: while the abstract really gives the idea that we would get rigorous validation, I felt the questions sent to the model are a bit too specific, in the sense that they provide the model the precise location of the issue. It's relevant because my understanding of validation is more: given sentence x, which may or may not have an issue, is it valid? The experiment was more: given this sentence, which has a problem that needs a solution, what's the solution? Appreciate it!
Thank you for the thoughtful feedback! 😊
You bring up an excellent point regarding the input questions. We aimed to strike a balance between open-ended validation (e.g., "Is this valid?", "What is the relationship between A and B?", "What is the best configuration choice?") and targeted problem-solving (e.g., "What is the solution to this identified issue?").
For this initial evaluation, we focused on a more directed approach to assess Curie’s ability to provide precise, actionable insights, which is why the questions may seem specific. However, we absolutely see the value in broader, more open-ended validation tasks and agree that this is a natural and important next step. Happy to discuss more!
Thanks for this, much appreciated, and looking forward to the next version!