How to Get Your LLM to Generate Challenging Problems for Evaluation
Abstract
The pace of evolution of Large Language Models (LLMs) necessitates new approaches for rigorous and comprehensive evaluation. Traditional human annotation is increasingly impracticable due to the complexities and costs involved in generating high-quality, challenging problems. In this work, we introduce CHASE, a unified framework to synthetically generate challenging problems using LLMs without human involvement. For a given task, our approach builds a hard problem in a bottom-up manner from simpler components. Moreover, our framework decomposes the generation process into independently verifiable sub-tasks, thereby ensuring a high level of quality and correctness. We implement CHASE to create evaluation benchmarks across three diverse domains: (1) document-based question answering, (2) repository-level code completion, and (3) math reasoning. The performance of state-of-the-art LLMs on these synthetic benchmarks lies in the range of 40-60% accuracy, thereby demonstrating the effectiveness of our framework at generating challenging problems. We publicly release our benchmarks and code.
Community
Presenting ✨ CHASE: Generating challenging synthetic data for evaluation ✨
Why synthetic data for evaluation?
- Creating "hard" problems using humans is expensive (and may hit a limit soon!)
- Impractical for humans to annotate long-context data
- Other benefits: scalable, renewable, mitigate contamination concerns
CHASE automatically generates challenging evaluation problems across 3 domains:
- CHASE-QA: Long-context question answering
- CHASE-Code: Repo-level code generation
- CHASE-Math: Math reasoning
CHASE uses 2 simple ideas (see the sketch after this list):
- Bottom-up creation of complex context by "hiding" components of the reasoning process
- Decomposing the generation pipeline into simpler, "soft-verifiable" sub-tasks
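A minimal sketch of these two ideas, illustrated for the math-reasoning setting: a simple seed problem is extended bottom-up one step at a time, and each candidate step is accepted only if an independent solver call reproduces its answer. This is our own toy illustration under stated assumptions, not the CHASE codebase; `call_llm`, `parse_step`, `SubProblem`, and the prompt formats are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class SubProblem:
    text: str      # natural-language statement of the (partial) problem
    answer: float  # ground-truth answer, known because we generated each step


def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to any chat-completion API and return its text."""
    raise NotImplementedError


def parse_step(response: str) -> tuple[str, float]:
    """Hypothetical parser for the 'NEW_SENTENCE ||| NEW_NUMERIC_ANSWER' format requested below."""
    sentence, answer = response.split("|||")
    return sentence.strip(), float(answer.strip())


def generate_step(problem: SubProblem) -> SubProblem:
    """Bottom-up composition: ask the LLM to add exactly one more reasoning step."""
    prompt = (
        "Extend this word problem with exactly one additional arithmetic step.\n"
        "Reply as: NEW_SENTENCE ||| NEW_NUMERIC_ANSWER\n\n"
        f"Problem so far: {problem.text}\nCurrent answer: {problem.answer}"
    )
    sentence, answer = parse_step(call_llm(prompt))
    return SubProblem(text=f"{problem.text} {sentence}", answer=answer)


def verify_step(step: SubProblem) -> bool:
    """Soft verification: an independent solver call must reproduce the stored answer."""
    solved = call_llm(f"Solve and return only the final number:\n{step.text}")
    try:
        return abs(float(solved.strip()) - step.answer) < 1e-6
    except ValueError:
        return False


def build_hard_problem(seed: SubProblem, depth: int) -> SubProblem:
    """Chain `depth` verified extensions onto a simple seed problem."""
    problem = seed
    for _ in range(depth):
        candidate = generate_step(problem)
        if verify_step(candidate):  # accept a step only if it passes verification
            problem = candidate
    return problem
```

Because every extension is checked as a separate, easier sub-task, the final composed problem carries a trusted answer even though no human annotated it.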
Results:
- SOTA LLMs achieve 40-60% performance
- CHASE distinguishes well between models, unlike the near-identical scores they obtain on standard benchmarks like GSM8k
- Although today's LLMs offer 128k-1M context windows, CHASE shows they struggle to reason even over ~50k tokens of context
Note: Our work is a preliminary exploration of automatically generating high-quality, challenging benchmarks for LLMs. We discuss concrete limitations and the substantial scope for future work in the paper.
Links:
Data: tinyurl.com/chase-data
Code: https://github.com/McGill-NLP/CHASE
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- CoReQA: Uncovering Potentials of Language Models in Code Repository Question Answering (2025)
- Leveraging Metamemory Mechanisms for Enhanced Data-Free Code Generation in LLMs (2025)
- Correctness Assessment of Code Generated by Large Language Models Using Internal Representations (2025)
- Pseudocode-Injection Magic: Enabling LLMs to Tackle Graph Computational Tasks (2025)
- UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance (2025)
- Dynamic Scaling of Unit Tests for Code Reward Modeling (2025)
- LLM-ProS: Analyzing Large Language Models' Performance in Competitive Problem Solving (2025)