How to Get Your LLM to Generate Challenging Problems for Evaluation
Abstract
The pace of evolution of Large Language Models (LLMs) necessitates new approaches for rigorous and comprehensive evaluation. Traditional human annotation is increasingly impracticable due to the complexities and costs involved in generating high-quality, challenging problems. In this work, we introduce CHASE, a unified framework to synthetically generate challenging problems using LLMs without human involvement. For a given task, our approach builds a hard problem in a bottom-up manner from simpler components. Moreover, our framework decomposes the generation process into independently verifiable sub-tasks, thereby ensuring a high level of quality and correctness. We implement CHASE to create evaluation benchmarks across three diverse domains: (1) document-based question answering, (2) repository-level code completion, and (3) math reasoning. The performance of state-of-the-art LLMs on these synthetic benchmarks lies in the range of 40-60% accuracy, thereby demonstrating the effectiveness of our framework at generating challenging problems. We publicly release our benchmarks and code.
Community
Presenting ✨ CHASE: Generating challenging synthetic data for evaluation ✨
Why synthetic data for evaluation?
- Creating "hard" problems using humans is expensive (and may hit a limit soon!)
- Impractical for humans to annotate long-context data
- Other benefits: scalable, renewable, mitigate contamination concerns
CHASE automatically generates challenging evaluation problems across 3 domains:
- CHASE-QA: Long-context question answering
- CHASE-Code: Repo-level code generation
- CHASE-Math: Math reasoning
CHASE uses 2 simple ideas (see the sketch after this list):
- Bottom-up creation of complex context by "hiding" components of the reasoning process
- Decomposing the generation pipeline into simpler, "soft-verifiable" sub-tasks
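A minimal sketch of these two ideas, illustrated for the math-reasoning setting: a simple seed problem is extended bottom-up one step at a time, and each candidate step is accepted only if an independent solver call reproduces its answer. This is our own toy illustration under stated assumptions, not the CHASE codebase; `call_llm`, `parse_step`, `SubProblem`, and the prompt formats are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class SubProblem:
    text: str      # natural-language statement of the (partial) problem
    answer: float  # ground-truth answer, known because we generated each step


def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to any chat-completion API and return its text."""
    raise NotImplementedError


def parse_step(response: str) -> tuple[str, float]:
    """Hypothetical parser for the 'NEW_SENTENCE ||| NEW_NUMERIC_ANSWER' format requested below."""
    sentence, answer = response.split("|||")
    return sentence.strip(), float(answer.strip())


def generate_step(problem: SubProblem) -> SubProblem:
    """Bottom-up composition: ask the LLM to add exactly one more reasoning step."""
    prompt = (
        "Extend this word problem with exactly one additional arithmetic step.\n"
        "Reply as: NEW_SENTENCE ||| NEW_NUMERIC_ANSWER\n\n"
        f"Problem so far: {problem.text}\nCurrent answer: {problem.answer}"
    )
    sentence, answer = parse_step(call_llm(prompt))
    return SubProblem(text=f"{problem.text} {sentence}", answer=answer)


def verify_step(step: SubProblem) -> bool:
    """Soft verification: an independent solver call must reproduce the stored answer."""
    solved = call_llm(f"Solve and return only the final number:\n{step.text}")
    try:
        return abs(float(solved.strip()) - step.answer) < 1e-6
    except ValueError:
        return False


def build_hard_problem(seed: SubProblem, depth: int) -> SubProblem:
    """Chain `depth` verified extensions onto a simple seed problem."""
    problem = seed
    for _ in range(depth):
        candidate = generate_step(problem)
        if verify_step(candidate):  # accept a step only if it passes verification
            problem = candidate
    return problem
```

Because every extension is checked as a separate, easier sub-task, the final composed problem carries a trusted answer even though no human annotated it.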
Results:
- SOTA LLMs achieve 40-60% performance
- CHASE distinguishes well between models, unlike the near-identical scores they obtain on standard benchmarks like GSM8k
- Although today's LLMs offer 128k-1M context windows, CHASE shows they struggle to reason even over ~50k tokens of context
Note: Our work is a preliminary exploration of automatically generating high-quality, challenging benchmarks for LLMs. We discuss concrete limitations and the substantial scope for future work in the paper.
Links:
Data: tinyurl.com/chase-data
Code: https://github.com/McGill-NLP/CHASE
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- CoReQA: Uncovering Potentials of Language Models in Code Repository Question Answering (2025)
- Leveraging Metamemory Mechanisms for Enhanced Data-Free Code Generation in LLMs (2025)
- Correctness Assessment of Code Generated by Large Language Models Using Internal Representations (2025)
- Pseudocode-Injection Magic: Enabling LLMs to Tackle Graph Computational Tasks (2025)
- UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance (2025)
- Dynamic Scaling of Unit Tests for Code Reward Modeling (2025)
- LLM-ProS: Analyzing Large Language Models' Performance in Competitive Problem Solving (2025)