arxiv:2502.14678

How to Get Your LLM to Generate Challenging Problems for Evaluation

Published on Feb 20 · Submitted by arkilpatel on Feb 21

Abstract

The pace of evolution of Large Language Models (LLMs) necessitates new approaches for rigorous and comprehensive evaluation. Traditional human annotation is increasingly impracticable due to the complexities and costs involved in generating high-quality, challenging problems. In this work, we introduce CHASE, a unified framework to synthetically generate challenging problems using LLMs without human involvement. For a given task, our approach builds a hard problem in a bottom-up manner from simpler components. Moreover, our framework decomposes the generation process into independently verifiable sub-tasks, thereby ensuring a high level of quality and correctness. We implement CHASE to create evaluation benchmarks across three diverse domains: (1) document-based question answering, (2) repository-level code completion, and (3) math reasoning. The performance of state-of-the-art LLMs on these synthetic benchmarks lies in the range of 40-60% accuracy, thereby demonstrating the effectiveness of our framework at generating challenging problems. We publicly release our benchmarks and code.
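
To make the bottom-up construction concrete, here is a minimal, illustrative sketch (not the authors' implementation) in the spirit of the math-reasoning setting: a harder problem is assembled by chaining simple components whose individual answers are easy to verify, while the intermediate values stay hidden from the solver. The example problem and all names below are invented for illustration.

```python
# Minimal illustrative sketch of "bottom-up" problem construction
# (not the CHASE implementation). A harder multi-step word problem is
# built by chaining simple components whose answers are trivially
# verifiable in isolation; the intermediate values are hidden, so a
# solver must reconstruct the whole reasoning chain.

from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    text: str                    # natural-language description of the step
    apply: Callable[[int], int]  # ground-truth transformation, used for verification


def compose_problem(start: int, steps: List[Step]) -> Tuple[str, int]:
    """Chain simple steps into one harder problem and track its gold answer."""
    answer = start
    lines = [f"A library starts with {start} books."]
    for step in steps:
        answer = step.apply(answer)  # each component is easy to check on its own
        lines.append(step.text)      # only the text is exposed, never the value
    lines.append("How many books does one branch have now?")
    return " ".join(lines), answer


problem, gold = compose_problem(
    start=120,
    steps=[
        Step("A donation doubles the collection.", lambda x: x * 2),
        Step("Then 35 damaged books are discarded.", lambda x: x - 35),
        Step("Finally the books are split evenly across 5 branches.", lambda x: x // 5),
    ],
)
print(problem)  # the composed, harder problem shown to the model
print(gold)     # 41 -- the hidden gold answer kept for evaluation
```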

Community

Paper author · Paper submitter

Presenting ✨ CHASE: Generating challenging synthetic data for evaluation ✨

Why synthetic data for evaluation?

  • Creating "hard" problems with human annotators is expensive (and may hit a limit soon!)
  • Impractical for humans to annotate long-context data
  • Other benefits: scalable, renewable, and mitigates contamination concerns

CHASE automatically generates challenging evaluation problems across 3 domains:

  1. CHASE-QA: Long-context question answering
  2. CHASE-Code: Repo-level code generation
  3. CHASE-Math: Math reasoning

CHASE uses 2 simple ideas:

  1. Bottom-up creation of complex context by "hiding" components of the reasoning process
  2. Decomposing the generation pipeline into simpler, "soft-verifiable" sub-tasks (a rough sketch follows below)
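
As a rough sketch of the second idea (assuming a generic text-completion API; `call_llm` and the prompts below are hypothetical placeholders, not the CHASE prompts), each small generation sub-task is paired with an independent verification call, and a component is only accepted once the verifier agrees:

```python
# Illustrative sketch of decomposing generation into "soft-verifiable"
# sub-tasks (not the actual CHASE pipeline). A generator produces one
# small component at a time, and a separate verifier call checks it
# before it is accepted.

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for whatever model API you use."""
    raise NotImplementedError("plug in your model client here")


def generate_verified_component(spec: str, max_attempts: int = 3) -> str:
    """Generate one component and keep it only if an independent check passes."""
    for _ in range(max_attempts):
        # Sub-task: generate a single, simple component from the specification.
        component = call_llm(f"Write one self-contained component for: {spec}")

        # Independent sub-task: verify the component against the same specification.
        verdict = call_llm(
            "Answer YES or NO. Does the following component correctly satisfy "
            f"the specification?\nSpecification: {spec}\nComponent: {component}"
        )
        if verdict.strip().upper().startswith("YES"):
            return component
    raise RuntimeError("no candidate component passed verification")

# A harder problem is then assembled bottom-up from several verified
# components, as in the composition sketch near the abstract above.
```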

Results:

  • SOTA LLMs achieve 40-60% accuracy
  • CHASE distinguishes between models well (in contrast to the similar performances seen on standard benchmarks like GSM8k)
  • While today's LLMs have 128k-1M context sizes, CHASE shows they struggle to reason even at a ~50k context size

Note: Our work is a preliminary exploration of automatically generating high-quality, challenging benchmarks for LLMs. We discuss concrete limitations and the large scope for future work in the paper.

Links:
Data: tinyurl.com/chase-data
Code: https://github.com/McGill-NLP/CHASE

