HardTests: Synthesizing High-Quality Test Cases for LLM Coding
Abstract
HARDTESTGEN is a test-synthesis pipeline used to curate HARDTESTS, a large, high-quality competitive programming dataset that improves the precision and recall of verifiers evaluating LLM-generated code.
Verifiers play a crucial role in large language model (LLM) reasoning and are required by post-training techniques such as reinforcement learning. However, reliable verifiers are hard to obtain for difficult coding problems, because a well-disguised wrong solution may only be caught by carefully constructed, human-written edge cases that are difficult to synthesize. To address this issue, we propose HARDTESTGEN, a pipeline for high-quality test synthesis using LLMs. With this pipeline, we curate HARDTESTS, a comprehensive competitive programming dataset with 47k problems and synthetic high-quality tests. Compared with existing tests, HARDTESTGEN tests demonstrate precision that is 11.3 percentage points higher and recall that is 17.5 percentage points higher when evaluating LLM-generated code. For harder problems, the improvement in precision can be as large as 40 percentage points. HARDTESTS also proves more effective for model training, as measured by downstream code generation performance. We will open-source our dataset and synthesis pipeline at https://leililab.github.io/HardTests/.
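For context, verifier precision and recall can be read as standard binary-classification metrics over the verifier's accept/reject verdicts on LLM-generated solutions. The sketch below is purely illustrative (it is not the paper's evaluation code, and the data structure is an assumption): it treats each solution as a pair of (verifier accepts, solution is truly correct) labels.

```python
# Illustrative only: how precision/recall of a test-based verifier can be
# computed from (verifier_accepts, solution_is_correct) pairs. The pair
# representation is a hypothetical choice, not the paper's format.

def verifier_precision_recall(verdicts):
    """verdicts: list of (accepted: bool, correct: bool) pairs.

    Precision: of the solutions the tests accept, how many are truly correct.
    Recall:    of the truly correct solutions, how many the tests accept.
    """
    tp = sum(1 for accepted, correct in verdicts if accepted and correct)
    fp = sum(1 for accepted, correct in verdicts if accepted and not correct)
    fn = sum(1 for accepted, correct in verdicts if not accepted and correct)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Example: weak tests accept one wrong solution (false positive) and an
# exact-match checker rejects one correct solution (false negative).
print(verifier_precision_recall([(True, True), (True, False), (False, True)]))
# -> (0.5, 0.5)
```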
Community
See examples and results at: https://leililab.github.io/HardTests/
RLVR is not just about RL, it's more about VR!
Particularly for LLM coding, good verifiers (tests) are hard to get!
In our latest work, we ask 3 questions: How good are current tests? How do we get better tests? How much does test quality matter?
Current tests are BAD. Some are too weak to break inefficient programs. Others lack special judge functions for checking program outputs and mistake a correct program for a wrong one (see the sketch below). Combined, these flaws create LOTS of false positives and false negatives. So what do we do?
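To make the second failure mode concrete, here is a minimal sketch, not taken from the HARDTESTGEN pipeline and built around a made-up toy problem: when a problem admits multiple valid outputs, an exact-match comparison against one stored reference answer rejects correct programs, while a special judge that checks the output against the problem's constraints accepts them.

```python
# Minimal sketch (not the paper's code): exact output matching vs. a special
# judge for a toy problem where any pair (i, j) with a[i] + a[j] == target is
# a valid answer. Function names and the problem are made up for illustration.

def exact_match_judge(expected_output: str, program_output: str) -> bool:
    # Rejects every valid answer that differs textually from the reference.
    return program_output.strip() == expected_output.strip()

def special_judge(test_input: str, program_output: str) -> bool:
    # Accepts *any* output that satisfies the problem's constraints.
    lines = test_input.split("\n")
    target = int(lines[0])
    a = list(map(int, lines[1].split()))
    try:
        i, j = map(int, program_output.split())
    except ValueError:
        return False
    return 0 <= i < j < len(a) and a[i] + a[j] == target

test_input = "7\n3 4 2 5"   # pairs (0, 1) and (2, 3) both sum to 7
reference  = "0 1"          # the single stored reference answer
candidate  = "2 3"          # a different but equally correct answer

print(exact_match_judge(reference, candidate))  # False -> false negative
print(special_judge(test_input, candidate))     # True  -> correctly accepted
```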
We propose HardTestGen, an LLM-based test synthesis pipeline that produces much better tests than commonly used ones such as TACO's. With it, we curate a problem set of 47k competition problems with good tests. But why should you care?
We run post-training experiments in 3 scenarios -- teacher-distillation, self-distillation, and RL -- to study when good tests matter. It turns out they matter little for teacher-distillation, but a great deal for self-distillation and RL.
Our problem set is now available at https://huggingface.co/datasets/sigcp/hardtests_problems, with the synthesis code and synthetic tests coming soon.
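If you just want to poke at the data, a minimal loading sketch with the Hugging Face datasets library is below. The split and column names are defined by the dataset card, so the snippet inspects the schema rather than assuming specific fields.

```python
# Minimal sketch: load the HARDTESTS problem set from the Hugging Face Hub.
# Requires `pip install datasets`. Split and column names come from the
# dataset card, so we print them instead of assuming specific fields.
from datasets import load_dataset

ds = load_dataset("sigcp/hardtests_problems")
print(ds)                             # available splits and sizes
first_split = next(iter(ds))
print(ds[first_split].column_names)   # inspect the schema before use
```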
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs (2025)
- OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs (2025)
- SWE-Synth: Synthesizing Verifiable Bug-Fix Data to Enable Large Language Models in Resolving Real-World Bugs (2025)
- rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset (2025)
- Insights from Verification: Training a Verilog Generation LLM with Reinforcement Learning with Testbench Feedback (2025)
- VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models (2025)
- Iterative Self-training for Code Generation via Reinforced Re-ranking (2025)