CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models
Abstract
The critique capacity of Large Language Models (LLMs) is essential for their reasoning abilities, providing necessary suggestions (e.g., detailed analysis and constructive feedback). Therefore, how to evaluate the critique capacity of LLMs has drawn great attention, and several critique benchmarks have been proposed. However, existing critique benchmarks usually have the following limitations: (1) they focus on diverse reasoning tasks in general domains and evaluate code tasks insufficiently (e.g., covering only the code generation task), with relatively easy queries (e.g., the code queries of CriticBench are drawn from HumanEval and MBPP); and (2) they lack comprehensive evaluation across different dimensions. To address these limitations, we introduce CodeCriticBench, a holistic code critique benchmark for LLMs. Specifically, CodeCriticBench includes two mainstream code tasks (i.e., code generation and code QA) at different difficulty levels. Besides, the evaluation protocols include basic critique evaluation and advanced critique evaluation for different characteristics, where fine-grained evaluation checklists are carefully designed for the advanced setting. Finally, we conduct extensive experiments on existing LLMs, which show the effectiveness of CodeCriticBench.
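To make the two evaluation settings concrete, the sketch below shows how a critique model might be scored on a single sample: an exact-match verdict for the basic critique evaluation and an averaged checklist score for the advanced one. The record layout and field names (`question`, `solution`, `is_correct`, `checklist`) are illustrative assumptions, not the released schema.

```python
# Hypothetical CodeCriticBench-style sample; field names are assumptions,
# not the official schema.
sample = {
    "task": "code_generation",             # or "code_qa"
    "question": "Write a function that reverses a linked list.",
    "solution": "def reverse(head): ...",  # candidate code to be critiqued
    "is_correct": False,                   # gold label for basic critique evaluation
    "checklist": [                         # fine-grained items for advanced evaluation
        "Does the critique identify the faulty pointer update?",
        "Does the critique propose a concrete fix?",
    ],
}

def basic_score(model_verdict: bool, gold: bool) -> float:
    """Basic critique evaluation: did the model judge correctness correctly?"""
    return 1.0 if model_verdict == gold else 0.0

def advanced_score(checklist_hits: list[bool]) -> float:
    """Advanced critique evaluation: fraction of checklist items satisfied."""
    return sum(checklist_hits) / len(checklist_hits) if checklist_hits else 0.0

# Example usage with made-up model outputs.
print(basic_score(model_verdict=False, gold=sample["is_correct"]))  # 1.0
print(advanced_score([True, False]))                                # 0.5
```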
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Guided Code Generation with LLMs: A Multi-Agent Framework for Complex Code Tasks (2025)
- CoReQA: Uncovering Potentials of Language Models in Code Repository Question Answering (2025)
- RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques (2025)
- FairCode: Evaluating Social Bias of LLMs in Code Generation (2025)
- COFFE: A Code Efficiency Benchmark for Code Generation (2025)
- Leveraging Metamemory Mechanisms for Enhanced Data-Free Code Generation in LLMs (2025)
- Pragmatic Reasoning improves LLM Code Generation (2025)
The code is here: https://github.com/multimodal-art-projection/CodeCriticBench
It uses this dataset: https://huggingface.co/datasets/m-a-p/CodeCriticBench
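As a quick start, the snippet below pulls the dataset from the Hugging Face Hub with the `datasets` library. It is a minimal sketch that assumes the default configuration loads as-is; check the dataset card for the actual splits and column names.

```python
# Minimal sketch: load CodeCriticBench from the Hugging Face Hub.
# Assumes the default configuration; split and column names may differ,
# so inspect the printed DatasetDict before relying on them.
from datasets import load_dataset

ds = load_dataset("m-a-p/CodeCriticBench")
print(ds)               # available splits and their sizes
first_split = next(iter(ds.values()))
print(first_split[0])   # inspect one sample's fields
```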