Revisit Self-Debugging with Self-Generated Tests for Code Generation
Abstract
Large language models (LLMs) have shown significant advances in code generation, but they still struggle with tasks beyond their basic capabilities. Recently, self-debugging has been proposed to boost code generation performance by leveraging execution feedback from tests. Despite its promise, high-quality tests are rarely available in real-world scenarios. In this context, self-debugging with self-generated tests is a promising solution, but its limitations and practical potential have not been fully explored. We therefore investigate its efficacy on diverse programming problems. To deepen our understanding, we propose two distinct paradigms for the process: post-execution and in-execution self-debugging. Within the scope of self-contained Python programming tasks, we find that post-execution self-debugging struggles on basic problems but shows potential for improvement on competitive ones, owing to the bias introduced by self-generated tests. In-execution self-debugging, by contrast, mitigates this bias by relying solely on intermediate states observed during execution, thereby enhancing code generation.
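The following Python sketch illustrates, under stated assumptions, how the two paradigms differ in the feedback they collect. It is not the paper's implementation: the helper `revise_code` stands in for an LLM revision call, the tests and inputs stand in for self-generated ones, and the `INPUT` convention for feeding test inputs is hypothetical. Post-execution feedback here is pass/fail outcomes of tests (which inherit any bias in predicted outputs), while in-execution feedback is a trace of intermediate program states that needs no predicted outputs at all.

```python
# Minimal sketch of post-execution vs. in-execution self-debugging.
# All helper names and conventions are illustrative assumptions.
import sys
import traceback

def run_test(program_src: str, test_src: str) -> tuple[bool, str]:
    """Run one self-generated, assert-style test against a candidate program.
    Returns (passed, feedback) where feedback is the captured error, if any."""
    env: dict = {}
    try:
        exec(program_src, env)   # define the candidate solution
        exec(test_src, env)      # run the assert-style test against it
        return True, ""
    except Exception:
        return False, traceback.format_exc(limit=1)

def post_execution_self_debug(program_src: str, tests: list[str], revise_code) -> str:
    """Post-execution paradigm: judge the program by pass/fail outcomes of
    self-generated tests, then ask the model to revise on failure.
    Biased tests (wrong expected outputs) yield misleading feedback."""
    feedback = []
    for t in tests:
        passed, msg = run_test(program_src, t)
        if not passed:
            feedback.append(f"Test failed:\n{t}\n{msg}")
    if not feedback:
        return program_src  # accepted as-is
    return revise_code(program_src, "\n".join(feedback))

def in_execution_self_debug(program_src: str, test_inputs: list[str], revise_code) -> str:
    """In-execution paradigm: instead of trusting predicted outputs, record the
    intermediate states (line number, local variables) while the program runs
    on test inputs, and hand that trace back to the model for revision."""
    trace: list[str] = []

    def tracer(frame, event, arg):
        # Only record lines executed inside the candidate program itself.
        if event == "line" and frame.f_code.co_filename == "<candidate>":
            trace.append(f"line {frame.f_lineno}: locals={dict(frame.f_locals)}")
        return tracer

    code_obj = compile(program_src, "<candidate>", "exec")
    for inp in test_inputs:
        env = {"INPUT": inp}  # hypothetical convention for passing an input
        sys.settrace(tracer)
        try:
            exec(code_obj, env)
        except Exception:
            trace.append(traceback.format_exc(limit=1))
        finally:
            sys.settrace(None)
    return revise_code(program_src, "\n".join(trace))
```

A usage note, still under the same assumptions: both functions take a `revise_code(program_src, feedback)` callable wrapping the LLM; only the content of `feedback` differs between the two paradigms, which is the contrast the abstract draws.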