arxiv:2506.14245

Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

Published on Jun 17 · Submitted by shun-zheng on Jun 18

Abstract

RLVR advances machine reasoning by incentivizing correct and logical chains of thought, a finding revealed once the limitations of Pass@K are addressed by a more precise evaluation metric, CoT-Pass@K.

AI-generated summary

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for advancing the reasoning capabilities of Large Language Models (LLMs). However, a critical paradox clouds its efficacy: RLVR-tuned models often underperform their base models on the Pass@K metric for solution-finding, leading to the hypothesis that RLVR merely re-weights existing reasoning paths at the cost of reasoning diversity. In this work, we resolve this contradiction by identifying the source of the problem: the Pass@K metric itself is a flawed measure of reasoning, as it credits correct final answers that probably arise from inaccurate or incomplete chains of thought (CoTs). To address this, we introduce a more precise evaluation metric, CoT-Pass@K, which mandates that both the reasoning path and the final answer be correct. We provide a new theoretical foundation that formalizes how RLVR, unlike traditional RL, is uniquely structured to incentivize logical integrity. Our empirical results are supportive: using CoT-Pass@K, we observe that RLVR can incentivize the generalization of correct reasoning for all values of K. Furthermore, by analyzing the training dynamics, we find that this enhanced reasoning capability emerges early in the training process and smoothly generalizes. Our work provides a clear perspective on the role of RLVR, offers a more reliable method for its evaluation, and confirms its potential to genuinely advance machine reasoning.
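
For readers who want to reproduce the comparison, here is a minimal Python sketch. It assumes the commonly used unbiased Pass@K estimator, 1 - C(n-c, k)/C(n, k), and treats CoT-Pass@K as the same estimator with the count restricted to samples whose chain of thought and final answer are both judged correct by a verifier; the paper's exact formulation may differ, and the numbers below are hypothetical.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimator: probability that at least one of k samples
    (drawn without replacement from n) is correct, given that c of the n
    samples have a correct final answer."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def cot_pass_at_k(n: int, c_cot: int, k: int) -> float:
    """CoT-Pass@K sketch: identical estimator, but c_cot counts only samples
    whose final answer AND chain of thought are both verified correct
    (c_cot <= c), so the metric is stricter than Pass@K."""
    return pass_at_k(n, c_cot, k)

# Hypothetical numbers: 64 samples per problem, 40 correct answers,
# of which only 28 also have a verified-correct CoT.
print(f"Pass@8     = {pass_at_k(64, 40, 8):.3f}")
print(f"CoT-Pass@8 = {cot_pass_at_k(64, 28, 8):.3f}")
```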

Community

Paper author · Paper submitter

We present a theoretical framework and empirical evidence demonstrating that reinforcement learning with verifiable rewards (RLVR) implicitly incentivizes correct reasoning in large language models (LLMs). This insight resolves a key debate in the field: whether RLVR-driven improvements extend beyond the inherent capabilities of base LLMs. While prevailing assumptions attribute gains in Pass@1 solely to the original Pass@K performance of pretrained models, our findings reveal that RLVR actively promotes deeper reasoning as training progresses.

Yeah, it would be worthwhile to investigate in hindsight what the Pass@K results of SOTA reasoning models actually contained in their CoTs.

·
Paper author

Post-RLVR or distillation reasoning models generally demonstrate significantly higher probabilities of correct CoT reasoning compared to base models or instruction models.

Regarding SOTA reasoning models, most of their CoTs are actually correct.

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API.

·
Paper author

Thank you for recommending interesting studies containing "verification" concepts, Librarian Bot :)

I have checked these papers and will cite some of them in our revised manuscript, while I also notice that our work addresses some key questions and insights that were left unanswered in these existing studies:

  • RLVR implicitly incentivizes correct reasoning paths.
    • A theoretical understanding of this concept is very important.
  • Verifying the correctness of reasoning, beyond the correctness of the final answer, is very important for math benchmarks.
    • Previous studies mainly considered verification of final results.
  • Why can AceReason-Nemotron demonstrate persistent Pass@K performance gaps between base and post-RLVR models?
    • Because the correct reasoning behaviors have been constantly incentivized during their large-scale RLVR training.

Very interesting paper.
I have personally experimented with this concept through extensive manual testing of base models vs. RLVR models. I observed that on hard tasks/prompts not related to the RL training dataset, the RLVR model can indeed explore new paths to solving the problem and reach a correct solution where the base model couldn't. On simple tasks, the base model can often reach a correct solution more quickly, and sometimes the over-thinking of the RL model diverts it from reaching the correct answer; however, this is rare in my personal observations. And indeed, the RL model can definitely generalize its CoT/reasoning to tasks not seen in training.
I have recently been experimenting with RPT (Reinforcement Pre-Training), specifically on the model you used as a verifier for the reasoning patterns. I am interested in testing whether RPT can also lead to higher scores on your proposed benchmarking framework; but since the RPT model in this case is fine-tuned from the same base verifier you're using, what could be an alternative verifier model?
Have you experimented with various verifier models? And do they lead to similar results, or does some kind of bias form? Are you considering several metrics in the verification, such as completeness, accuracy, divergence, repetitive loops, stuck patterns, and contradictions detected in the reasoning patterns?

·
Paper author

Hi Ykarout, please check my reply in the next frame :)

Paper author · Paper submitter

Hi Ykarout,

Thank you for your personal verification experiments and for posing so many insightful questions.

Yes, it is very challenging to obtain a reliable verifier, especially when your policy model is extremely strong. For example, the ds-distilled model can help find reasoning mistakes of a base model with much weaker reasoning capabilities. But if we use the ds-distilled model as the new policy model, we may need much stronger models, or even human experts, to verify its reasoning. An alternative could be a multi-verification approach: querying the verifier multiple times and aggregating the ensemble results into a final verification conclusion.
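
As a rough illustration of the multi-verification idea above, here is a minimal sketch; `judge_once` is a hypothetical stand-in for a single verifier call (e.g. a wrapper around a verifier LLM with the paper's template), and strict majority voting is just one possible aggregation rule.

```python
import random
from typing import Callable

def ensemble_verify(
    cot: str,
    judge_once: Callable[[str], bool],
    num_queries: int = 5,
) -> bool:
    """Query a (possibly noisy) CoT verifier several times and accept the
    CoT only if a strict majority of the independent judgments say it is correct."""
    votes = [judge_once(cot) for _ in range(num_queries)]
    return sum(votes) * 2 > num_queries

# Stand-in noisy judge for illustration: correct 80% of the time.
def noisy_judge(cot: str) -> bool:
    truly_correct = "error" not in cot
    return truly_correct if random.random() < 0.8 else not truly_correct

print(ensemble_verify("step 1 ... step 2 ... final answer: 42", noisy_judge))
```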

We have mostly experimented with different models in the DeepSeek series, such as R1 (600B+), R1-0528 (600B+), and DeepSeek-R1-0528-Qwen3-8B, and found that the last one can serve as a lightweight yet very powerful math CoT verifier. Choosing a lightweight verifier is very important, as you may need to verify massive numbers of reasoning traces.

Our verification template is included in the appendix. Please feel free to use it or modify it for your own cases. In our scenarios, we mainly consider three types of severe mistakes in reasoning CoTs (see the sketch after this list):

  • Calculation Error: incorrect calculation steps that nevertheless lead to the correct answer
  • Conceptual Error: misunderstanding or misusing mathematical concepts (definitions)
  • Omission / Incompleteness: missing critical steps yet directly guessing the right answer
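
As a rough illustration only, the sketch below structures a verifier call around these three error types. The prompt wording and the `call_verifier` client are hypothetical placeholders; the authoritative template is the one in the paper's appendix.

```python
import json
from typing import Callable

ERROR_TYPES = ["calculation_error", "conceptual_error", "omission"]

# Hypothetical prompt; the real template is in the paper's appendix.
VERIFY_PROMPT = """You are a strict math CoT verifier.
Problem: {problem}
Candidate solution (chain of thought + final answer): {cot}
Report whether each of the following severe mistakes occurs:
- calculation_error: incorrect calculation steps
- conceptual_error: misunderstood or misused mathematical concepts/definitions
- omission: critical steps missing, answer effectively guessed
Respond only with JSON: {{"calculation_error": bool, "conceptual_error": bool, "omission": bool}}"""

def verify_cot(problem: str, cot: str, call_verifier: Callable[[str], str]) -> dict:
    """Ask the verifier for the three error flags; the CoT counts as correct
    only if none of the severe error types is present. Assumes the verifier
    replies with the requested JSON object."""
    raw = call_verifier(VERIFY_PROMPT.format(problem=problem, cot=cot))
    flags = json.loads(raw)
    flags["cot_correct"] = not any(flags.get(e, False) for e in ERROR_TYPES)
    return flags
```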
