---
library_name: transformers
tags:
- reward-model
- prm
- generative reward model
- process supervision
- chain-of-thought
- verification
- math reasoning
- code verification
license: apache-2.0
pipeline_tag: text-generation
---

# Model Card for ThinkPRM-7B

ThinkPRM-7B is a generative Process Reward Model (PRM) fine-tuned from R1-Distill-Qwen-7B. It verifies reasoning processes (such as mathematical solutions) step by step, generating an explicit verification chain-of-thought (CoT) that critiques and labels every step. It is designed to be highly data-efficient, requiring significantly less supervision data than traditional discriminative PRMs while achieving strong performance.

An example of the model's verification output is shown in the quickstart section below.


## Model Details

### Model Description

ThinkPRM-7B provides step-level verification scores by generating natural language critiques and correctness judgments for each step in a given solution prefix. It leverages the reasoning capabilities of the underlying Large Reasoning Model (LRM) and enhances them through fine-tuning on a small dataset of only 1K synthetically generated verification CoTs. These synthetic CoTs were produced by prompting QwQ-32B-Preview and filtered against ground-truth step labels from the PRM800K dataset to ensure quality.
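The filtering step amounts to checking the verifier's per-step judgments against the gold step labels from PRM800K. Below is a minimal sketch of that idea, assuming verification CoTs that mark each step with `\boxed{correct}` or `\boxed{incorrect}` as in the example output later in this card; the parsing regex, function names, and exact-match rule are illustrative assumptions, not the paper's released pipeline.

```python
import re

def extract_step_labels(verification_cot: str) -> list[bool]:
    """Parse per-step judgments of the form 'Step N is \\boxed{correct|incorrect}'."""
    labels = re.findall(r"Step \d+ is \\boxed\{(correct|incorrect)\}", verification_cot)
    return [label == "correct" for label in labels]

def keep_cot(verification_cot: str, gold_step_labels: list[bool]) -> bool:
    """Keep a synthetic verification CoT only if every predicted step label
    matches the ground-truth PRM800K label for that step."""
    predicted = extract_step_labels(verification_cot)
    return len(predicted) == len(gold_step_labels) and predicted == gold_step_labels
```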

The model uses a standard language modeling objective, making it interpretable and allowing it to scale process verification compute by generating longer or multiple verification CoTs. It demonstrated superior performance compared to LLM-as-a-judge and discriminative PRM baselines (based on the same R1-Distill-Qwen-7B model but trained on ~100x more labels) on benchmarks including ProcessBench, MATH-500, AIME '24, GPQA-Diamond, and LiveCodeBench.
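As a rough illustration of scaling verification compute, the sketch below samples several verification CoTs per solution with vLLM and majority-votes the final yes/no judgment. The sampling settings and the aggregation rule are assumptions for illustration, not the paper's exact procedure.

```python
from vllm import LLM, SamplingParams

def majority_vote_verification(llm: LLM, verifier_prompt: str, k: int = 8) -> bool:
    """Sample k verification CoTs and majority-vote the final correctness judgment."""
    params = SamplingParams(temperature=0.7, max_tokens=4096, n=k)
    outputs = llm.generate(verifier_prompt, params)
    # Each verification ends with "Is the solution correct? Yes/No" (see the example output below).
    votes = ["Is the solution correct? Yes" in completion.text for completion in outputs[0].outputs]
    return sum(votes) > len(votes) / 2
```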

- **Finetuned from model:** [R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B)

### Model Sources

- **Repository:** [Github](https://github.com/mukhal/thinkprm)
- **Paper:** [Process Reward Models that Think (arXiv:2504.16828)](https://arxiv.org/abs/2504.16828)


### Direct Use

ThinkPRM-7B is intended for verifying the correctness of step-by-step reasoning processes. Primary uses include:
- **Scoring Solutions:** Assigning step-level or overall scores to candidate solutions for ranking in Best-of-N sampling or guiding tree search in reasoning tasks (see the sketch below).
- **Generating Verification Rationales/CoTs:** Producing detailed chain-of-thought verifications that explain *why* a particular step is correct or incorrect, aiding interpretability.
- **Standalone Verification:** Evaluating the correctness of a given problem-solution pair.

The model has been evaluated on mathematical reasoning (MATH, AIME), scientific QA (GPQA), and code generation (LiveCodeBench). See our paper for more details.
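For the Best-of-N use case above, one simple (hypothetical) recipe is to build a verification prompt for each candidate solution, run the verifier once over each, and rank candidates by the fraction of steps judged correct. The scoring rule and `\boxed{...}` parsing below are illustrative assumptions, not the paper's exact aggregation.

```python
import re
from vllm import LLM, SamplingParams

def step_fraction_score(verification_cot: str) -> float:
    """Fraction of steps the verifier judged correct, parsed from '\\boxed{...}' labels."""
    labels = re.findall(r"Step \d+ is \\boxed\{(correct|incorrect)\}", verification_cot)
    return sum(label == "correct" for label in labels) / len(labels) if labels else 0.0

def best_of_n(llm: LLM, verifier_prompts: list[str]) -> int:
    """Return the index of the candidate whose verification scores highest.

    `verifier_prompts` holds one fully formatted verification prompt per candidate
    solution, built exactly as in the quickstart below.
    """
    params = SamplingParams(temperature=0.0, max_tokens=4096)
    outputs = llm.generate(verifier_prompts, params)
    scores = [step_fraction_score(out.outputs[0].text) for out in outputs]
    return max(range(len(scores)), key=scores.__getitem__)
```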

## Limitations

- **Overconfidence:** Generative PRMs like ThinkPRM can sometimes produce scores clustered near 0 or 1, potentially not reflecting true uncertainty.
- **Step Label Interference:** The autoregressive nature might cause an early incorrect step judgment to negatively bias the evaluation of subsequent steps.
- **Sensitivity to Formatting/Prompting:** Performance might be sensitive to the exact format of the input solution and the prompt used for verification (though fine-tuning likely reduces this compared to LLM-as-a-judge).

## How to Get Started with the Model
```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "launch/ThinkPRM-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(model=model_id, max_model_len=16384)

# Example problem and solution (the second step is deliberately wrong)
problem = "Solve for x: 2x + 3 = 7"
prefix = "Step 1: Subtract 3 from both sides: 2x = 4\nStep 2: Divide by 2: x = 1"

# Format the verification prompt
prompt = f"""You are given a math problem and a proposed step-by-step solution:

[Math Problem]

{problem}

[Solution]

{prefix}

Review and critique each step in the proposed solution to determine whether each step is correct. If the solution is incomplete, only verify the provided steps
"""

# Wrap in the chat template and prefill the start of the verification CoT
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize=False,
    add_generation_prompt=True,
) + "\nLet's verify step by step:"

# Greedy decoding with room for a long verification chain-of-thought
sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=4096,
    stop=None,
)

# Generate the verification
outputs = llm.generate(prompt, sampling_params)
verification_cot = outputs[0].outputs[0].text

print(verification_cot)
"""
Step 1: Subtract 3 from both sides: 2x = 4

Critique: Starting with the equation 2x + 3 = 7, subtracting 3 from both sides is a correct operation to isolate the term with the variable. So, 2x + 3 - 3 = 7 - 3, which simplifies to 2x = 4. This step seems correct.

Step 2: Divide by 2: x = 1

Critique: Now, to solve for x, we need to divide both sides of the equation by 2. So, 2x / 2 = 4 / 2, which simplifies to x = 2. Wait a minute, the solution says x = 1, but according to this calculation, it should be x = 2. This seems incorrect.

Therefore, the first step is correct, but the second step has an error.

**Final Output:**

Let's verify step by step:

Step 1: Subtract 3 from both sides: 2x = 4

Critique: This step is correct. Subtracting 3 from both sides of the equation 2x + 3 = 7 properly isolates the term with the variable, resulting in 2x = 4.

Step 1 is \boxed{correct}

Step 2: Divide by 2: x = 1

Critique: This step is incorrect. Dividing both sides of the equation 2x = 4 by 2 should yield x = 2, not x = 1.

Step 2 is \boxed{incorrect}
</think>
Is the solution correct? No
"""