---
license: mit
datasets:
- peiyi9979/Math-Shepherd
language:
- en
base_model:
- deepseek-ai/deepseek-math-7b-base
pipeline_tag: reinforcement-learning
---
## Introduction
<div align="center">
<img src="PQM.png" width="822px">
</div>

We present a new framework for process reward modeling (PRM) by framing it as a $Q$-value ranking problem, providing a theoretical basis for reward modeling that captures the interdependencies among reasoning states.
We also show that prior classification-based PRMs can be cast as a special case under our framework.
We validate its effectiveness through comprehensive experiments and ablation studies across a wide range of sampling policies, LLM backbones, and test sets.

## Checkpoints & Evaluation Data

We upload the sampling corpora of the three policies to the `./eval_data` folder of this repository.

The checkpoints are the `model.safetensors` files under `./zeta-2` and `./zeta-4`, corresponding to the two hyperparameter settings in our main experiments.

You can download them with `huggingface-cli download Windy0822/PQM <filename> --local-dir <local path>`.
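
Alternatively, here is a minimal sketch of fetching the files programmatically with the `huggingface_hub` Python library. The checkpoint path follows the repository layout described above; the eval-data filename and the local directory `./PQM` are placeholders you should replace with your own choices.

```python
from huggingface_hub import hf_hub_download

# Download one of the two checkpoints (zeta-2 or zeta-4) to a local directory.
ckpt_path = hf_hub_download(
    repo_id="Windy0822/PQM",
    filename="zeta-2/model.safetensors",
    local_dir="./PQM",
)
print(ckpt_path)

# Evaluation data lives under ./eval_data; pick the file for the sampling
# policy you need (see the folder listing on the Hub) and substitute it below.
# eval_path = hf_hub_download(
#     repo_id="Windy0822/PQM",
#     filename="eval_data/<filename>",
#     local_dir="./PQM",
# )
```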