---
license: mit
datasets:
- peiyi9979/Math-Shepherd
language:
- en
base_model:
- deepseek-ai/deepseek-math-7b-base
pipeline_tag: reinforcement-learning
---

## Introduction

<div align="center">

<img src="figures/PQM.png" width="822px">

</div>
We present a new framework for process reward modeling (PRM) that frames it as a $Q$-value ranking problem, providing a theoretical basis for reward modeling that captures the inter-dependencies among reasoning states.

We also show that prior classification-based PRMs can be cast as special cases of our framework.

We validate its effectiveness through comprehensive experiments and ablation studies across a wide range of sampling policies, LLM backbones, and test sets.
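For intuition, one simple instance of such a ranking objective is a pairwise comparison loss over per-step $Q$-value estimates; this is an illustrative sketch rather than the exact training objective used for the released checkpoints:

$$
\mathcal{L}(\theta) = -\sum_{(i,j)\,:\; s_i \succ s_j} \log \sigma\!\left( Q_\theta(s_i) - Q_\theta(s_j) \right),
$$

where $s_i \succ s_j$ denotes that step $s_i$ should be ranked above step $s_j$ within a sampled reasoning trajectory, $Q_\theta(\cdot)$ is the model's $Q$-value estimate for a step, and $\sigma$ is the sigmoid function.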
## Checkpoints & Evaluation Data

We upload the sampling corpora of the three policies to the `./eval_data` folder of this repository.
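A minimal sketch for inspecting these files is shown below; the JSON/JSONL format and the recursive layout of `./eval_data` are assumptions, not a specification of the released files:

```python
# Illustrative only: walk ./eval_data and load any JSON/JSONL files found.
# The file format is an assumption; adapt the parsing to the actual corpus layout.
import json
from pathlib import Path

for path in sorted(Path("./eval_data").rglob("*")):
    if path.suffix == ".json":
        with path.open() as f:
            records = json.load(f)
    elif path.suffix == ".jsonl":
        with path.open() as f:
            records = [json.loads(line) for line in f if line.strip()]
    else:
        continue
    # Report how many sampled records each file contains.
    print(f"{path}: {len(records)} records")
```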
The checkpoints are the `model.safetensors` files in `./zeta-2` and `./zeta-4`, corresponding to the two settings of the hyperparameter $\zeta$ in our main experiments.
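Below is a minimal loading sketch using Hugging Face `transformers`; it assumes each checkpoint folder ships a `config.json` alongside `model.safetensors` and that the PRM keeps the base model's causal-LM architecture, so adjust it if the released layout differs:

```python
# Illustrative loading sketch; assumes ./zeta-2 (or ./zeta-4) contains a full
# transformers checkpoint (config.json + model.safetensors) with the base
# model's causal-LM architecture. Adapt if the released layout differs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt_dir = "./zeta-2"  # or "./zeta-4" for the other hyperparameter setting

# The tokenizer follows the base model the PRM was trained from.
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-math-7b-base")

model = AutoModelForCausalLM.from_pretrained(
    ckpt_dir,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # requires `accelerate`; drop for manual device placement
)
model.eval()
```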