---
license: mit
datasets:
- peiyi9979/Math-Shepherd
language:
- en
base_model:
- deepseek-ai/deepseek-math-7b-base
pipeline_tag: reinforcement-learning
---
## Introduction
We present a new framework for process reward models (PRMs) by framing process reward modeling as a $Q$-value ranking problem, providing a theoretical basis for reward modeling that captures the interdependencies among reasoning states.
We also show that prior classification-based PRMs can be cast as special cases under our framework.
We validate the framework's effectiveness through comprehensive experiments and ablation studies across a wide range of sampling policies, LLM backbones, and test sets.
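To make the ranking view concrete, the sketch below implements a generic pairwise ranking loss over per-step $Q$-value estimates in PyTorch. This is a minimal illustration of the general idea, not our exact training objective; the function and argument names (`q_ranking_loss`, `q_values`, `labels`) are hypothetical.

```python
import torch
import torch.nn.functional as F

def q_ranking_loss(q_values: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss over per-step Q-value estimates.

    q_values: (num_steps,) scalar Q-value predicted for each reasoning step.
    labels:   (num_steps,) 1 for steps on a correct trajectory, 0 otherwise.
    Encourages every correct step to outrank every incorrect step.
    """
    pos = q_values[labels == 1]  # Q-values of correct steps
    neg = q_values[labels == 0]  # Q-values of incorrect steps
    if pos.numel() == 0 or neg.numel() == 0:
        return q_values.new_zeros(())
    # Log-sigmoid of all pairwise margins (pos_i - neg_j); minimizing this
    # pushes correct steps above incorrect ones in the Q-value ranking.
    margins = pos.unsqueeze(1) - neg.unsqueeze(0)
    return -F.logsigmoid(margins).mean()

# Example: steps 0-1 lie on a correct trajectory, step 2 does not.
loss = q_ranking_loss(torch.tensor([0.8, 0.5, -0.2]), torch.tensor([1, 1, 0]))
```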
## Checkpoints & Evaluation Data
We upload the sampling corpora of the three policies to the `./eval_data` folder of this repository.
The checkpoints are the `model.safetensors` files in `./zeta-2` and `./zeta-4`, corresponding to the two hyperparameter settings used in our main experiments.
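As a minimal sketch for inspecting a checkpoint with the `safetensors` library: the paths come from this repository, but the assumption that each file is a standard safetensors state dict compatible with the `deepseek-ai/deepseek-math-7b-base` architecture is ours, not confirmed by this card.

```python
from safetensors.torch import load_file

# Load one of the two released checkpoints (assumed to be a plain
# safetensors state dict) and inspect a few of its tensors.
state_dict = load_file("./zeta-2/model.safetensors")
print(f"{len(state_dict)} tensors loaded")
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape), tensor.dtype)
```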