---
license: mit
datasets:
- peiyi9979/Math-Shepherd
language:
- en
base_model:
- deepseek-ai/deepseek-math-7b-base
pipeline_tag: reinforcement-learning
---

## Introduction

<div align="center">

<img src="figures/PQM.png" width="822px">

</div>
We present a new framework for process reward modeling (PRM) that frames it as a $Q$-value ranking problem, providing a theoretical basis for reward modeling that captures the inter-dependencies among reasoning states.

We also show that prior classification-based PRMs can be cast as special cases of our framework.

We validate its effectiveness through comprehensive experiments and ablation studies across a wide range of sampling policies, LLM backbones, and test sets.
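For intuition, one simple instance of such a ranking objective is a pairwise comparison loss over per-step $Q$-value estimates; this is an illustrative sketch rather than the exact training objective used for the released checkpoints:

$$
\mathcal{L}(\theta) = -\sum_{(i,j)\,:\; s_i \succ s_j} \log \sigma\!\left( Q_\theta(s_i) - Q_\theta(s_j) \right),
$$

where $s_i \succ s_j$ denotes that step $s_i$ should be ranked above step $s_j$ within a sampled reasoning trajectory, $Q_\theta(\cdot)$ is the model's $Q$-value estimate for a step, and $\sigma$ is the sigmoid function.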
## Checkpoints & Evaluation Data

We upload the sampling corpora of the three policies to the `./eval_data` folder of this repository.
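A minimal sketch for inspecting these files is shown below; the JSON/JSONL format and the recursive layout of `./eval_data` are assumptions, not a specification of the released files:

```python
# Illustrative only: walk ./eval_data and load any JSON/JSONL files found.
# The file format is an assumption; adapt the parsing to the actual corpus layout.
import json
from pathlib import Path

for path in sorted(Path("./eval_data").rglob("*")):
    if path.suffix == ".json":
        with path.open() as f:
            records = json.load(f)
    elif path.suffix == ".jsonl":
        with path.open() as f:
            records = [json.loads(line) for line in f if line.strip()]
    else:
        continue
    # Report how many sampled records each file contains.
    print(f"{path}: {len(records)} records")
```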
The checkpoints are the `model.safetensors` files in `./zeta-2` and `./zeta-4`, corresponding to the two settings of the hyperparameter $\zeta$ in our main experiments.
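Below is a minimal loading sketch using Hugging Face `transformers`; it assumes each checkpoint folder ships a `config.json` alongside `model.safetensors` and that the PRM keeps the base model's causal-LM architecture, so adjust it if the released layout differs:

```python
# Illustrative loading sketch; assumes ./zeta-2 (or ./zeta-4) contains a full
# transformers checkpoint (config.json + model.safetensors) with the base
# model's causal-LM architecture. Adapt if the released layout differs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt_dir = "./zeta-2"  # or "./zeta-4" for the other hyperparameter setting

# The tokenizer follows the base model the PRM was trained from.
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-math-7b-base")

model = AutoModelForCausalLM.from_pretrained(
    ckpt_dir,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # requires `accelerate`; drop for manual device placement
)
model.eval()
```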