Windy0822 committed on
Commit 92e047b · verified · 1 Parent(s): 4624df0

Create README.md

Files changed (1)
  1. README.md +24 -0
README.md ADDED
@@ -0,0 +1,24 @@
+ ---
+ license: mit
+ datasets:
+ - peiyi9979/Math-Shepherd
+ language:
+ - en
+ base_model:
+ - deepseek-ai/deepseek-math-7b-base
+ pipeline_tag: reinforcement-learning
+ ---
+ ## Introduction
+ <div align="center">
+ <img src="figures/PQM.png" width="822px">
+ </div>
+
+ We present a new framework for process reward modeling (PRM) that frames it as a $Q$-value ranking problem, providing a theoretical basis for reward modeling that captures inter-dependencies among reasoning states.
+ We also show that prior classification-based PRMs can be cast as a special case of our framework.
+ We validate its effectiveness through comprehensive experiments and ablation studies across a wide range of sampling policies, LLM backbones, and test sets.
+
+ ## Checkpoints & Evaluation Data
+
+ The sampling corpora of three policies are uploaded to the `./eval_data` folder of this repository.
+
+ The checkpoints are the `model.safetensors` files in `./zeta-2` and `./zeta-4`, corresponding to the two hyperparameter settings in our main experiments.
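
As a rough sketch of how the checkpoint layout described in the README could be navigated, the Python snippet below builds the expected path to `model.safetensors` for each setting and lists the tensor names stored in a file. The helper names `checkpoint_path` and `list_tensor_names`, the `repo_root` argument, and the use of the `safetensors` package with a PyTorch backend are assumptions for illustration, not part of the repository. The README does not state which model class is needed to instantiate the reward model, so the sketch stops at inspecting the file.

```python
from pathlib import Path

# Folder names are taken from the README; everything else is illustrative.
CHECKPOINT_DIRS = ("zeta-2", "zeta-4")


def checkpoint_path(repo_root, setting):
    """Return the expected path of model.safetensors for one setting (hypothetical helper)."""
    if setting not in CHECKPOINT_DIRS:
        raise ValueError(f"unknown setting {setting!r}; expected one of {CHECKPOINT_DIRS}")
    return Path(repo_root) / setting / "model.safetensors"


def list_tensor_names(path):
    """List the tensor names in a safetensors file without loading the weights."""
    # Imported lazily so checkpoint_path works even if safetensors is not installed.
    # framework="pt" assumes PyTorch is available.
    from safetensors import safe_open

    with safe_open(str(path), framework="pt") as f:
        return list(f.keys())
```

Assuming the repository has been downloaded to a local directory, `list_tensor_names(checkpoint_path(local_dir, "zeta-4"))` would show which weights the checkpoint contains before committing to a full load.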