mrzjy committed on
Commit
87cedaa
·
verified ·
1 Parent(s): 990485d

Update README.md

Files changed (1)
  1. README.md +6 -2
README.md CHANGED
@@ -207,10 +207,14 @@ There are many PRM related papers one can refer to, [A Roadmap to Reproduce o1](
 
  The main difference between a PRM for O1-like models and a PRM for this project, is that there is no reasoning process in this project at all. The process or step is defined directly as each line of the final results (instead of CoT lines).
 
- This difference arises because obtaining step-wise reward signals for reasoning in creative writing is inherently challenging, with frequent ambiguity and subjectivity. Annotators may struggle to determine whether a particular reasoning step in the creative process is good or bad. Unlike math problems, where correctness is well-defined, creative writing allows for **open** valid paths—each leading to a unique outcome, as "all roads lead to Rome."
+ This difference of PRM design choice arises because obtaining step-wise reward signals for reasoning in creative writing is inherently challenging, with frequent ambiguity and subjectivity. Annotators may struggle to determine whether a particular reasoning step in the creative process is good or bad. Unlike math problems, where correctness is well-defined, creative writing allows for **open** valid paths—each leading to a unique outcome, as "all roads lead to Rome."
+
+ On the other hand, however, it's relatively simple to automatically construct negative outlines for an outline PRM training, hence a fast hands-on experience.
+
+ **Note:** There are automatic ways of turning ORM into a PRM (e.g., [Free Process Rewards without Process Labels](https://github.com/PRIME-RL/ImplicitPRM)), but it's beyond our discussion now.
 
  ## 5. Conclusion
 
- This project provides some minimum hands-on experience on PRM on a specific domain, and it's far from being perfect in terms of training data, model design, evaluation as well as insights.
+ This project provides some minimum hands-on experience with PRM in a specific domain. However, it's important to note that it is far from perfect in terms of training data, model design, evaluation, and the insights gained.
 
 