The main difference between a PRM for O1-like models and a PRM for this project is that there is no reasoning process in this project at all. The process or step is defined directly as each line of the final results (instead of CoT lines).

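For a concrete picture, a minimal sketch of this step definition might look like the following, where each non-empty line of a generated outline becomes one scored step; the function name and the toy outline are illustrative assumptions, not code from this repo.

```python
def outline_to_steps(outline_text: str) -> list[str]:
    """Split a generated outline into PRM steps: one step per non-empty line.

    Illustrative assumption of the step definition described above,
    not the exact preprocessing used in this project.
    """
    return [line.strip() for line in outline_text.splitlines() if line.strip()]


outline = (
    "1. Hero finds an old map in the attic\n"
    "2. A rival collector steals the map\n"
    "3. Chase across the city ends at the harbor"
)
steps = outline_to_steps(outline)
# steps[0], steps[1], ... are the units that receive step-wise reward labels
```
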
This difference in PRM design arises because obtaining step-wise reward signals for reasoning in creative writing is inherently challenging, with frequent ambiguity and subjectivity. Annotators may struggle to determine whether a particular reasoning step in the creative process is good or bad. Unlike math problems, where correctness is well-defined, creative writing allows for **open** valid paths—each leading to a unique outcome, as "all roads lead to Rome."

On the other hand, it is relatively simple to construct negative outlines automatically for outline-PRM training, which makes for a fast hands-on experience.

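As a rough illustration of what automatically constructing negative outlines could mean, the sketch below corrupts one line of a reference outline with a line drawn from an unrelated outline and labels every step from that point on as negative; the corruption strategy, function names, and labeling scheme are assumptions for illustration, not the exact pipeline of this project.

```python
import random


def make_negative_outline(
    good_lines: list[str],
    distractor_lines: list[str],
    seed: int = 0,
) -> tuple[list[str], list[int]]:
    """Build one negative PRM sample by corrupting a reference outline.

    Each line is one PRM step. Steps before the corrupted position keep
    label 1 (good); the corrupted step and everything after it get label 0.
    Illustrative scheme only; the repo's actual construction may differ.
    """
    rng = random.Random(seed)
    lines = list(good_lines)
    bad_pos = rng.randrange(len(lines))
    lines[bad_pos] = rng.choice(distractor_lines)  # swap in an off-topic line
    labels = [1] * bad_pos + [0] * (len(lines) - bad_pos)
    return lines, labels


good = [
    "1. Hero finds an old map in the attic",
    "2. A rival collector steals the map",
    "3. Chase across the city ends at the harbor",
]
distractors = ["2. The bake-off judges announce the finalists"]
neg_lines, step_labels = make_negative_outline(good, distractors, seed=42)
```
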
**Note:** There are automatic ways of turning an ORM into a PRM (e.g., [Free Process Rewards without Process Labels](https://github.com/PRIME-RL/ImplicitPRM)), but that is beyond the scope of this discussion.

## 5. Conclusion

This project provides some minimal hands-on experience with PRMs in a specific domain. However, it is far from perfect in terms of training data, model design, evaluation, and the insights gained.