Update README.md
Let's move on nonetheless to see how it actually performs with LLM sampling.
Without delving into further reinforcement learning, can we directly apply PRM with our LLMs? The answer is YES!

Here we experiment with a simplistic variant of MCTS sampling, namely sequential rejection sampling (only one path is fully explored in the end), with the help of the `continue_final_message` feature and a vLLM server. The code snippet is provided as follows:

```python
def direct_proba(x):
    s = sum(x)
    return [e/s for e in x]


async def _guided_generation(sample, sampling_size: int):
    import time
    start_time = time.time()
    outlines = []

    # ...

        )
        return response

    # sequential rejection sampling
    for i in range(sample["n_chapter"]):
        history = "\n".join(outlines)
        if i > 0:
            # ...

        messages = [
            # ...
            {"role": "assistant", "content": history + assistant_prefix}
        ]

        # Perform parallel requests (didn't apply n/best_of parameter to prevent server OOM)
        responses = await asyncio.gather(*[request_single_response(messages) for _ in range(sampling_size)])
        responses_content = [response["content"] for response in responses]

        # sampling based on rewards
        # ...
```