mrzjy/NovelWriting-Outline-PRM-Qwen2.5-0.5B-Reward

@@ -310,6 +310,98 @@ Let's move on nonetheless to see how it actually performs with LLM sampling.
 Without delving into further reinforcement learning, can we directly apply PRM with our LLMs? The answer is YES!
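+Here we experiment with a simplified, MCTS-inspired sampling scheme: at each chapter step we draw several candidates and pick one with probability proportional to its PRM score. It relies on the `continue_final_message` feature to extend a partial assistant message. The code snippet is as follows: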
+```python
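+import asyncio
+import random
+import time
+
+def direct_proba(x):
+    """Normalize raw scores into sampling weights that sum to 1."""
+    s = sum(x)
+    if s == 0:  # all scores are zero: fall back to uniform weights
+        return [1 / len(x)] * len(x)
+    return [e / s for e in x]
+
+async def _guided_generation(sample):
+    # `a_request_vllm_chat` and `model_name` are assumed to be defined elsewhere
+    start_time = time.time()
+    outlines = []
+    mcts = [{"children": [], "scores": [0], "chosen": 0}]
+    prompt = sample["messages"][0]["content"]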
+    async def request_single_response(messages):
+        response = await a_request_vllm_chat(
+            messages, model_name, temperature=0.7,
+            stop="\n",
+            logit_bias={9: -1e4, 353: -1e4, 334: -1e4, 3070: -1e4},  # prevent some unexpected tokens
+            max_tokens=200,
+            extra_body={
+                "continue_final_message": True,
+                "add_generation_prompt": False,
+                "min_tokens": 5
+            },
+        )
+        return response
+    for i in range(sample["n_chapter"]):
+        history = "\n".join(outlines)
+        if i > 0:
+            history += "\n"
+        if sample["lang"] == "zh":
+            assistant_prefix = f"第{i+1}章："
+        else:
+            assistant_prefix = f"Chapter {i+1}:"
+        messages = [
+            {"role": "system", "content": sample["messages"][0]["content"]},
+            {"role": "assistant", "content": history + assistant_prefix}
+        ]
+        # Perform 4 parallel requests (sampling size = 4)
+        responses = await asyncio.gather(*[request_single_response(messages) for _ in range(4)])
+        responses_content = [response["content"] for response in responses]
+        # sampling based on rewards
+        batch_steps = [outlines + [assistant_prefix + res] for res in responses_content]
+        batch_prompt = [prompt] * len(batch_steps)
+        raw_scores = evaluate_reward(batch_prompt, batch_steps)
+        scores = direct_proba(raw_scores)
+        chosen = random.choices(
+            population=[assistant_prefix + res for res in responses_content],
+            weights=scores,
+            k=1
+        )[0]
+        mcts.append(
+            {
+                "children": [assistant_prefix + res for res in responses_content],
+                "scores": scores,
+                "raw_scores": raw_scores,
+                "chosen": chosen
+            }
+        )
+        current_outline = chosen
+        outlines.append(current_outline)
+    return outlines, mcts, time.time() - start_time
+def evaluate_reward(batch_prompt, batch_steps, separator="\n"):
+    """Score each (prompt, steps) pair with the PRM.
+
+    Assumes `pipe` is an already-loaded PRM pipeline (model checkpoint included).
+    """
+    # Join the prompt and every step with the separator, ending with a trailing separator
+    assert len(batch_prompt) == len(batch_steps)
+    batch_text = [separator.join((prompt, *steps)) + separator for prompt, steps in zip(batch_prompt, batch_steps)]
+    # Keep only the last prediction for each text, i.e. the score of the latest step
+    preds = [res[-1] for res in pipe(batch_text)]
+    scores = []
+    for pred in preds:
+        score, pred_entity = pred["score"], pred["entity"]
+        # The returned score is the probability of the predicted class,
+        # so flip it when the predicted class is LABEL_0
+        if pred_entity == "LABEL_0":
+            score = 1 - score
+        scores.append(score)
+    return scores
+```
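The selection step in the snippet above can be tried in isolation, without a vLLM server or a loaded PRM. The following is a minimal sketch under stated assumptions: the candidate strings and scores are made up for illustration, and `normalize` and `pick_step` are hypothetical helper names that mirror `direct_proba` and the `random.choices` call.

```python
import random

def normalize(scores):
    """Turn non-negative reward scores into weights that sum to 1."""
    total = sum(scores)
    if total == 0:
        # Degenerate case: every candidate scored 0, so sample uniformly
        return [1 / len(scores)] * len(scores)
    return [s / total for s in scores]

def pick_step(candidates, raw_scores, rng=random):
    """Sample one candidate with probability proportional to its reward."""
    weights = normalize(raw_scores)
    return rng.choices(population=candidates, weights=weights, k=1)[0]

# Made-up candidates and PRM scores, for illustration only
candidates = ["Chapter 1: A", "Chapter 1: B", "Chapter 1: C", "Chapter 1: D"]
raw_scores = [0.9, 0.6, 0.3, 0.2]

weights = normalize(raw_scores)
chosen = pick_step(candidates, raw_scores, rng=random.Random(0))
```

Sampling in proportion to the reward, rather than always taking the top-scored candidate, keeps some diversity across chapters. Replacing `pick_step` with an argmax over `raw_scores` would give a greedy best-of-N variant instead.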
+- Case
+|Prompt|Outline Generation with Sequential Rejection Sampling|
+|--|--|
+|![sequential_rejection_sampling_zh_prompt.png](image%2Fsequential_rejection_sampling_zh_prompt.png)|![sequential_rejection_sampling_zh.png](image%2Fsequential_rejection_sampling_zh.png)|
 - Test-Time Scaling
 Since this experiment does not aim to achieve O1-like reasoning behavior, the test-time compute here can be defined simply as a function of `rejection_sampling_size`. Increasing the sampling size during inference leads to higher computational cost, but as expected, it also improves performance according to our PRM.
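
To build intuition for why a larger `rejection_sampling_size` tends to raise the PRM score of the selected step, here is a toy simulation. It is only a sketch: the PRM is replaced by a hypothetical stand-in that draws independent random scores, so the numbers say nothing about the real model, and `mean_selected_reward` is a name introduced for this illustration.

```python
import random

def mean_selected_reward(sampling_size, trials=2000, seed=0):
    """Average reward of the candidate picked by reward-proportional sampling.

    Stand-in assumption: each candidate's score is an independent draw
    from [0.01, 1.0], which replaces the real PRM for this toy experiment.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Score `sampling_size` candidates for one generation step
        scores = [0.01 + 0.99 * rng.random() for _ in range(sampling_size)]
        # Pick one candidate with probability proportional to its score
        total += rng.choices(scores, weights=scores, k=1)[0]
    return total / trials

# Compare a few sampling sizes; cost grows linearly with the size
results = {size: mean_selected_reward(size) for size in (1, 2, 4, 8)}
for size, reward in results.items():
    print(f"sampling size {size}: mean selected reward {reward:.3f}")
```

In this toy setting the gain flattens as the size grows, because proportional sampling still picks lower-scored candidates some of the time, while every extra candidate adds one more generation request and one more PRM evaluation per step.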