lewtun HF staff committed on
Commit 3ceb404 · 1 Parent(s): 2f11f71

Add reference to AlphaMath

Files changed (1)
  1. app/src/index.html +1 -1
app/src/index.html CHANGED
@@ -100,7 +100,7 @@
  <ul>
  <li><strong>Best-of-N: </strong>Generate multiple responses per problem and assign scores to each candidate answer, typically using a reward model. Then select the answer with the highest reward (or a weighted variant discussed later). This approach emphasizes answer quality over frequency.</li>
  <li><strong>Beam search: </strong>A systematic search method that explores the solution space, often combined with a <em>process reward model (PRM)</em><d-cite key="prm"></d-cite> to optimise both the sampling and evaluation of intermediate steps in problem-solving. Unlike conventional reward models that produce a single score on the final answer, PRMs provide a <em>sequence </em>of scores, one for each step of the reasoning process. This ability to provide fine-grained feedback makes PRMs a natural fit for search methods with LLMs.</li>
- <li><strong>Diverse verifier tree search (DVTS):</strong> An extension of beam search we developed that splits the initial beams into independent subtrees, which are then expanded greedily using a PRM.<d-footnote>DVTS is similar to <a href="https://huggingface.co/papers/1610.02424">diverse beam search (DBS)</a> with the main difference that beams share a common prefix in DBS and no sampling is used. DVTS is also similar to <a href="https://huggingface.co/papers/2306.09896">code repair trees</a>, although it is not restricted to code generation models and discrete verifiers.</d-footnote> This method improves solution diversity and overall performance, particularly with larger test-time compute budgets.</li>
+ <li><strong>Diverse verifier tree search (DVTS):</strong> An extension of beam search we developed that splits the initial beams into independent subtrees, which are then expanded greedily using a PRM.<d-footnote>DVTS is similar to <a href="https://huggingface.co/papers/1610.02424">diverse beam search (DBS)</a> with the main difference that beams share a common prefix in DBS and no sampling is used. DVTS is also similar to <a href="https://huggingface.co/papers/2306.09896">code repair trees</a>, although it is not restricted to code generation models and discrete verifiers. After the publication of this blog post, we were made aware of <a href="https://huggingface.co/papers/2405.03553">step-level beam search</a>, which is most similar to DVTS and uses a value head to predict the most promising steps instead of a PRM.</d-footnote> This method improves solution diversity and overall performance, particularly with larger test-time compute budgets.</li>
  </ul>
 
  <p id="15a1384e-bcac-803c-bc89-ed15f18eafdc" class="">With an understanding of the key search strategies, let’s move on to how we evaluated them in practice.</p>
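
Note on the Best-of-N item in the hunk above: the selection step it describes can be sketched in a few lines of Python. This is a minimal illustration, not code from this commit or from the blog post. The function names, the `(answer, reward)` pair layout, and the example rewards are all hypothetical, and the "weighted" variant shown here (summing rewards over identical answers) is an assumption about what the text calls "a weighted variant discussed later".

```python
from collections import defaultdict


def best_of_n(candidates):
    """Return the answer of the single highest-reward candidate.

    `candidates` is a list of (answer, reward) pairs; in practice the
    reward would come from a reward model (hypothetical here).
    """
    answer, _ = max(candidates, key=lambda pair: pair[1])
    return answer


def weighted_best_of_n(candidates):
    """Assumed weighted variant: sum rewards across identical answers
    and return the answer with the largest total."""
    totals = defaultdict(float)
    for answer, reward in candidates:
        totals[answer] += reward
    return max(totals, key=totals.get)


# Made-up rewards for four sampled responses to one problem.
candidates = [("42", 0.9), ("41", 0.5), ("41", 0.5), ("40", 0.1)]

print(best_of_n(candidates))           # picks the top-scoring single response
print(weighted_best_of_n(candidates))  # picks the answer with the top summed reward
```

In this toy input the two rules disagree: the plain rule follows the one high-reward response, while the summed rule favours the answer that several moderately scored responses agree on.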