galb-dai commited on
Commit
8ba6224
·
1 Parent(s): 41d25c2
Files changed (1) hide show
  1. src/about.py +5 -5
src/about.py CHANGED
@@ -12,7 +12,7 @@ WHAT_IS_F1_HTML_TOP = f"""
12
 
13
  <p class="text-lg mb-4 f1-p">We believe that existing benchmarks fail to capture the deep reasoning skills required for complex, research-level algorithmic problems. To address this gap, <a href="{PAPER_URL}" target="_blank" rel="noopener noreferrer" class="f1-a">we introduce <strong>FormulaOne</strong></a>.</p>
14
 
15
- <p class="mb-4 f1-p"><strong>FormulaOne</strong> consists of 220 novel dynamic programming problems over graphs. The problems are organised into three categories, ranging from moderate difficulty and all the way up to research-level.</p>
16
 
17
  <!-- Clean, centered "table" using a single grid -->
18
  <div class="f1-grid-wrap" role="region" aria-label="FormulaOne categories">
@@ -50,7 +50,7 @@ WHAT_IS_F1_HTML_BOTTOM_A_BEFORE_TABS = """
50
  <div class="f1-container">
51
  <section>
52
  <p class="mb-4 f1-p">The latter category is incredibly demanding, requiring resolution of many points of uncertainty, and involving an array of reasoning steps, including topological and geometric insight, knowledge of mathematical domains such as extremal graph theory and logic, combinatorial considerations, precise implementation, and more.</p>
53
- <p class="f1-p">Despite <a href="https://epoch.ai/frontiermath" target="_blank" rel="noopener noreferrer" class="f1-a">impressive</a> <a href="https://artificialanalysis.ai/evaluations/gpqa-diamond" target="_blank" rel="noopener noreferrer" class="f1-a">performance</a> on existing benchmarks, presently <strong>no model solves even a single Tier 2 problem</strong>.</p>
54
  </section>
55
 
56
  <section>
@@ -82,14 +82,14 @@ WHAT_IS_F1_HTML_AFTER_VIDEO = """
82
  <li class="f1-li"><strong>Consistency:</strong> The solution must produce the same output for a given graph, regardless of the specific tree decomposition.</li>
83
  <li class="f1-li"><strong>Efficiency:</strong> The solution must be truly <a href="https://en.wikipedia.org/wiki/Parameterized_complexity" target="_blank" rel="noopener noreferrer" class="f1-a">fixed-parameter linear</a>.</li>
84
  </ul>
85
- <p class="mb-4 f1-p">To support research and encourage community contributions, the <code>FormulaOne-Warmup</code> dataset is released as a public resource for training and fine-tuning models. The complete test suite for all 100 Warmup problems is available, alongside a standalone evaluation environment, in our <a href="https://github.com/double-ai/formulaone-dataset/tree/main" target="_blank" rel="noopener noreferrer" class="f1-a">GitHub repository</a>.</p>
86
  <p class="f1-p">To maintain the integrity of the core benchmark, only a minimal subset of tests is released for the Deeper and Deepest Tier problems. Solutions submitted to our benchmark are evaluated against a withheld, comprehensive test suite.</p>
87
  """
88
 
89
  # Evaluation: begins the "Model Accuracy" subsection and the Warmup paragraph, up to (but not including) the Warmup figure.
90
  WHAT_IS_F1_HTML_EVAL_BEFORE_WARMUPFIG = """
91
  <h2 class="f1-h2">Model Accuracy</h2>
92
- <p class="mb-4 f1-p">On the <strong>FormulaOne-Warmup</strong> problems, frontier models perform reasonably well. This confirms they have a foundational capability for these types of algorithmic tasks, in other words, the tasks are squarely in-distribution.</p>
93
  <!-- warmup_performance figure inserted via gr.Image in app.py -->
94
  """
95
 
@@ -101,7 +101,7 @@ WHAT_IS_F1_HTML_AFTER_WARMUPFIG = """
101
 
102
  # Tail after Deeper figure (closes evaluation section + container)
103
  WHAT_IS_F1_HTML_AFTER_TIER1FIG_TAIL = """
104
- <p class="f1-p">This trend culminates in <strong>Tier 2</strong>, where the difficulty is characteristic of exploratory research problems. On this set of 20 problems, no current frontier model solves even a single one. This result starkly illustrates the gap that remains between high performance on existing benchmarks and the deep algorithmic reasoning required for truly complex problems.</p>
105
  </section>
106
  </div>
107
  """
 
12
 
13
  <p class="text-lg mb-4 f1-p">We believe that existing benchmarks fail to capture the deep reasoning skills required for complex, research-level algorithmic problems. To address this gap, <a href="{PAPER_URL}" target="_blank" rel="noopener noreferrer" class="f1-a">we introduce <strong>FormulaOne</strong></a>.</p>
14
 
15
+ <p class="mb-4 f1-p"><strong>FormulaOne</strong> consists of novel dynamic programming problems over graphs. The problems are organised into three categories, ranging from moderate difficulty all the way up to research-level.</p>
16
 
17
  <!-- Clean, centered "table" using a single grid -->
18
  <div class="f1-grid-wrap" role="region" aria-label="FormulaOne categories">
 
50
  <div class="f1-container">
51
  <section>
52
  <p class="mb-4 f1-p">The latter category is incredibly demanding, requiring resolution of many points of uncertainty, and involving an array of reasoning steps, including topological and geometric insight, knowledge of mathematical domains such as extremal graph theory and logic, combinatorial considerations, precise implementation, and more.</p>
53
+ <p class="f1-p">Despite <a href="https://epoch.ai/frontiermath" target="_blank" rel="noopener noreferrer" class="f1-a">impressive</a> <a href="https://artificialanalysis.ai/evaluations/gpqa-diamond" target="_blank" rel="noopener noreferrer" class="f1-a">performance</a> on existing benchmarks, presently <strong>no model solves even a single 'Deepest Tier' problem</strong>.</p>
54
  </section>
55
 
56
  <section>
 
82
  <li class="f1-li"><strong>Consistency:</strong> The solution must produce the same output for a given graph, regardless of the specific tree decomposition.</li>
83
  <li class="f1-li"><strong>Efficiency:</strong> The solution must be truly <a href="https://en.wikipedia.org/wiki/Parameterized_complexity" target="_blank" rel="noopener noreferrer" class="f1-a">fixed-parameter linear</a>.</li>
84
  </ul>
85
+ <p class="mb-4 f1-p">To support research and encourage community contributions, the <code>FormulaOne-Shallow</code> ("warmup") dataset is released as a public resource for training and fine-tuning models. The complete test suite for all 100 'Shallow' problems is available, alongside a standalone evaluation environment, in our <a href="https://github.com/double-ai/formulaone-dataset/tree/main" target="_blank" rel="noopener noreferrer" class="f1-a">GitHub repository</a>.</p>
86
  <p class="f1-p">To maintain the integrity of the core benchmark, only a minimal subset of tests is released for the Deeper and Deepest Tier problems. Solutions submitted to our benchmark are evaluated against a withheld, comprehensive test suite.</p>
87
  """
88
 
89
  # Evaluation: begins the "Model Accuracy" subsection and the Warmup paragraph, up to (but not including) the Warmup figure.
90
  WHAT_IS_F1_HTML_EVAL_BEFORE_WARMUPFIG = """
91
  <h2 class="f1-h2">Model Accuracy</h2>
92
+ <p class="mb-4 f1-p">On the <strong>FormulaOne-Shallow</strong> problems, frontier models perform reasonably well. This confirms they have a foundational capability for these types of algorithmic tasks; in other words, the tasks are squarely in-distribution.</p>
93
  <!-- warmup_performance figure inserted via gr.Image in app.py -->
94
  """
95
 
 
101
 
102
  # Tail after Deeper figure (closes evaluation section + container)
103
  WHAT_IS_F1_HTML_AFTER_TIER1FIG_TAIL = """
104
+ <p class="f1-p">This trend culminates in the <strong>Deepest Tier</strong>, where the difficulty is characteristic of exploratory research problems. On this set of 20 problems, no current frontier model solves even a single one. This result starkly illustrates the gap that remains between high performance on existing benchmarks and the deep algorithmic reasoning required for truly complex problems.</p>
105
  </section>
106
  </div>
107
  """