Upload 2 files

Files changed:
- index.html +11 -11
- style.css +4 -3
index.html CHANGED
@@ -137,16 +137,16 @@
       <div>
         <p>The growing number of code models released by the community necessitates a comprehensive evaluation to
           reliably benchmark their capabilities.
-          Similar to the
-          multiple programming languages:</p>
+          Similar to the <a href="https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard" target="_blank">🤗 Open LLM Leaderboard</a>,
+          we selected two common benchmarks for evaluating Code LLMs on multiple programming languages:</p>
         <ul>
-          <li>HumanEval
-
-
-          <li
-
+          <li><a href="https://huggingface.co/datasets/openai_humaneval" target="_blank">HumanEval</a>
+            - benchmark for measuring functional correctness for synthesizing programs from docstrings.
+            It consists of 164 Python programming problems.</li>
+          <li><a href="https://github.com/YihongDong/CodeGenEvaluation" target="_blank">HumanEval-ET</a>.</li>
+          <li>MGI - In addition to these benchmarks, we also measure the Memorization-Generalization Index.</li>
         </ul>
-        <h3>Benchmark & Prompts</h3>
+        <!-- <h3>Benchmark & Prompts</h3>
         <ul>
           <li>HumanEval-Python reports the pass@1 on HumanEval; the rest is from the MultiPL-E benchmark.</li>
           <li>For all languages, we use the original benchmark prompts for all models except HumanEval-Python,
@@ -159,8 +159,8 @@
         </ul>
         <p>The figure below shows an example of the OctoCoder vs Base HumanEval prompt; you can find the other prompts
           here.</p>
-        </div>
-        <div>
+        </div> -->
+        <!-- <div>
         <p>- An exception to this is the Phind models. They seem to follow the base prompts better than the
           instruction versions.
           Therefore, following the authors' recommendation we use base HumanEval prompts without stripping them of
@@ -189,7 +189,7 @@
         <li>#Languages column represents the number of programming languages included during the pretraining.
           UNK means the number of languages is unknown.</li>
         </ul>
-        </div>
+        </div> -->
       </section>

       <section class="section_submit" id="sec_submit">
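A note for context: the (now commented-out) Benchmark & Prompts block above refers to pass@1 on HumanEval. pass@1 is conventionally reported with the unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021); the diff does not show how this Space computes it, so the Python sketch below is only illustrative, and the function name and sample counts are assumptions rather than code from this repository.

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated per problem
    c: number of those samples that pass the unit tests
    k: the k in pass@k (k=1 for the pass@1 reported on the leaderboard)
    """
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), expanded as a numerically stable product
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Hypothetical example: 20 samples for one problem, 5 of them pass -> pass@1 = 0.25
print(pass_at_k(n=20, c=5, k=1))

Averaging this value over all 164 HumanEval problems gives the benchmark score; with a single greedy sample per problem (n = k = 1) it reduces to the fraction of problems whose completion passes all tests.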
style.css CHANGED
@@ -251,9 +251,10 @@
 .section_about h3 {
   font-size: 18px;
 }
-.section_about
-
-
+.section_about a {
+  color: #386df4;
+  text-decoration-color: #0909f8;
+  text-decoration-style: dashed;
 }
 .section_about div {
   margin-top: 10px;