Upload 2 files
index.html: +54 -77
style.css: +38 -59
index.html
CHANGED
@@ -31,14 +31,14 @@
   </div>

   <div class="section_title__p">
-    <p>
+    <p>
+      Inspired by the <a href="https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard" target="_blank">🤗 Open LLM Leaderboard</a> and
       <a href="https://huggingface.co/spaces/optimum/llm-perf-leaderboard" target="_blank">🤗 Open LLM-Perf Leaderboard ⏱️</a>,
-      we compare performance of base code generation models on
+      we compare the performance of base code generation models on the
       <a href="https://huggingface.co/datasets/openai_humaneval" target="_blank">HumanEval</a> and
-      <a href="https://github.com/YihongDong/CodeGenEvaluation" target="_blank">HumanEval-ET</a>
-      provide
-      We compare both open and closed pre-trained code
-      their trainings.
+      <a href="https://github.com/YihongDong/CodeGenEvaluation" target="_blank">HumanEval-ET</a> benchmarks.
+      We also measure the Memorization-Generalization Index (MGI) and report the results for each model.
+      We compare both open-source and closed-source pre-trained code LLMs that can serve as base models for further training.
     </p>
   </div>
 </section>
@@ -102,7 +102,7 @@
   <p>
   <ul>
     <li>MGI stands for Memorization-Generalization Index, which is derived from Avg. Peak in the original paper. A higher MGI value indicates a greater propensity for a model to engage in memorization as opposed to generalization.</li>
-
+    <li>For more details, check the About section.</li>
   </ul>
   </div>
 </section>
@@ -134,85 +134,62 @@


 <section class="section_about" id="sec_about">
-  <
-  <p>The growing number of code models released by the community necessitates a comprehensive evaluation to
+  <h3>Benchmarking and Prompts</h3>
+  <!-- <p>The growing number of code models released by the community necessitates a comprehensive evaluation to
   reliably benchmark their capabilities.
   Similar to the <a href="https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard" target="_blank">🤗 Open LLM Leaderboard</a>,
-  we selected two common benchmarks for evaluating Code LLMs on multiple programming languages:</p>
-  </
-  <
-  </ul>
-  <h3>Throughput and Memory Usage</h3>
-  <ul>
-    <li>Throughputs and peak memory usage are measured using Optimum-Benchmark which powers Open LLM-Perf
-    Leaderboard. (0 throughput corresponds to OOM).</li>
-  </ul>
-  <h3>Scoring and Rankings</h3>
-  <ul>
-    <li>Average score is the average pass@1 over all languages. For Win Rate, we find model rank for each
-    language and compute num_models - (rank -1), then average this result over all languages.</li>
-  </ul>
-  <h3>Miscellaneous</h3>
-  <ul>
-    <li>#Languages column represents the number of programming languages included during the pretraining.
-    UNK means the number of languages is unknown.</li>
-  </ul>
-  </div> -->
+  we selected two common benchmarks for evaluating Code LLMs on multiple programming languages:</p> -->
+  <ul>
+    <li><a href="https://huggingface.co/datasets/openai_humaneval" target="_blank">HumanEval</a>:
+      Used to measure the functional correctness of programs generated from docstrings. It includes 164 Python programming problems.
+    </li>
+    <li><a href="https://github.com/YihongDong/CodeGenEvaluation" target="_blank">HumanEval-ET</a>:
+      The extended version of the HumanEval benchmark, where each task includes more than 100 test cases.
+    </li>
+  </ul>
+  <p>
+    For all models (except for the Starcoder family), we used the original benchmark prompts from <a href="https://huggingface.co/datasets/openai_humaneval" target="_blank">HumanEval</a> and added a `<bos>` token before the provided prompt.
+    The maximum generation length was set to the length of the original prompt plus 300 tokens.
+  </p>
+  <p>
+    For the Starcoder family models (such as Starcoder2-7b and Starcoder2-15b),
+    we used the official bigcode-evaluation-harness for generation.
+    More details can be found <a href="https://github.com/bigcode-project/bigcode-evaluation-harness/" target="_blank" id="a_here">[here]</a>.
+  </p>
+  <h3>Evaluation Parameters</h3>
+  <p>
+    For all models, we sampled 1 and 50 samples under temperatures of 0 and 0.8, respectively,
+    for the subsequent result calculations. The parameters are set as follows:
+  </p>
+  <ul>
+    <li>top-p=1.0 (default parameter in the transformers library)</li>
+    <li>top-k=50 (default parameter in the transformers library)</li>
+    <li>max_length_generation=len(prompt)+300</li>
+    <li>temperature=0 or temperature=0.8</li>
+    <li>n_samples=50</li>
+  </ul>
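Editor's note: the parameter list above maps fairly directly onto the `transformers` generation API. The following is a minimal sketch under stated assumptions (a placeholder checkpoint, and the tokenizer prepending the `<bos>` token via its default special-token handling); it illustrates the listed settings rather than reproducing the leaderboard's exact harness.

# Minimal sketch (not the exact evaluation harness) of the settings listed above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigcode/starcoderbase-1b"  # placeholder checkpoint, an assumption
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

def generate_samples(prompt: str, temperature: float, n_samples: int):
    # Tokenize the HumanEval prompt; special tokens such as <bos> are added here
    # by the tokenizer (assumed behaviour, depends on the model's tokenizer).
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    prompt_len = inputs["input_ids"].shape[1]
    gen_kwargs = dict(
        top_p=1.0,                      # default in transformers
        top_k=50,                       # default in transformers
        max_length=prompt_len + 300,    # max_length_generation = len(prompt) + 300
        num_return_sequences=n_samples,
        pad_token_id=tokenizer.eos_token_id,
    )
    if temperature > 0:                 # temperature=0 falls back to greedy decoding
        gen_kwargs.update(do_sample=True, temperature=temperature)
    outputs = model.generate(**inputs, **gen_kwargs)
    return [tokenizer.decode(o[prompt_len:], skip_special_tokens=True) for o in outputs]

# 1 greedy sample for pass@1; 50 samples at temperature 0.8 for pass@k and MGI.
greedy_sample = generate_samples("def add(a, b):\n", temperature=0.0, n_samples=1)
sampled = generate_samples("def add(a, b):\n", temperature=0.8, n_samples=50)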
+  <h3>Performance Metrics</h3>
+  <ul>
+    <li>pass@k: Represents the probability that the model successfully solves the test problem at least once out of `k` attempts.</li>
+    <li>MGI: The average peakedness of the edit distance distribution constructed from the model samples.</li>
+  </ul>
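Editor's note: pass@k is commonly computed with the unbiased estimator introduced alongside HumanEval, which turns n generations per task (of which c pass the unit tests) into pass@k. A short reference sketch:

# Unbiased pass@k estimator (from the HumanEval paper): probability that at least
# one of k samples drawn from the n generations solves the task, given c correct.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:          # every size-k subset must contain a correct sample
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 50 samples for one task, 12 of them pass the unit tests.
print(pass_at_k(n=50, c=12, k=1))    # equals c/n = 0.24
print(pass_at_k(n=50, c=12, k=10))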
 </section>

 <section class="section_submit" id="sec_submit">
   <h2>How to submit models/results to the leaderboard?</h2>
   <div>
-    <p>We welcome the community to submit evaluation results of new models.
-    non-verified, the authors are however required to upload their generations in case other members want to
-    <
-    <h3>2- Submitting Results</h3>
-    <p>To submit your results create a Pull Request in the community tab to add them under the folder
-    community_results in this repository:</p>
+    <p>We welcome the community to submit evaluation results of new models.
+    These results will be added as non-verified; the authors are, however, required to upload their generations so that other members can check them.
+    </p>
+    <p>
+    To submit your results, create a <span style="font-weight: bold;">Pull Request</span> in the community tab to add them under the
+    <a href="https://github.com/YihongDong/CDD-TED4LLMs" target="_blank">folder</a> <span class="span_">community_results</span> in the repository:
+    </p>
     <ul>
-      <li>Create a folder called ORG_MODELNAME_USERNAME
-      <li>Put
-      folder in it.</li>
+      <li>Create a folder called <span class="span_">ORG_MODELNAME_USERNAME</span>, for example <span class="span_">meta_CodeLlama_xxx</span>.</li>
+      <li>Put the generation outputs of your model in it.</li>
     </ul>
-    <p>The title of the PR should be [Community Submission] Model: org/model, Username: your_username
-    org and model with those corresponding to the model you evaluated.</p>
+    <p>The title of the PR should be <span class="span_">[Community Submission] Model: org/model, Username: your_username</span>, replacing org and model with those corresponding to the model you evaluated.</p>
   </div>
 </section>

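Editor's note: a minimal sketch of preparing such a submission folder. The generations file name and JSON layout below are assumptions for illustration only, since the submission instructions above do not specify an exact file format.

# Hypothetical layout for a community submission; file name and structure are
# illustrative assumptions, not a required format.
import json
from pathlib import Path

folder = Path("community_results") / "meta_CodeLlama_your_username"  # ORG_MODELNAME_USERNAME
folder.mkdir(parents=True, exist_ok=True)

# One list of generated completions per HumanEval task id (assumed layout).
generations = {"HumanEval/0": ["def has_close_elements(numbers, threshold): ..."]}
(folder / "generations.json").write_text(json.dumps(generations, indent=2))

# PR title: "[Community Submission] Model: meta/CodeLlama, Username: your_username"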
style.css
CHANGED
@@ -53,27 +53,21 @@
   border-bottom: transparent;
   background-color: rgb(255, 255, 255);
   font-size: 18px;
-  /* font-weight: bold; */
   font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
 }
 #btn_evalTable {
   border-top-left-radius: 5px;
-  /* border-bottom-left-radius: 5px; */
   border-right-width: 0;
   color: #000000;
 }
 #btn_plot {
   border-right-width: 0;
-  /* border-top-right-radius: 5px; */
 }
 #btn_about {
+  border-right-width: 0;
 }
 #btn_submit {
-  border-top-right-radius: 5px;
-  /* border-bottom-right-radius: 5px; */
-  border-left-width: 0;
+  border-right-width: 0;
 }
 #btn_more {
   border-top-right-radius: 5px;
@@ -84,6 +78,7 @@
 }
 /* button */

+
 /* evalTable */
 .section_evalTable {
   display: block;
@@ -114,9 +109,6 @@
 .section_evalTable__table td.td_value {
   font-family: 'Courier New', Courier, monospace;
 }
-/* .section_evalTable__table td.td_HumanEval {
-  font-family: Arial, Helvetica, sans-serif;
-} */
 .section_evalTable__table a {
   font-family:Verdana, Geneva, Tahoma, sans-serif;
   text-decoration-color: #0909f8;
@@ -141,10 +133,9 @@
   margin-top: 10px;
   margin-bottom: 0px;
   border-top: transparent;
-  /* border: 1px solid rgb(228, 228, 228); */
 }
 .section_evalTable__notes p {
+  line-height: 1.8em;
   background-color: #ffffff;
 }
 .section_evalTable__notes ul {
@@ -232,75 +223,72 @@
   margin-left: 130px;
   margin-right: 130px;
   padding-top: 10px;
-  padding-bottom:
+  padding-bottom: 10px;
+  padding-left: 20px;
+  padding-right: 20px;
   border: 1px solid rgb(228, 228, 228);
-  /* border-top-right-radius: 5px;
-  border-bottom-left-radius: 5px;
-  border-bottom-right-radius: 5px; */
-}
-.section_about h2 {
-  font-size: 25px;
-  font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
-  /* display: flex;
-  justify-content: center; */
 }
 .section_about ul {
   list-style: circle;
   padding-left: 20px;
 }
-.section_about ul, p
-  /* background-color: rgb(241, 248, 254); */
+.section_about ul, p {
   font-family:'Gill Sans', 'Gill Sans MT', Calibri, 'Trebuchet MS', sans-serif;
-  line-height: 1.
+  line-height: 1.8em;
 }
 .section_about h3 {
+  padding-top: 12px;
+  padding-bottom: 12px;
+  font-size: 17px;
+  font-family: 'Trebuchet MS', 'Lucida Sans Unicode', 'Lucida Grande', 'Lucida Sans', Arial, sans-serif;
 }
 .section_about a {
   color: #386df4;
-  text-decoration-color: #
-  text-decoration-style:
+  text-decoration-color: #386df4;
+  text-decoration-style: solid;
 }
+#a_here {
+  text-decoration: none;
 }
 /* about */

 /* submit */
 .section_submit {
   display: none;
-  /* margin-top: 30px; */
   margin-left: 130px;
   margin-right: 130px;
   padding-top: 10px;
-  padding-bottom:
+  padding-bottom: 10px;
+  padding-left: 20px;
+  padding-right: 20px;
   border: 1px solid rgb(228, 228, 228);
-  /* border-top-right-radius: 5px;
-  border-bottom-left-radius: 5px;
-  border-bottom-right-radius: 5px; */
 }
 .section_submit h2 {
-  font-size:
+  font-size: 22px;
   font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
   display: flex;
   justify-content: center;
+  font-weight: 500;
 }
 .section_submit ul {
   list-style: circle;
   padding-left: 20px;
 }
-.section_submit
+.section_submit p{
+  margin-top: 12px;
+}
+.section_submit ul, p {
   font-family:'Gill Sans', 'Gill Sans MT', Calibri, 'Trebuchet MS', sans-serif;
-  line-height: 1.
+  line-height: 1.8em
 }
-.
+.span_ {
+  background-color: #f3f3f3;
+  font-family:'Courier New', Courier, monospace;
 }
-.section_submit
+.section_submit a {
+  color: #386df4;
+  text-decoration-color: #0909f8;
+  text-decoration: solid;
 }
 /* submit */

@@ -308,30 +296,25 @@
 /* more */
 .section_more {
   display: none;
-  /* margin-top: 30px; */
   margin-left: 130px;
   margin-right: 130px;
   padding-top: 10px;
-  padding-bottom:
+  padding-bottom: 10px;
   padding-left: 20px;
   padding-right: 20px;
   border: 1px solid rgb(228, 228, 228);
-  /* border-top-right-radius: 5px;
-  border-bottom-left-radius: 5px;
-  border-bottom-right-radius: 5px; */
 }
 .section_more h2 {
-  font-size:
+  font-size: 22px;
   font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
-  justify-content: center; */
+  font-weight: 500;
 }
 .section_more p {
   font-family:'Gill Sans', 'Gill Sans MT', Calibri, 'Trebuchet MS', sans-serif;
   margin-top: 10px;
+  margin-bottom: 10px;
 }
 .section_more ul {
-  margin-top: 10px;
   list-style: circle;
   padding-left: 25px;
   font-family:'Lucida Sans', 'Lucida Sans Regular', 'Lucida Grande', 'Lucida Sans Unicode', Geneva, Verdana, sans-serif;
@@ -345,7 +328,3 @@
 }
 /* more */

-
-/* .u {
-  color: #ad64d4
-} */