Update index.html
index.html (+234 −19)
@@ -1,19 +1,234 @@
- <!
- <html>
  (old lines 3–18 were blank)
- </
<!DOCTYPE html>
<html lang="en">

<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Memorization or Generation of Big Code Model Leaderboard</title>
  <link rel="stylesheet" href="style.css">
  <script src="echarts.min.js"></script>
</head>

<body>

  <section class="section_title">
    <h1>
      <span style="color: rgb(223, 194, 25);">Memorization</span> or
      <span style="color: rgb(223, 194, 25);">Generation</span>
      of Big
      <span style="color: rgb(223, 194, 25);">Code</span>
      Model
      <span style="color: rgb(223, 194, 25);">Leaderboard</span>
    </h1>

    <div class="section_title__imgs">
      <a href="https://github.com/YihongDong/CDD-TED4LLMs" id="a_github" target="_blank">
        <img src="https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white">
      </a>
      <a href="https://arxiv.org/abs/2402.15938" id="a_arxiv" target="_blank">
        <img src="https://img.shields.io/badge/PAPER-ACL'24-ad64d4.svg?style=for-the-badge">
      </a>
    </div>

    <div class="section_title__p">
      <p>Inspired by the <a href="https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard" target="_blank">🤗 Open LLM Leaderboard</a> and the
        <a href="https://huggingface.co/spaces/optimum/llm-perf-leaderboard" target="_blank">🤗 Open LLM-Perf Leaderboard</a>,
        we compare the performance of base code generation models on the
        <a href="https://huggingface.co/datasets/openai_humaneval" target="_blank">HumanEval</a> and
        <a href="https://huggingface.co/datasets/dz1/CodeScore-HumanEval-ET" target="_blank">HumanEval-ET</a> benchmarks. We also measure the Memorization-Generalization Index (MGI) and
        provide information about the models.
        We only compare open pre-trained code models that people can use as base models for
        their own training.
      </p>
    </div>
  </section>

  <section class="section_button">
    <button id="btn_evalTable">Evaluation Table</button>
    <button id="btn_plot">Performance Plot</button>
    <button id="btn_about">About</button>
    <button id="btn_submit">Submit results</button>
  </section>

  <section class="section_evalTable" id="sec_evalTable">
    <div class="section_evalTable__table">
      <table id="evalTable">
        <colgroup>
          <col style="width: 8%">
          <col style="width: 22%">
          <col style="width: 22%">
          <col style="width: 12%">
          <col style="width: 12%">
          <col style="width: 12%">
          <col style="width: 12%">
        </colgroup>

        <thead>
          <tr>
            <th rowspan="2">Benchmark</th>
            <th rowspan="2">Model
              <button class="button_sort" data-direction="desc" data-type="name"></button>
            </th>
            <th data-direction="desc" rowspan="2" data-type="MGI">MGI,
              <br/>Memorization-Generalization Index
              <br/>(Ori: Avg. Peak)
              <button class="button_sort" data-direction="desc" data-type="MGI"></button>
            </th>
            <th colspan="2">Pass@1 (temp=0)</th>
            <th colspan="2">Pass@1 (temp=0.8)</th>
          </tr>
          <tr>
            <th>HumanEval
              <button class="button_sort" data-direction="desc" data-type="temp0_HumanEval"></button>
            </th>
            <th>HumanEval-ET
              <button class="button_sort" data-direction="desc" data-type="temp0_HumanEval_ET"></button>
            </th>
            <th>HumanEval
              <button class="button_sort" data-direction="desc" data-type="temp0_8_HumanEval"></button>
            </th>
            <th>HumanEval-ET
              <button class="button_sort" data-direction="desc" data-type="temp0_8_HumanEval_ET"></button>
            </th>
          </tr>
        </thead>

        <tbody>
        </tbody>
      </table>
      <script src="table.js"></script>
    </div>
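The sort buttons above carry data-type and data-direction attributes that table.js presumably reads; table.js itself is not part of this commit. A minimal sketch of such a click handler, with the row-data layout assumed:

// Hypothetical sketch: table.js is not shown in this diff, so the handler
// below is an assumption. It sorts tbody rows by the key in data-type,
// assuming each <tr> mirrors its column values as data-* attributes.
document.querySelectorAll('#evalTable .button_sort').forEach(function (btn) {
  btn.addEventListener('click', function () {
    var key = btn.dataset.type;                            // e.g. "MGI"
    var dir = btn.dataset.direction === 'desc' ? -1 : 1;
    btn.dataset.direction = dir === -1 ? 'asc' : 'desc';   // toggle for next click
    var tbody = document.querySelector('#evalTable tbody');
    var rows = Array.from(tbody.rows);
    rows.sort(function (a, b) {
      var x = parseFloat(a.dataset[key]);
      var y = parseFloat(b.dataset[key]);
      if (!isNaN(x) && !isNaN(y)) return dir * (x - y);    // numeric columns
      return dir * String(a.dataset[key]).localeCompare(String(b.dataset[key]));
    });
    rows.forEach(function (r) { tbody.appendChild(r); });  // re-attach in sorted order
  });
});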

    <div class="section_evalTable__notes">
      <p><strong>Notes</strong></p>
      <ul>
        <li>MGI stands for Memorization-Generalization Index, originally referred to as the Contamination Ratio.</li>
        <li>The scores of instruction-tuned models might be significantly higher on humaneval-python than on other
          languages. We use the instruction format of
          <a href="https://huggingface.co/datasets/openai_humaneval" target="_blank">HumanEval</a> and
          <a href="https://huggingface.co/datasets/dz1/CodeScore-HumanEval-ET" target="_blank">HumanEval-ET</a>.</li>
        <li>For more details, check the About section.</li>
      </ul>
    </div>
  </section>

  <section class="section_plot" id="sec_plot">
    <div style="display: flex;">
      <div class="section_plot__div" id="sec_plot__div1">
        <div class="section_plot__btnGroup" id="sec_plot__btnGroup1">
          <button id="btn_temp0_HumanEval"></button>
          <span id="span_temp0_HumanEval">HumanEval</span>
          <button id="btn_temp0_HumanEval_ET"></button>
          <span id="span_temp0_HumanEval_ET">HumanEval-ET</span>
        </div>
        <div id="sec_plot__chart1" style="width:736.5px; height:600px;"></div>
      </div>

      <div class="section_plot__div" id="sec_plot__div2">
        <div class="section_plot__btnGroup" id="sec_plot__btnGroup2">
          <button id="btn_temp0_8_HumanEval"></button>
          <span id="span_temp0_8_HumanEval">HumanEval</span>
          <button id="btn_temp0_8_HumanEval_ET"></button>
          <span id="span_temp0_8_HumanEval_ET">HumanEval-ET</span>
        </div>
        <div id="sec_plot__chart2" style="width:736.5px; height:600px;"></div>
      </div>
    </div>
    <script src="chart.js"></script>
  </section>
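chart.js (also not included in this commit) draws the two performance plots into sec_plot__chart1 and sec_plot__chart2 using the ECharts library loaded in the head. A minimal sketch of that wiring; the title, axis names, and data points are all placeholders:

// Hypothetical sketch; chart.js is not part of this diff.
// "echarts" is the global exposed by echarts.min.js.
var chart1 = echarts.init(document.getElementById('sec_plot__chart1'));
chart1.setOption({
  title: { text: 'Pass@1 (temp=0) vs. MGI' },   // assumed title
  xAxis: { type: 'value', name: 'MGI' },
  yAxis: { type: 'value', name: 'Pass@1 (%)' },
  series: [{
    type: 'scatter',
    data: [[0.12, 33.6], [0.25, 48.2]]          // placeholder points, not real results
  }]
});
// A second init/setOption pair would target 'sec_plot__chart2' for temp=0.8.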


  <section class="section_about" id="sec_about">
    <h2>Context</h2>
    <div>
      <p>The growing number of code models released by the community necessitates a comprehensive evaluation to
        reliably benchmark their capabilities.
        Similar to the 🤗 Open LLM Leaderboard, we selected two common benchmarks for evaluating Code LLMs on
        multiple programming languages:</p>
      <ul>
        <li>HumanEval - a benchmark for measuring the functional correctness of programs synthesized from
          docstrings. It consists of 164 Python programming problems.</li>
        <li>MultiPL-E - a translation of HumanEval into 18 programming languages.</li>
        <li>Throughput measurement - in addition to these benchmarks, we also measure model throughput at
          batch sizes of 1 and 50 to compare inference speed.</li>
      </ul>
      <h3>Benchmark & Prompts</h3>
      <ul>
        <li>HumanEval-Python reports the pass@1 on HumanEval; the rest comes from the MultiPL-E benchmark.</li>
        <li>For all languages, we use the original benchmark prompts for all models except HumanEval-Python,
          where we separate base models from instruction models.
          We use the original code completion prompts of HumanEval for all base models; for instruction
          models, we use the instruction version of HumanEval from HumanEvalSynthesize, delimited by the tokens/text
          recommended by the authors of each model
          (we also use a max generation length of 2048 instead of 512).</li>
      </ul>
      <p>The figure below shows an example of the OctoCoder vs. base HumanEval prompt; you can find the other prompts
        here.</p>
    </div>
    <div>
      <p>- An exception to this is the Phind models. They seem to follow base prompts better than the
        instruction versions.
        Therefore, following the authors' recommendation, we use base HumanEval prompts without stripping them of
        the last newline.
        - Also note that for WizardCoder-Python-34B-V1.0 & WizardCoder-Python-13B-V1.0 (CodeLLaMa based),
        we use the HumanEval-Python instruction prompt that the original authors used, with their postprocessing
        (instead of HumanEvalSynthesize);
        the code is available <a href="https://github.com/bigcode-project/bigcode-evaluation-harness/pull/133" target="_blank">here</a>.</p>
      <h3>Evaluation Parameters</h3>
      <ul>
        <li>All models were evaluated with the bigcode-evaluation-harness with top-p=0.95, temperature=0.2,
          max_length_generation=512, and n_samples=50; pass@1 is then estimated from these samples (see the sketch below).</li>
      </ul>
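For background on how pass@1 is estimated from n_samples=50: this is the standard unbiased pass@k estimator from the Codex paper, shown here as a sketch, not code from this repository:

// pass@k = 1 - C(n-c, k) / C(n, k), computed stably as a running product.
// n = samples generated per problem, c = samples that pass the tests.
function passAtK(n, c, k) {
  if (n - c < k) return 1.0;  // every size-k draw contains a passing sample
  let result = 1.0;
  for (let i = n - c + 1; i <= n; i++) result *= 1 - k / i;
  return 1 - result;
}
// Example: 50 samples, 12 passing -> pass@1 = 12/50
console.log(passAtK(50, 12, 1));  // 0.24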
      <h3>Throughput and Memory Usage</h3>
      <ul>
        <li>Throughput and peak memory usage are measured with Optimum-Benchmark, which powers the Open LLM-Perf
          Leaderboard (a throughput of 0 corresponds to OOM).</li>
      </ul>
      <h3>Scoring and Rankings</h3>
      <ul>
        <li>The average score is the average pass@1 over all languages. For Win Rate, we find each model's rank for
          each language, compute num_models - (rank - 1), then average this result over all languages (see the
          sketch below).</li>
      </ul>
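The Win Rate rule above, written out as a small sketch; the shape of the scores object is an assumption:

// Hypothetical sketch of the Win Rate described above.
// scores: { model: { language: pass1, ... }, ... } -- shape assumed.
function winRates(scores) {
  const models = Object.keys(scores);
  const languages = Object.keys(scores[models[0]]);
  const totals = Object.fromEntries(models.map(m => [m, 0]));
  for (const lang of languages) {
    // Rank models on this language: best score gets rank 1.
    const ranked = [...models].sort((a, b) => scores[b][lang] - scores[a][lang]);
    ranked.forEach((m, i) => {
      totals[m] += models.length - ((i + 1) - 1);  // num_models - (rank - 1)
    });
  }
  for (const m of models) totals[m] /= languages.length;  // average over languages
  return totals;
}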
      <h3>Miscellaneous</h3>
      <ul>
        <li>The #Languages column represents the number of programming languages included during pretraining;
          UNK means the number of languages is unknown.</li>
      </ul>
    </div>
  </section>

  <section class="section_submit" id="sec_submit">
    <h2>How to submit models/results to the leaderboard?</h2>
    <div>
      <p>We welcome the community to submit evaluation results of new models. These results will be added as
        non-verified; the authors are, however, required to upload their generations in case other members want to
        check them.</p>
      <h3>1 - Running Evaluation</h3>
      <p>We wrote a detailed guide for running the evaluation on your model. You can find it in
        bigcode-evaluation-harness/leaderboard. This will generate a JSON file summarizing the results, in
        addition to the raw generations and metric files.</p>
      <h3>2 - Submitting Results</h3>
      <p>To submit your results, create a Pull Request in the community tab to add them under the
        community_results folder in this repository:</p>
      <ul>
        <li>Create a folder called ORG_MODELNAME_USERNAME, for example bigcode_starcoder_loubnabnl.</li>
        <li>Put your JSON file with the grouped scores from the guide, along with the generations and metrics
          folders, in it.</li>
      </ul>
      <p>The title of the PR should be [Community Submission] Model: org/model, Username: your_username, replacing
        org and model with those corresponding to the model you evaluated.</p>
    </div>
  </section>


  <footer>
  </footer>

  <script src="button.js"></script>
</body>

</html>
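button.js, referenced just before the closing body tag, is not included in this commit either; the four top buttons presumably switch which section is visible. A minimal sketch of that behavior, with the display logic assumed:

// Hypothetical sketch: button.js is not in this diff. Shows one section at a
// time using the button/section IDs defined in the markup above.
var pairs = [
  ['btn_evalTable', 'sec_evalTable'],
  ['btn_plot', 'sec_plot'],
  ['btn_about', 'sec_about'],
  ['btn_submit', 'sec_submit']
];
pairs.forEach(function (pair) {
  document.getElementById(pair[0]).addEventListener('click', function () {
    pairs.forEach(function (other) {
      document.getElementById(other[1]).style.display =
        other[1] === pair[1] ? 'block' : 'none';  // assumed display toggling
    });
  });
});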