|
<!DOCTYPE html> |
|
<html lang="en"> |
|
|
|
<head> |
|
<meta charset="UTF-8"> |
|
<meta name="viewport" content="width=device-width, initial-scale=1.0"> |
|
<title>Memorization or Generation of Big Code Model Leaderboard</title> |
|
<link rel="stylesheet" href="style.css"> |
|
<script src="echarts.min.js"></script> |
|
</head> |
|
|
|
<body> |
|
|
|
<section class="section_title"> |
|
<h1> |
|
<span style="color: rgb(223, 194, 25);">Memorization</span> or
|
<span style="color: rgb(223, 194, 25);">Generation</span> |
|
of Big |
|
<span style="color: rgb(223, 194, 25);">Code</span> |
|
Model |
|
<span style="color: rgb(223, 194, 25);">Leaderboard</span> |
|
</h1> |
|
|
|
<div class="section_title__imgs"> |
|
<a href="https://github.com/YihongDong/CDD-TED4LLMs" id="a_github" target="_blank"> |
|
<img src="https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white"> |
|
</a> |
|
<a href="https://arxiv.org/abs/2402.15938" id="a_arxiv" target="_blank"> |
|
<img src="https://img.shields.io/badge/PAPER-ACL'24-ad64d4.svg?style=for-the-badge"> |
|
</a> |
|
</div> |
|
|
|
<div class="section_title__p"> |
|
<p>Inspired by the <a href="https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard" target="_blank">🤗 Open LLM Leaderboard</a> and

the <a href="https://huggingface.co/spaces/optimum/llm-perf-leaderboard" target="_blank">🤗 Open LLM-Perf Leaderboard 🏋️</a>,

we compare the performance of base code generation models on the

<a href="https://huggingface.co/datasets/openai_humaneval" target="_blank">HumanEval</a> and

<a href="https://huggingface.co/datasets/dz1/CodeScore-HumanEval-ET" target="_blank">HumanEval-ET</a> benchmarks. We also measure the Memorization-Generalization Index (MGI) and

provide information about the models.

We only compare open pre-trained code models that people can use as base models for

their own training.
|
</p> |
|
</div> |
|
</section> |
|
|
|
<section class="section_button"> |
|
<button id="btn_evalTable">π Evalution Table</button> |
|
<button id="btn_plot">π Performance Plot</button> |
|
<button id="btn_about">π About</button> |
|
<button id="btn_submit">π Submit results</button> |
|
</section> |
|
|
|
<section class="section_evalTable" id="sec_evalTable"> |
|
<div class="section_evalTable__table"> |
|
<table id="evalTable"> |
|
<colgroup> |
|
<col style="width: 8%"> |
|
<col style="width: 22%"> |
|
<col style="width: 22%"> |
|
<col style="width: 12%"> |
|
<col style="width: 12%"> |
|
<col style="width: 12%"> |
|
<col style="width: 12%"> |
|
</colgroup> |
|
|
|
<thead>

<tr>

<th rowspan="2">Benchmark</th>
|
<th rowspan="2">Model |
|
<button class="button_sort" data-direction="desc" data-type="name"></button> |
|
</th> |
|
<th data-direction="desc" rowspan="2" data-type="MGI">MGI, |
|
<br/>Memorization-Generalization Index |
|
<br/>(Ori: Avg. Peak) |
|
<button class="button_sort" data-direction="desc" data-type="MGI"></button> |
|
</th> |
|
<th colspan="2">Pass@1(temp=0)</th> |
|
<th colspan="2">Pass@1(temp=0.8)</th> |
|
<tr> |
|
<th>HumanEval |
|
<button class="button_sort" data-direction="desc" data-type="temp0_HumanEval"></button> |
|
</th> |
|
<th>HumanEval-ET |
|
<button class="button_sort" data-direction="desc" data-type="temp0_HumanEval_ET"></button> |
|
</th> |
|
<th>HumanEval |
|
<button class="button_sort" data-direction="desc" data-type="temp0_8_HumanEval"></button> |
|
</th> |
|
<th>HumanEval-ET |
|
<button class="button_sort" data-direction="desc" data-type="temp0_8_HumanEval_ET"></button> |
|
</th> |
|
</tr> |
|
</thead> |
|
|
|
<tbody> |
|
|
|
</tbody> |
|
</table> |
|
|
<script src="table.js"></script> |
|
</div> |
|
|
|
<div class="section_evalTable__notes"> |
|
<p><strong>Notes</strong></p>
|
<ul> |
|
<li>MGI stands for the Memorization-Generalization Index, originally referred to as the Contamination Ratio.</li>
|
<li>The scores of instruction-tuned models might be significantly higher on HumanEval-Python than on other

languages.

We use the instruction format of

<a href="https://huggingface.co/datasets/openai_humaneval" target="_blank">HumanEval</a> and

<a href="https://huggingface.co/datasets/dz1/CodeScore-HumanEval-ET" target="_blank">HumanEval-ET</a>.</li>
|
<li>For more details, check the About section.</li>
|
</ul> |
|
</div> |
|
</section> |
|
|
|
<section class="section_plot" id="sec_plot"> |
|
<div style="display: flex;"> |
|
<div class="section_plot__div" id="sec_plot__div1"> |
|
<div class="section_plot__btnGroup" id="sec_plot__btnGroup1"> |
|
<button id="btn_temp0_HumanEval"></button> |
|
<span id="span_temp0_HumanEval">HumanEval</span> |
|
<button id="btn_temp0_HumanEval_ET"></button> |
|
<span id="span_temp0_HumanEval_ET">HumanEval-ET</span> |
|
</div> |
|
<div id="sec_plot__chart1" style="width:736.5px; height:600px;"></div> |
|
</div> |
|
|
|
<div class="section_plot__div" id="sec_plot__div2"> |
|
<div class="section_plot__btnGroup" id="sec_plot__btnGroup2"> |
|
<button id="btn_temp0_8_HumanEval"></button> |
|
<span id="span_temp0_8_HumanEval">HumanEval</span> |
|
<button id="btn_temp0_8_HumanEval_ET"></button> |
|
<span id="span_temp0_8_HumanEval_ET">HumanEval-ET</span> |
|
</div> |
|
<div id="sec_plot__chart2" style="width:736.5px; height:600px;"></div> |
|
</div> |
|
</div> |
|
<script src="chart.js"></script> |
|
</section> |
|
|
|
|
|
<section class="section_about" id="sec_about"> |
|
<h2>Context</h2> |
|
<div> |
|
<p>The growing number of code models released by the community necessitates a comprehensive evaluation to |
|
reliably benchmark their capabilities. |
|
Similar to the 🤗 Open LLM Leaderboard, we selected two common benchmarks for evaluating Code LLMs on

multiple programming languages:</p>
|
<ul> |
|
<li>HumanEval - benchmark for measuring functional correctness for synthesizing programs from |
|
docstrings. It consists of 164 Python programming problems.</li> |
|
<li>MultiPL-E - Translation of HumanEval to 18 programming languages.</li> |
|
<li>Throughput Measurement - In addition to these benchmarks, we also measure model throughput at

batch sizes of 1 and 50 to compare inference speed.</li>
|
</ul> |
|
<h3>Benchmark & Prompts</h3> |
|
<ul> |
|
<li>HumanEval-Python reports the pass@1 on HumanEval; the rest comes from the MultiPL-E benchmark.</li>
|
<li>For all languages, we use the original benchmark prompts for all models except HumanEval-Python,

where we separate base from instruction models.

We use the original code completion prompts of HumanEval for all base models, but for instruction

models,

we use the instruction version of HumanEval in HumanEvalSynthesize, delimited by the tokens/text

recommended by the authors of each model

(we also use a max generation length of 2048 instead of 512).</li>
|
</ul> |
|
<p>The figure below shows an example of the OctoCoder prompt vs. the base HumanEval prompt; you can find the other

prompts here.</p>
|
</div> |
|
<div> |
|
<ul>

<li>An exception to this is the Phind models. They seem to follow the base prompts better than the

instruction versions.

Therefore, following the authors' recommendation, we use base HumanEval prompts without stripping them of

the last newline.</li>

<li>Also note that for WizardCoder-Python-34B-V1.0 &amp; WizardCoder-Python-13B-V1.0 (CodeLLaMa based),

we use the HumanEval-Python instruction prompt that the original authors used with their postprocessing

(instead of HumanEvalSynthesize);

the code is available <a href="https://github.com/bigcode-project/bigcode-evaluation-harness/pull/133" target="_blank">here</a>.</li>

</ul>
|
<h3>Evaluation Parameters</h3>
|
<ul> |
|
<li>All models were evaluated with the bigcode-evaluation-harness using top-p=0.95, temperature=0.2,

max_length_generation=512, and n_samples=50 (see the sketch below for how pass@1 is estimated from these samples).</li>
|
</ul> |
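
<p>For illustration, the sketch below (plain JavaScript, not the evaluation harness code itself; function names are

illustrative) shows the standard unbiased pass@k estimator from which Pass@1 numbers are typically derived: for a

problem with n generated samples of which c pass the tests, pass@k = 1 - C(n-c, k)/C(n, k); for k=1 this reduces to

c/n, and the reported score is the average over all problems.</p>

<pre><code>// Illustrative sketch of the unbiased pass@k estimator (Chen et al., 2021).
// n: samples generated per problem, c: samples that pass the tests.
function passAtK(n, c, k) {
  if (c > n - k) return 1.0;           // every size-k subset contains a passing sample
  let prod = 1.0;
  for (let i = n; i > n - c; i--) {    // 1 - C(n-c, k) / C(n, k), computed as a running product
    prod *= 1 - k / i;
  }
  return 1 - prod;
}

// Reported Pass@1: average of the per-problem estimates, e.g. with n_samples = 50.
function averagePassAtK(perProblem, k) {  // perProblem: array of {n, c}
  const scores = perProblem.map(r => passAtK(r.n, r.c, k));
  return scores.reduce((a, b) => a + b, 0) / scores.length;
}</code></pre>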
|
<h3>Throughput and Memory Usage</h3> |
|
<ul> |
|
<li>Throughputs and peak memory usage are measured using Optimum-Benchmark, which powers the Open LLM-Perf

Leaderboard (a throughput of 0 corresponds to OOM).</li>
|
</ul> |
|
<h3>Scoring and Rankings</h3> |
|
<ul> |
|
<li>The average score is the average pass@1 over all languages. For Win Rate, we find each model's rank for each

language, compute num_models - (rank - 1), and then average this result over all languages (see the sketch below).</li>
|
</ul> |
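
<p>As a minimal sketch of the Win Rate computation described above (illustrative only; the variable and function

names are hypothetical, not those of the actual scoring script):</p>

<pre><code>// Win Rate sketch: per language a model scores numModels - (rank - 1),
// and its Win Rate is the average of these scores over all languages.
// ranks maps each language to the model's 1-based rank for that language.
function winRate(ranks, numModels) {
  const languages = Object.keys(ranks);
  const scores = languages.map(lang => numModels - (ranks[lang] - 1));
  return scores.reduce((a, b) => a + b, 0) / languages.length;
}</code></pre>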
|
<h3>Miscellaneous</h3> |
|
<ul> |
|
<li>The #Languages column gives the number of programming languages included during pretraining.

UNK means the number of languages is unknown.</li>
|
</ul> |
|
</div> |
|
</section> |
|
|
|
<section class="section_submit" id="sec_submit"> |
|
<h2>How to submit models/results to the leaderboard?</h2> |
|
<div> |
|
<p>We welcome the community to submit evaluation results of new models. These results will be added as

non-verified; however, the authors are required to upload their generations in case other members want to

check them.</p>
|
<h3>1 - Running Evaluation</h3> |
|
<p>We wrote a detailed guide for running the evaluation on your model; you can find it in

bigcode-evaluation-harness/leaderboard. This will generate a json file summarizing the results, in

addition to the raw generations and metric files.</p>
|
<h3>2 - Submitting Results</h3>
|
<p>To submit your results, create a Pull Request in the community tab to add them under the

community_results folder in this repository:</p>
|
<ul> |
|
<li>Create a folder called ORG_MODELNAME_USERNAME, for example bigcode_starcoder_loubnabnl.</li>
|
<li>Put your json file with the grouped scores from the guide in it, along with the generations folder and the

metrics folder (see the sketch of the expected layout below).</li>
|
</ul> |
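
<p>For reference, a submission following the steps above would be laid out roughly as follows (the json file

name below is illustrative; use the file produced by the guide):</p>

<pre><code>community_results/
└── bigcode_starcoder_loubnabnl/   (ORG_MODELNAME_USERNAME)
    ├── generations/               raw generations
    ├── metrics/                   metric files
    └── results.json               grouped scores from the guide (name illustrative)</code></pre>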
|
<p>The title of the PR should be "[Community Submission] Model: org/model, Username: your_username"; replace

org and model with those corresponding to the model you evaluated.</p>
|
</div> |
|
</section> |
|
|
|
|
|
|
|
<footer> |
|
</footer> |
|
|
|
<script src="button.js"></script> |
|
</body> |
|
|
|
</html> |