Update index.html
index.html (+234 −19)
@@ -1,19 +1,234 @@
- <!
- <html>
  (old lines 3–18 were blank)
- </
<!DOCTYPE html>
<html lang="en">

<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Memorization or Generation of Big Code Model Leaderboard</title>
  <link rel="stylesheet" href="style.css">
  <script src="echarts.min.js"></script>
</head>

<body>

  <section class="section_title">
    <h1>
      <span style="color: rgb(223, 194, 25);">Memorization</span> or
      <span style="color: rgb(223, 194, 25);">Generation</span>
      of Big
      <span style="color: rgb(223, 194, 25);">Code</span>
      Model
      <span style="color: rgb(223, 194, 25);">Leaderboard</span>
    </h1>

    <div class="section_title__imgs">
      <a href="https://github.com/YihongDong/CDD-TED4LLMs" id="a_github" target="_blank">
        <img src="https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white">
      </a>
      <a href="https://arxiv.org/abs/2402.15938" id="a_arxiv" target="_blank">
        <img src="https://img.shields.io/badge/PAPER-ACL'24-ad64d4.svg?style=for-the-badge">
      </a>
    </div>

    <div class="section_title__p">
      <p>Inspired by the <a href="https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard" target="_blank">🤗 Open LLM Leaderboard</a> and the
        <a href="https://huggingface.co/spaces/optimum/llm-perf-leaderboard" target="_blank">🤗 Open LLM-Perf Leaderboard</a>,
        we compare the performance of base code generation models on the
        <a href="https://huggingface.co/datasets/openai_humaneval" target="_blank">HumanEval</a> and
        <a href="https://huggingface.co/datasets/dz1/CodeScore-HumanEval-ET" target="_blank">HumanEval-ET</a> benchmarks. We also measure the Memorization-Generalization Index (MGI) and
        provide information about the models.
        We only compare open pre-trained code models that people can use as base models for
        their own training.
      </p>
    </div>
  </section>

  <section class="section_button">
    <button id="btn_evalTable">Evaluation Table</button>
    <button id="btn_plot">Performance Plot</button>
    <button id="btn_about">About</button>
    <button id="btn_submit">Submit results</button>
  </section>

  <section class="section_evalTable" id="sec_evalTable">
    <div class="section_evalTable__table">
      <table id="evalTable">
        <colgroup>
          <col style="width: 8%">
          <col style="width: 22%">
          <col style="width: 22%">
          <col style="width: 12%">
          <col style="width: 12%">
          <col style="width: 12%">
          <col style="width: 12%">
        </colgroup>

        <thead>
          <tr>
            <th rowspan="2">Benchmark</th>
            <th rowspan="2">Model
              <button class="button_sort" data-direction="desc" data-type="name"></button>
            </th>
            <th data-direction="desc" rowspan="2" data-type="MGI">MGI,
              <br/>Memorization-Generalization Index
              <br/>(Ori: Avg. Peak)
              <button class="button_sort" data-direction="desc" data-type="MGI"></button>
            </th>
            <th colspan="2">Pass@1 (temp=0)</th>
            <th colspan="2">Pass@1 (temp=0.8)</th>
          </tr>
          <tr>
            <th>HumanEval
              <button class="button_sort" data-direction="desc" data-type="temp0_HumanEval"></button>
            </th>
            <th>HumanEval-ET
              <button class="button_sort" data-direction="desc" data-type="temp0_HumanEval_ET"></button>
            </th>
            <th>HumanEval
              <button class="button_sort" data-direction="desc" data-type="temp0_8_HumanEval"></button>
            </th>
            <th>HumanEval-ET
              <button class="button_sort" data-direction="desc" data-type="temp0_8_HumanEval_ET"></button>
            </th>
          </tr>
        </thead>

        <tbody>
        </tbody>
      </table>
      <script src="table.js"></script>
    </div>
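The sort buttons above carry data-type and data-direction attributes that table.js presumably reads; table.js itself is not part of this commit. A minimal sketch of such a click handler, with the row-data layout assumed:

// Hypothetical sketch: table.js is not shown in this diff, so the handler
// below is an assumption. It sorts tbody rows by the key in data-type,
// assuming each <tr> mirrors its column values as data-* attributes.
document.querySelectorAll('#evalTable .button_sort').forEach(function (btn) {
  btn.addEventListener('click', function () {
    var key = btn.dataset.type;                            // e.g. "MGI"
    var dir = btn.dataset.direction === 'desc' ? -1 : 1;
    btn.dataset.direction = dir === -1 ? 'asc' : 'desc';   // toggle for next click
    var tbody = document.querySelector('#evalTable tbody');
    var rows = Array.from(tbody.rows);
    rows.sort(function (a, b) {
      var x = parseFloat(a.dataset[key]);
      var y = parseFloat(b.dataset[key]);
      if (!isNaN(x) && !isNaN(y)) return dir * (x - y);    // numeric columns
      return dir * String(a.dataset[key]).localeCompare(String(b.dataset[key]));
    });
    rows.forEach(function (r) { tbody.appendChild(r); });  // re-attach in sorted order
  });
});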

    <div class="section_evalTable__notes">
      <p><strong>Notes</strong></p>
      <ul>
        <li>MGI stands for Memorization-Generalization Index, originally referred to as the Contamination Ratio.</li>
        <li>The scores of instruction-tuned models might be significantly higher on humaneval-python than on other
          languages. We use the instruction format of
          <a href="https://huggingface.co/datasets/openai_humaneval" target="_blank">HumanEval</a> and
          <a href="https://huggingface.co/datasets/dz1/CodeScore-HumanEval-ET" target="_blank">HumanEval-ET</a>.</li>
        <li>For more details, check the About section.</li>
      </ul>
    </div>
  </section>

  <section class="section_plot" id="sec_plot">
    <div style="display: flex;">
      <div class="section_plot__div" id="sec_plot__div1">
        <div class="section_plot__btnGroup" id="sec_plot__btnGroup1">
          <button id="btn_temp0_HumanEval"></button>
          <span id="span_temp0_HumanEval">HumanEval</span>
          <button id="btn_temp0_HumanEval_ET"></button>
          <span id="span_temp0_HumanEval_ET">HumanEval-ET</span>
        </div>
        <div id="sec_plot__chart1" style="width:736.5px; height:600px;"></div>
      </div>

      <div class="section_plot__div" id="sec_plot__div2">
        <div class="section_plot__btnGroup" id="sec_plot__btnGroup2">
          <button id="btn_temp0_8_HumanEval"></button>
          <span id="span_temp0_8_HumanEval">HumanEval</span>
          <button id="btn_temp0_8_HumanEval_ET"></button>
          <span id="span_temp0_8_HumanEval_ET">HumanEval-ET</span>
        </div>
        <div id="sec_plot__chart2" style="width:736.5px; height:600px;"></div>
      </div>
    </div>
    <script src="chart.js"></script>
  </section>
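chart.js (also not included in this commit) draws the two performance plots into sec_plot__chart1 and sec_plot__chart2 using the ECharts library loaded in the head. A minimal sketch of that wiring; the title, axis names, and data points are all placeholders:

// Hypothetical sketch; chart.js is not part of this diff.
// "echarts" is the global exposed by echarts.min.js.
var chart1 = echarts.init(document.getElementById('sec_plot__chart1'));
chart1.setOption({
  title: { text: 'Pass@1 (temp=0) vs. MGI' },   // assumed title
  xAxis: { type: 'value', name: 'MGI' },
  yAxis: { type: 'value', name: 'Pass@1 (%)' },
  series: [{
    type: 'scatter',
    data: [[0.12, 33.6], [0.25, 48.2]]          // placeholder points, not real results
  }]
});
// A second init/setOption pair would target 'sec_plot__chart2' for temp=0.8.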


  <section class="section_about" id="sec_about">
    <h2>Context</h2>
    <div>
      <p>The growing number of code models released by the community necessitates a comprehensive evaluation to
        reliably benchmark their capabilities.
        Similar to the 🤗 Open LLM Leaderboard, we selected two common benchmarks for evaluating Code LLMs on
        multiple programming languages:</p>
      <ul>
        <li>HumanEval - a benchmark for measuring the functional correctness of programs synthesized from
          docstrings. It consists of 164 Python programming problems.</li>
        <li>MultiPL-E - a translation of HumanEval into 18 programming languages.</li>
        <li>Throughput measurement - in addition to these benchmarks, we also measure model throughput at
          batch sizes of 1 and 50 to compare inference speed.</li>
      </ul>
      <h3>Benchmark & Prompts</h3>
      <ul>
        <li>HumanEval-Python reports the pass@1 on HumanEval; the rest comes from the MultiPL-E benchmark.</li>
        <li>For all languages, we use the original benchmark prompts for all models except HumanEval-Python,
          where we separate base models from instruction models.
          We use the original code completion prompts of HumanEval for all base models; for instruction
          models, we use the instruction version of HumanEval from HumanEvalSynthesize, delimited by the tokens/text
          recommended by the authors of each model
          (we also use a max generation length of 2048 instead of 512).</li>
      </ul>
      <p>The figure below shows an example of the OctoCoder vs. base HumanEval prompt; you can find the other prompts
        here.</p>
    </div>
    <div>
      <p>- An exception to this is the Phind models. They seem to follow base prompts better than the
        instruction versions.
        Therefore, following the authors' recommendation, we use base HumanEval prompts without stripping them of
        the last newline.
        - Also note that for WizardCoder-Python-34B-V1.0 & WizardCoder-Python-13B-V1.0 (CodeLLaMa based),
        we use the HumanEval-Python instruction prompt that the original authors used, with their postprocessing
        (instead of HumanEvalSynthesize);
        the code is available <a href="https://github.com/bigcode-project/bigcode-evaluation-harness/pull/133" target="_blank">here</a>.</p>
      <h3>Evaluation Parameters</h3>
      <ul>
        <li>All models were evaluated with the bigcode-evaluation-harness with top-p=0.95, temperature=0.2,
          max_length_generation=512, and n_samples=50; pass@1 is then estimated from these samples (see the sketch below).</li>
      </ul>
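For background on how pass@1 is estimated from n_samples=50: this is the standard unbiased pass@k estimator from the Codex paper, shown here as a sketch, not code from this repository:

// pass@k = 1 - C(n-c, k) / C(n, k), computed stably as a running product.
// n = samples generated per problem, c = samples that pass the tests.
function passAtK(n, c, k) {
  if (n - c < k) return 1.0;  // every size-k draw contains a passing sample
  let result = 1.0;
  for (let i = n - c + 1; i <= n; i++) result *= 1 - k / i;
  return 1 - result;
}
// Example: 50 samples, 12 passing -> pass@1 = 12/50
console.log(passAtK(50, 12, 1));  // 0.24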
      <h3>Throughput and Memory Usage</h3>
      <ul>
        <li>Throughput and peak memory usage are measured with Optimum-Benchmark, which powers the Open LLM-Perf
          Leaderboard (a throughput of 0 corresponds to OOM).</li>
      </ul>
      <h3>Scoring and Rankings</h3>
      <ul>
        <li>The average score is the average pass@1 over all languages. For Win Rate, we find each model's rank for
          each language, compute num_models - (rank - 1), then average this result over all languages (see the
          sketch below).</li>
      </ul>
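The Win Rate rule above, written out as a small sketch; the shape of the scores object is an assumption:

// Hypothetical sketch of the Win Rate described above.
// scores: { model: { language: pass1, ... }, ... } -- shape assumed.
function winRates(scores) {
  const models = Object.keys(scores);
  const languages = Object.keys(scores[models[0]]);
  const totals = Object.fromEntries(models.map(m => [m, 0]));
  for (const lang of languages) {
    // Rank models on this language: best score gets rank 1.
    const ranked = [...models].sort((a, b) => scores[b][lang] - scores[a][lang]);
    ranked.forEach((m, i) => {
      totals[m] += models.length - ((i + 1) - 1);  // num_models - (rank - 1)
    });
  }
  for (const m of models) totals[m] /= languages.length;  // average over languages
  return totals;
}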
      <h3>Miscellaneous</h3>
      <ul>
        <li>The #Languages column represents the number of programming languages included during pretraining;
          UNK means the number of languages is unknown.</li>
      </ul>
    </div>
  </section>

  <section class="section_submit" id="sec_submit">
    <h2>How to submit models/results to the leaderboard?</h2>
    <div>
      <p>We welcome the community to submit evaluation results of new models. These results will be added as
        non-verified; the authors are, however, required to upload their generations in case other members want to
        check them.</p>
      <h3>1 - Running Evaluation</h3>
      <p>We wrote a detailed guide for running the evaluation on your model. You can find it in
        bigcode-evaluation-harness/leaderboard. This will generate a JSON file summarizing the results, in
        addition to the raw generations and metric files.</p>
      <h3>2 - Submitting Results</h3>
      <p>To submit your results, create a Pull Request in the community tab to add them under the
        community_results folder in this repository:</p>
      <ul>
        <li>Create a folder called ORG_MODELNAME_USERNAME, for example bigcode_starcoder_loubnabnl.</li>
        <li>Put your JSON file with the grouped scores from the guide, along with the generations and metrics
          folders, in it.</li>
      </ul>
      <p>The title of the PR should be [Community Submission] Model: org/model, Username: your_username, replacing
        org and model with those corresponding to the model you evaluated.</p>
    </div>
  </section>


  <footer>
  </footer>

  <script src="button.js"></script>
</body>

</html>
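button.js, referenced just before the closing body tag, is not included in this commit either; the four top buttons presumably switch which section is visible. A minimal sketch of that behavior, with the display logic assumed:

// Hypothetical sketch: button.js is not in this diff. Shows one section at a
// time using the button/section IDs defined in the markup above.
var pairs = [
  ['btn_evalTable', 'sec_evalTable'],
  ['btn_plot', 'sec_plot'],
  ['btn_about', 'sec_about'],
  ['btn_submit', 'sec_submit']
];
pairs.forEach(function (pair) {
  document.getElementById(pair[0]).addEventListener('click', function () {
    pairs.forEach(function (other) {
      document.getElementById(other[1]).style.display =
        other[1] === pair[1] ? 'block' : 'none';  // assumed display toggling
    });
  });
});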