Upload 2 files
index.html: +54 -77
style.css: +38 -59
index.html
CHANGED
@@ -31,14 +31,14 @@
   </div>

   <div class="section_title__p">
-    <p>
+    <p>
+      Inspired by the <a href="https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard" target="_blank">🤗 Open LLM Leaderboard</a> and
       <a href="https://huggingface.co/spaces/optimum/llm-perf-leaderboard" target="_blank">🤗 Open LLM-Perf Leaderboard ⏱️</a>,
-      we compare performance of base code generation models on
+      we compare the performance of base code generation models on the
       <a href="https://huggingface.co/datasets/openai_humaneval" target="_blank">HumanEval</a> and
-      <a href="https://github.com/YihongDong/CodeGenEvaluation" target="_blank">HumanEval-ET</a>
-      provide
-      We compare both open and closed pre-trained code
-      their trainings.
+      <a href="https://github.com/YihongDong/CodeGenEvaluation" target="_blank">HumanEval-ET</a> benchmarks.
+      We also measure the Memorization-Generalization Index (MGI) and report the results for each model.
+      We compare both open-source and closed-source pre-trained code LLMs that can serve as base models for further training.
     </p>
   </div>
 </section>
@@ -102,7 +102,7 @@
   <p>
   <ul>
     <li>MGI stands for Memorization-Generalization Index, which is derived from Avg. Peak in the original paper. A higher MGI value indicates a greater propensity for a model to engage in memorization as opposed to generalization.</li>
-
+    <li>For more details, check the About section.</li>
   </ul>
   </div>
 </section>
@@ -134,85 +134,62 @@


 <section class="section_about" id="sec_about">
-  <
-  <p>The growing number of code models released by the community necessitates a comprehensive evaluation to
+  <h3>Benchmarking and Prompts</h3>
+  <!-- <p>The growing number of code models released by the community necessitates a comprehensive evaluation to
   reliably benchmark their capabilities.
   Similar to the <a href="https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard" target="_blank">🤗 Open LLM Leaderboard</a>,
-  we selected two common benchmarks for evaluating Code LLMs on multiple programming languages:</p>
-  </
-  <
-  </ul>
-  <h3>Throughput and Memory Usage</h3>
-  <ul>
-    <li>Throughputs and peak memory usage are measured using Optimum-Benchmark which powers Open LLM-Perf
-    Leaderboard. (0 throughput corresponds to OOM).</li>
-  </ul>
-  <h3>Scoring and Rankings</h3>
-  <ul>
-    <li>Average score is the average pass@1 over all languages. For Win Rate, we find model rank for each
-    language and compute num_models - (rank -1), then average this result over all languages.</li>
-  </ul>
-  <h3>Miscellaneous</h3>
-  <ul>
-    <li>#Languages column represents the number of programming languages included during the pretraining.
-    UNK means the number of languages is unknown.</li>
-  </ul>
-  </div> -->
+  we selected two common benchmarks for evaluating Code LLMs on multiple programming languages:</p> -->
+  <ul>
+    <li><a href="https://huggingface.co/datasets/openai_humaneval" target="_blank">HumanEval</a>:
+      Used to measure the functional correctness of programs generated from docstrings. It includes 164 Python programming problems.
+    </li>
+    <li><a href="https://github.com/YihongDong/CodeGenEvaluation" target="_blank">HumanEval-ET</a>:
+      The extended version of the HumanEval benchmark, where each task includes more than 100 test cases.
+    </li>
+  </ul>
+  <p>
+    For all models (except for the Starcoder family), we used the original benchmark prompts from <a href="https://huggingface.co/datasets/openai_humaneval" target="_blank">HumanEval</a> and added a `<bos>` token before the provided prompt.
+    The maximum generation length was set to the length of the original prompt plus 300 tokens.
+  </p>
+  <p>
+    For the Starcoder family models (such as Starcoder2-7b and Starcoder2-15b),
+    we used the official bigcode-evaluation-harness for generation.
+    More details can be found <a href="https://github.com/bigcode-project/bigcode-evaluation-harness/" target="_blank" id="a_here">[here]</a>.
+  </p>
+  <h3>Evaluation Parameters</h3>
+  <p>
+    For all models, we sampled 1 and 50 samples under temperatures of 0 and 0.8, respectively,
+    for the subsequent result calculations. The parameters are set as follows:
+  </p>
+  <ul>
+    <li>top-p=1.0 (default parameter in the transformers library)</li>
+    <li>top-k=50 (default parameter in the transformers library)</li>
+    <li>max_length_generation=len(prompt)+300</li>
+    <li>temperature=0 or temperature=0.8</li>
+    <li>n_samples=50</li>
+  </ul>
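Editor's note: the parameter list above maps fairly directly onto the `transformers` generation API. The following is a minimal sketch under stated assumptions (a placeholder checkpoint, and the tokenizer prepending the `<bos>` token via its default special-token handling); it illustrates the listed settings rather than reproducing the leaderboard's exact harness.

# Minimal sketch (not the exact evaluation harness) of the settings listed above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigcode/starcoderbase-1b"  # placeholder checkpoint, an assumption
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

def generate_samples(prompt: str, temperature: float, n_samples: int):
    # Tokenize the HumanEval prompt; special tokens such as <bos> are added here
    # by the tokenizer (assumed behaviour, depends on the model's tokenizer).
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    prompt_len = inputs["input_ids"].shape[1]
    gen_kwargs = dict(
        top_p=1.0,                      # default in transformers
        top_k=50,                       # default in transformers
        max_length=prompt_len + 300,    # max_length_generation = len(prompt) + 300
        num_return_sequences=n_samples,
        pad_token_id=tokenizer.eos_token_id,
    )
    if temperature > 0:                 # temperature=0 falls back to greedy decoding
        gen_kwargs.update(do_sample=True, temperature=temperature)
    outputs = model.generate(**inputs, **gen_kwargs)
    return [tokenizer.decode(o[prompt_len:], skip_special_tokens=True) for o in outputs]

# 1 greedy sample for pass@1; 50 samples at temperature 0.8 for pass@k and MGI.
greedy_sample = generate_samples("def add(a, b):\n", temperature=0.0, n_samples=1)
sampled = generate_samples("def add(a, b):\n", temperature=0.8, n_samples=50)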
+  <h3>Performance Metrics</h3>
+  <ul>
+    <li>pass@k: Represents the probability that the model successfully solves the test problem at least once out of `k` attempts.</li>
+    <li>MGI: The average peakedness of the edit distance distribution constructed from the model samples.</li>
+  </ul>
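Editor's note: pass@k is commonly computed with the unbiased estimator introduced alongside HumanEval, which turns n generations per task (of which c pass the unit tests) into pass@k. A short reference sketch:

# Unbiased pass@k estimator (from the HumanEval paper): probability that at least
# one of k samples drawn from the n generations solves the task, given c correct.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:          # every size-k subset must contain a correct sample
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 50 samples for one task, 12 of them pass the unit tests.
print(pass_at_k(n=50, c=12, k=1))    # equals c/n = 0.24
print(pass_at_k(n=50, c=12, k=10))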
 </section>

 <section class="section_submit" id="sec_submit">
   <h2>How to submit models/results to the leaderboard?</h2>
   <div>
-    <p>We welcome the community to submit evaluation results of new models.
-    non-verified, the authors are however required to upload their generations in case other members want to
-    <
-    <h3>2- Submitting Results</h3>
-    <p>To submit your results create a Pull Request in the community tab to add them under the folder
-    community_results in this repository:</p>
+    <p>We welcome the community to submit evaluation results of new models.
+    These results will be added as non-verified; the authors are, however, required to upload their generations so that other members can check them.
+    </p>
+    <p>
+    To submit your results, create a <span style="font-weight: bold;">Pull Request</span> in the community tab to add them under the
+    <a href="https://github.com/YihongDong/CDD-TED4LLMs" target="_blank">folder</a> <span class="span_">community_results</span> in the repository:
+    </p>
     <ul>
-      <li>Create a folder called ORG_MODELNAME_USERNAME
-      <li>Put
-      folder in it.</li>
+      <li>Create a folder called <span class="span_">ORG_MODELNAME_USERNAME</span>, for example <span class="span_">meta_CodeLlama_xxx</span>.</li>
+      <li>Put the generation outputs of your model in it.</li>
     </ul>
-    <p>The title of the PR should be [Community Submission] Model: org/model, Username: your_username
-    org and model with those corresponding to the model you evaluated.</p>
+    <p>The title of the PR should be <span class="span_">[Community Submission] Model: org/model, Username: your_username</span>, replacing org and model with those corresponding to the model you evaluated.</p>
   </div>
 </section>

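Editor's note: a minimal sketch of preparing such a submission folder. The generations file name and JSON layout below are assumptions for illustration only, since the submission instructions above do not specify an exact file format.

# Hypothetical layout for a community submission; file name and structure are
# illustrative assumptions, not a required format.
import json
from pathlib import Path

folder = Path("community_results") / "meta_CodeLlama_your_username"  # ORG_MODELNAME_USERNAME
folder.mkdir(parents=True, exist_ok=True)

# One list of generated completions per HumanEval task id (assumed layout).
generations = {"HumanEval/0": ["def has_close_elements(numbers, threshold): ..."]}
(folder / "generations.json").write_text(json.dumps(generations, indent=2))

# PR title: "[Community Submission] Model: meta/CodeLlama, Username: your_username"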
style.css
CHANGED
@@ -53,27 +53,21 @@
   border-bottom: transparent;
   background-color: rgb(255, 255, 255);
   font-size: 18px;
-  /* font-weight: bold; */
   font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
 }
 #btn_evalTable {
   border-top-left-radius: 5px;
-  /* border-bottom-left-radius: 5px; */
   border-right-width: 0;
   color: #000000;
 }
 #btn_plot {
   border-right-width: 0;
-  /* border-top-right-radius: 5px; */
 }
 #btn_about {
+  border-right-width: 0;
 }
 #btn_submit {
-  border-top-right-radius: 5px;
-  /* border-bottom-right-radius: 5px; */
-  border-left-width: 0;
+  border-right-width: 0;
 }
 #btn_more {
   border-top-right-radius: 5px;
@@ -84,6 +78,7 @@
 }
 /* button */

+
 /* evalTable */
 .section_evalTable {
   display: block;
@@ -114,9 +109,6 @@
 .section_evalTable__table td.td_value {
   font-family: 'Courier New', Courier, monospace;
 }
-/* .section_evalTable__table td.td_HumanEval {
-  font-family: Arial, Helvetica, sans-serif;
-} */
 .section_evalTable__table a {
   font-family:Verdana, Geneva, Tahoma, sans-serif;
   text-decoration-color: #0909f8;
@@ -141,10 +133,9 @@
   margin-top: 10px;
   margin-bottom: 0px;
   border-top: transparent;
-  /* border: 1px solid rgb(228, 228, 228); */
 }
 .section_evalTable__notes p {
+  line-height: 1.8em;
   background-color: #ffffff;
 }
 .section_evalTable__notes ul {
@@ -232,75 +223,72 @@
   margin-left: 130px;
   margin-right: 130px;
   padding-top: 10px;
-  padding-bottom:
+  padding-bottom: 10px;
+  padding-left: 20px;
+  padding-right: 20px;
   border: 1px solid rgb(228, 228, 228);
-  /* border-top-right-radius: 5px;
-  border-bottom-left-radius: 5px;
-  border-bottom-right-radius: 5px; */
-}
-.section_about h2 {
-  font-size: 25px;
-  font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
-  /* display: flex;
-  justify-content: center; */
 }
 .section_about ul {
   list-style: circle;
   padding-left: 20px;
 }
-.section_about ul, p
-  /* background-color: rgb(241, 248, 254); */
+.section_about ul, p {
   font-family:'Gill Sans', 'Gill Sans MT', Calibri, 'Trebuchet MS', sans-serif;
-  line-height: 1.
+  line-height: 1.8em;
 }
 .section_about h3 {
+  padding-top: 12px;
+  padding-bottom: 12px;
+  font-size: 17px;
+  font-family: 'Trebuchet MS', 'Lucida Sans Unicode', 'Lucida Grande', 'Lucida Sans', Arial, sans-serif;
 }
 .section_about a {
   color: #386df4;
-  text-decoration-color: #
-  text-decoration-style:
+  text-decoration-color: #386df4;
+  text-decoration-style: solid;
 }
+#a_here {
+  text-decoration: none;
 }
 /* about */

 /* submit */
 .section_submit {
   display: none;
-  /* margin-top: 30px; */
   margin-left: 130px;
   margin-right: 130px;
   padding-top: 10px;
-  padding-bottom:
+  padding-bottom: 10px;
+  padding-left: 20px;
+  padding-right: 20px;
   border: 1px solid rgb(228, 228, 228);
-  /* border-top-right-radius: 5px;
-  border-bottom-left-radius: 5px;
-  border-bottom-right-radius: 5px; */
 }
 .section_submit h2 {
-  font-size:
+  font-size: 22px;
   font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
   display: flex;
   justify-content: center;
+  font-weight: 500;
 }
 .section_submit ul {
   list-style: circle;
   padding-left: 20px;
 }
-.section_submit
+.section_submit p{
+  margin-top: 12px;
+}
+.section_submit ul, p {
   font-family:'Gill Sans', 'Gill Sans MT', Calibri, 'Trebuchet MS', sans-serif;
-  line-height: 1.
+  line-height: 1.8em
 }
-.
+.span_ {
+  background-color: #f3f3f3;
+  font-family:'Courier New', Courier, monospace;
 }
-.section_submit
+.section_submit a {
+  color: #386df4;
+  text-decoration-color: #0909f8;
+  text-decoration: solid;
 }
 /* submit */

@@ -308,30 +296,25 @@
 /* more */
 .section_more {
   display: none;
-  /* margin-top: 30px; */
   margin-left: 130px;
   margin-right: 130px;
   padding-top: 10px;
-  padding-bottom:
+  padding-bottom: 10px;
   padding-left: 20px;
   padding-right: 20px;
   border: 1px solid rgb(228, 228, 228);
-  /* border-top-right-radius: 5px;
-  border-bottom-left-radius: 5px;
-  border-bottom-right-radius: 5px; */
 }
 .section_more h2 {
-  font-size:
+  font-size: 22px;
   font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
-  justify-content: center; */
+  font-weight: 500;
 }
 .section_more p {
   font-family:'Gill Sans', 'Gill Sans MT', Calibri, 'Trebuchet MS', sans-serif;
   margin-top: 10px;
+  margin-bottom: 10px;
 }
 .section_more ul {
-  margin-top: 10px;
   list-style: circle;
   padding-left: 25px;
   font-family:'Lucida Sans', 'Lucida Sans Regular', 'Lucida Grande', 'Lucida Sans Unicode', Geneva, Verdana, sans-serif;
@@ -345,7 +328,3 @@
 }
 /* more */

-
-/* .u {
-  color: #ad64d4
-} */