wzxii committed
Commit 795f7e4 • 1 Parent(s): 889f4bb

Upload 2 files

Files changed (2):
  1. index.html +54 -77
  2. style.css +38 -59
index.html CHANGED
@@ -31,14 +31,14 @@
  </div>

  <div class="section_title__p">
- <p>Inspired from the <a href="https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard" target="_blank">🤗 Open LLM Leaderboard</a> and
  <a href="https://huggingface.co/spaces/optimum/llm-perf-leaderboard" target="_blank">🤗 Open LLM-Perf Leaderboard 🏋️</a>,
- we compare performance of base code generation models on
  <a href="https://huggingface.co/datasets/openai_humaneval" target="_blank">HumanEval</a> and
- <a href="https://github.com/YihongDong/CodeGenEvaluation" target="_blank">HumanEval-ET</a> benchamrk. We also measure Memorization-Generalization Index and
- provide information about the models.
- We compare both open and closed pre-trained code models, that people can start from as base models for
- their trainings.
  </p>
  </div>
  </section>
@@ -102,7 +102,7 @@
  <p>
  <ul>
  <li>MGI stands for Memorization-Generalization Index, which is derived from Avg. Peak in the original paper. A higher MGI value indicates a greater propensity for a model to engage in memorization as opposed to generalization.</li>
- <!-- <li>For more details check the 📝 About section.</li> -->
  </ul>
  </div>
  </section>
@@ -134,85 +134,62 @@


  <section class="section_about" id="sec_about">
- <h2>Context</h2>
- <div>
- <p>The growing number of code models released by the community necessitates a comprehensive evaluation to
  reliably benchmark their capabilities.
  Similar to the <a href="https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard" target="_blank">🤗 Open LLM Leaderboard</a>,
- we selected two common benchmarks for evaluating Code LLMs on multiple programming languages:</p>
- <ul>
- <li><a href="https://huggingface.co/datasets/openai_humaneval" target="_blank">HumanEval</a>
- - benchmark for measuring functional correctness for synthesizing programs from docstrings.
- It consists of 164 Python programming problems.</li>
- <li><a href="https://github.com/YihongDong/CodeGenEvaluation" target="_blank">HumanEval-ET</a>.</li>
- <li>MGI - In addition to these benchmarks, we also measure Memorization-Generalization Index.</li>
- </ul>
- <!-- <h3>Benchmark & Prompts</h3>
- <ul>
- <li>HumanEval-Python reports the pass@1 on HumanEval; the rest is from MultiPL-E benchmark.</li>
- <li>For all languages, we use the original benchamrk prompts for all models except HumanEval-Python,
- where we separate base from instruction models.
- We use the original code completion prompts for HumanEval for all base models, but for Instruction
- models,
- we use the Instruction version of HumanEval in HumanEvalSynthesize delimited by the tokens/text
- recommended by the authors of each model
- (we also use a max generation length of 2048 instead of 512).</li>
- </ul>
- <p>Figure below shows the example of OctoCoder vs Base HumanEval prompt, you can find the other prompts
- here.</p>
- </div> -->
- <!-- <div>
- <p>- An exception to this is the Phind models. They seem to follow to base prompts better than the
- instruction versions.
- Therefore, following the authors' recommendation we use base HumanEval prompts without stripping them of
- the last newline.
- - Also note that for WizardCoder-Python-34B-V1.0 & WizardCoder-Python-13B-V1.0 (CodeLLaMa based),
- we use the HumanEval-Python instruction prompt that the original authors used with their postprocessing
- (instead of HumanEvalSynthesize),
- code is available [here](https://github.com/bigcode-project/bigcode-evaluation-harness/pull/133).</p>
- <h3>Evalution Parameters</h3>
- <ul>
- <li>All models were evaluated with the bigcode-evaluation-harness with top-p=0.95, temperature=0.2,
- max_length_generation 512, and n_samples=50.</li>
- </ul>
- <h3>Throughput and Memory Usage</h3>
- <ul>
- <li>Throughputs and peak memory usage are measured using Optimum-Benchmark which powers Open LLM-Perf
- Leaderboard. (0 throughput corresponds to OOM).</li>
- </ul>
- <h3>Scoring and Rankings</h3>
- <ul>
- <li>Average score is the average pass@1 over all languages. For Win Rate, we find model rank for each
- language and compute num_models - (rank -1), then average this result over all languages.</li>
- </ul>
- <h3>Miscellaneous</h3>
- <ul>
- <li>#Languages column represents the number of programming languages included during the pretraining.
- UNK means the number of languages is unknown.</li>
- </ul>
- </div> -->
  </section>

  <section class="section_submit" id="sec_submit">
  <h2>How to submit models/results to the leaderboard?</h2>
  <div>
- <p>We welcome the community to submit evaluation results of new models. These results will be added as
- non-verified, the authors are however required to upload their generations in case other members want to
- check.</p>
- <h3>1 - Running Evaluation</h3>
- <p>We wrote a detailed guide for running the evaluation on your model. You can find the it in
- bigcode-evaluation-harness/leaderboard. This will generate a json file summarizing the results, in
- addition to the raw generations and metric files.</p>
- <h3>2- Submitting Results 🚀</h3>
- <p>To submit your results create a Pull Request in the community tab to add them under the folder
- community_results in this repository:</p>
  <ul>
- <li>Create a folder called ORG_MODELNAME_USERNAME for example bigcode_starcoder_loubnabnl</li>
- <li>Put your json file with grouped scores from the guide, in addition generations folder and metrics
- folder in it.</li>
  </ul>
- <p>The title of the PR should be [Community Submission] Model: org/model, Username: your_username, replace
- org and model with those corresponding to the model you evaluated.</p>
  </div>
  </section>

 
@@ -31,14 +31,14 @@
  </div>

  <div class="section_title__p">
+ <p>
+ Inspired by the <a href="https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard" target="_blank">🤗 Open LLM Leaderboard</a> and
  <a href="https://huggingface.co/spaces/optimum/llm-perf-leaderboard" target="_blank">🤗 Open LLM-Perf Leaderboard 🏋️</a>,
+ we compare the performance of base code generation models on the
  <a href="https://huggingface.co/datasets/openai_humaneval" target="_blank">HumanEval</a> and
+ <a href="https://github.com/YihongDong/CodeGenEvaluation" target="_blank">HumanEval-ET</a> benchmarks.
+ We also measure the Memorization-Generalization Index and provide the results for the models.
+ We compare both open-source and closed-source pre-trained code LLMs that can serve as base models for further training.
  </p>
  </div>
  </section>
 
@@ -102,7 +102,7 @@
  <p>
  <ul>
  <li>MGI stands for Memorization-Generalization Index, which is derived from Avg. Peak in the original paper. A higher MGI value indicates a greater propensity for a model to engage in memorization as opposed to generalization.</li>
+ <li>For more details, check the 📝 About section.</li>
  </ul>
  </div>
  </section>
 
@@ -134,85 +134,62 @@


  <section class="section_about" id="sec_about">
+ <h3>Benchmarking and Prompts</h3>
+ <!-- <p>The growing number of code models released by the community necessitates a comprehensive evaluation to
  reliably benchmark their capabilities.
  Similar to the <a href="https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard" target="_blank">🤗 Open LLM Leaderboard</a>,
+ we selected two common benchmarks for evaluating Code LLMs on multiple programming languages:</p> -->
+ <ul>
+ <li><a href="https://huggingface.co/datasets/openai_humaneval" target="_blank">HumanEval</a>:
+ Used to measure the functional correctness of programs generated from docstrings. It includes 164 Python programming problems.
+ </li>
+ <li><a href="https://github.com/YihongDong/CodeGenEvaluation" target="_blank">HumanEval-ET</a>:
+ An extended version of the HumanEval benchmark in which each task includes more than 100 test cases.
+ </li>
+ </ul>
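For reference, loading the HumanEval problems is a one-liner with the Hugging Face datasets library; the snippet below is a minimal sketch (HumanEval-ET is distributed through the CodeGenEvaluation repository linked above, so it is not shown here).

```python
# Minimal sketch: load the 164 HumanEval problems used for the functional-correctness evaluation.
from datasets import load_dataset

humaneval = load_dataset("openai_humaneval", split="test")
print(len(humaneval))          # 164 problems
example = humaneval[0]
print(example["prompt"])       # function signature + docstring shown to the model
print(example["entry_point"])  # name of the function exercised by the tests
print(example["test"])         # unit tests used to check each completion
```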
+ <p>
+ For all models (except for the StarCoder family), we used the original benchmark prompts from <a href="https://huggingface.co/datasets/openai_humaneval" target="_blank">HumanEval</a> and added a `&lt;bos&gt;` token before the provided prompt.
+ The maximum generation length was set to the length of the original prompt plus 300 tokens.
+ </p>
+ <p>
+ For the StarCoder family models (such as StarCoder2-7B and StarCoder2-15B),
+ we used the official bigcode-evaluation-harness for generation.
+ More details can be found <a href="https://github.com/bigcode-project/bigcode-evaluation-harness/" target="_blank" id="a_here">[here]</a>.
+ </p>
+ <h3>Evaluation Parameters</h3>
+ <p>
+ For all models, we drew 1 sample at temperature 0 and 50 samples at temperature 0.8
+ for the subsequent result calculations. The parameters are set as follows:
+ </p>
+ <ul>
+ <li>top-p=1.0 (default parameter in the transformers library)</li>
+ <li>top-k=50 (default parameter in the transformers library)</li>
+ <li>max_length_generation=len(prompt)+300</li>
+ <li>temperature=0 or temperature=0.8</li>
+ <li>n_samples=50</li>
+ </ul>
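To make the setup above concrete, here is a hedged sketch of how generation with these parameters could be run with the transformers library. The model name is a placeholder and this is not the leaderboard's own script; it only mirrors the prompt construction (prepended `<bos>`, prompt length + 300 tokens) and the sampling parameters listed above.

```python
# Hedged sketch of the generation setup described above, not the leaderboard's own script.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "codellama/CodeLlama-7b-hf"  # placeholder: any causal code LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

def generate_completions(prompt: str, temperature: float = 0.8, n_samples: int = 50):
    # A <bos> token is prepended to the original HumanEval prompt.
    text = (tokenizer.bos_token or "") + prompt
    inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device)
    prompt_len = inputs["input_ids"].shape[1]
    gen_kwargs = dict(
        max_length=prompt_len + 300,      # original prompt length + 300 tokens
        pad_token_id=tokenizer.eos_token_id,
    )
    if temperature > 0:
        # 50 samples at temperature 0.8 with the library-default top-p / top-k
        gen_kwargs.update(do_sample=True, temperature=temperature,
                          top_p=1.0, top_k=50, num_return_sequences=n_samples)
    else:
        # temperature 0 is run as greedy decoding with a single sample
        gen_kwargs.update(do_sample=False, num_return_sequences=1)
    outputs = model.generate(**inputs, **gen_kwargs)
    return [tokenizer.decode(o[prompt_len:], skip_special_tokens=True) for o in outputs]
```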
+ <h3>Performance Metrics</h3>
+ <ul>
+ <li>pass@k: Represents the probability that the model solves a test problem at least once within `k` sampled attempts.</li>
+ <li>MGI: The average peakedness of the edit-distance distribution constructed from the model's samples.</li>
+ </ul>
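The pass@k values reported from the 50 samples are typically computed with the unbiased estimator introduced in the HumanEval paper; a small sketch:

```python
# Unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021):
# pass@k = 1 - C(n - c, k) / C(n, k), computed in a numerically stable form.
# n = samples generated per problem, c = samples that pass the unit tests.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example with the leaderboard's n_samples=50: 12 passing samples.
print(pass_at_k(50, 12, 1))   # 0.24
print(pass_at_k(50, 12, 10))  # ~0.95
```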
  </section>

  <section class="section_submit" id="sec_submit">
  <h2>How to submit models/results to the leaderboard?</h2>
  <div>
+ <p>We welcome the community to submit evaluation results of new models.
+ These results will be added as non-verified; the authors are, however, required to upload their generations so that other members can check them.
+ </p>
+ <p>
+ To submit your results, create a <span style="font-weight: bold;">Pull Request</span> in the community tab to add them under the
+ <a href="https://github.com/YihongDong/CDD-TED4LLMs" target="_blank">folder</a> <span class="span_">community_results</span> in the repository:
+ </p>
  <ul>
+ <li>Create a folder called <span class="span_">ORG_MODELNAME_USERNAME</span>, for example <span class="span_">meta_CodeLlama_xxx</span>.</li>
+ <li>Put the generation outputs of your model in it.</li>
  </ul>
+ <p>The title of the PR should be <span class="span_">[Community Submission] Model: org/model, Username: your_username</span>, replacing org and model with those corresponding to the model you evaluated.</p>
  </div>
  </section>

style.css CHANGED
@@ -53,27 +53,21 @@
  border-bottom: transparent;
  background-color: rgb(255, 255, 255);
  font-size: 18px;
- /* font-weight: bold; */
  font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
  }
  #btn_evalTable {
  border-top-left-radius: 5px;
- /* border-bottom-left-radius: 5px; */
  border-right-width: 0;
  color: #000000;
  }
  #btn_plot {
  border-right-width: 0;
- /* border-top-right-radius: 5px; */
  }
  #btn_about {
- display: none
  }
  #btn_submit {
- display: none;
- border-top-right-radius: 5px;
- /* border-bottom-right-radius: 5px; */
- border-left-width: 0;
  }
  #btn_more {
  border-top-right-radius: 5px;
@@ -84,6 +78,7 @@
  }
  /* button */

  /* evalTable */
  .section_evalTable {
  display: block;
@@ -114,9 +109,6 @@
  .section_evalTable__table td.td_value {
  font-family: 'Courier New', Courier, monospace;
  }
- /* .section_evalTable__table td.td_HumanEval {
- font-family: Arial, Helvetica, sans-serif;
- } */
  .section_evalTable__table a {
  font-family:Verdana, Geneva, Tahoma, sans-serif;
  text-decoration-color: #0909f8;
@@ -141,10 +133,9 @@
  margin-top: 10px;
  margin-bottom: 0px;
  border-top: transparent;
- /* border: 1px solid rgb(228, 228, 228); */
  }
  .section_evalTable__notes p {
- font-size: 16px;
  background-color: #ffffff;
  }
  .section_evalTable__notes ul {
@@ -232,75 +223,72 @@
  margin-left: 130px;
  margin-right: 130px;
  padding-top: 10px;
- padding-bottom: 30px;
  border: 1px solid rgb(228, 228, 228);
- /* border-top-right-radius: 5px;
- border-bottom-left-radius: 5px;
- border-bottom-right-radius: 5px; */
- }
- .section_about h2 {
- font-size: 25px;
- font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
- /* display: flex;
- justify-content: center; */
  }
  .section_about ul {
  list-style: circle;
  padding-left: 20px;
  }
- .section_about ul, p, h3 {
- /* background-color: rgb(241, 248, 254); */
  font-family:'Gill Sans', 'Gill Sans MT', Calibri, 'Trebuchet MS', sans-serif;
- line-height: 1.6em
  }
  .section_about h3 {
- font-size: 18px;
  }
  .section_about a {
  color: #386df4;
- text-decoration-color: #0909f8;
- text-decoration-style: dashed;
  }
- .section_about div {
- margin-top: 10px;
  }
  /* about */

  /* submit */
  .section_submit {
  display: none;
- /* margin-top: 30px; */
  margin-left: 130px;
  margin-right: 130px;
  padding-top: 10px;
- padding-bottom: 30px;
  border: 1px solid rgb(228, 228, 228);
- /* border-top-right-radius: 5px;
- border-bottom-left-radius: 5px;
- border-bottom-right-radius: 5px; */
  }
  .section_submit h2 {
- font-size: 25px;
  font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
  display: flex;
  justify-content: center;
  }
  .section_submit ul {
  list-style: circle;
  padding-left: 20px;
  }
- .section_submit ul, p, h3 {
- /* background-color: rgb(241, 248, 254); */
  font-family:'Gill Sans', 'Gill Sans MT', Calibri, 'Trebuchet MS', sans-serif;
- line-height: 1.6em
  }
- .section_submit h3 {
- font-size: 18px;
  }
- .section_submit div {
- margin-top: 10px;
- /* border: dotted gainsboro;
- border-radius: 10px; */
  }
  /* submit */

@@ -308,30 +296,25 @@
  /* more */
  .section_more {
  display: none;
- /* margin-top: 30px; */
  margin-left: 130px;
  margin-right: 130px;
  padding-top: 10px;
- padding-bottom: 30px;
  padding-left: 20px;
  padding-right: 20px;
  border: 1px solid rgb(228, 228, 228);
- /* border-top-right-radius: 5px;
- border-bottom-left-radius: 5px;
- border-bottom-right-radius: 5px; */
  }
  .section_more h2 {
- font-size: 25px;
  font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
- /* display: flex;
- justify-content: center; */
  }
  .section_more p {
  font-family:'Gill Sans', 'Gill Sans MT', Calibri, 'Trebuchet MS', sans-serif;
  margin-top: 10px;
  }
  .section_more ul {
- margin-top: 10px;
  list-style: circle;
  padding-left: 25px;
  font-family:'Lucida Sans', 'Lucida Sans Regular', 'Lucida Grande', 'Lucida Sans Unicode', Geneva, Verdana, sans-serif;
@@ -345,7 +328,3 @@
  }
  /* more */

-
- /* .u {
- color: #ad64d4
- } */
 
@@ -53,27 +53,21 @@
  border-bottom: transparent;
  background-color: rgb(255, 255, 255);
  font-size: 18px;
  font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
  }
  #btn_evalTable {
  border-top-left-radius: 5px;
  border-right-width: 0;
  color: #000000;
  }
  #btn_plot {
  border-right-width: 0;
  }
  #btn_about {
+ border-right-width: 0;
  }
  #btn_submit {
+ border-right-width: 0;
  }
  #btn_more {
  border-top-right-radius: 5px;

@@ -84,6 +78,7 @@
  }
  /* button */

+
  /* evalTable */
  .section_evalTable {
  display: block;
 
@@ -114,9 +109,6 @@
  .section_evalTable__table td.td_value {
  font-family: 'Courier New', Courier, monospace;
  }
  .section_evalTable__table a {
  font-family:Verdana, Geneva, Tahoma, sans-serif;
  text-decoration-color: #0909f8;

@@ -141,10 +133,9 @@
  margin-top: 10px;
  margin-bottom: 0px;
  border-top: transparent;
  }
  .section_evalTable__notes p {
+ line-height: 1.8em;
  background-color: #ffffff;
  }
  .section_evalTable__notes ul {
 
@@ -232,75 +223,72 @@
  margin-left: 130px;
  margin-right: 130px;
  padding-top: 10px;
+ padding-bottom: 10px;
+ padding-left: 20px;
+ padding-right: 20px;
  border: 1px solid rgb(228, 228, 228);
  }
  .section_about ul {
  list-style: circle;
  padding-left: 20px;
  }
+ .section_about ul, p {
  font-family:'Gill Sans', 'Gill Sans MT', Calibri, 'Trebuchet MS', sans-serif;
+ line-height: 1.8em;
  }
  .section_about h3 {
+ padding-top: 12px;
+ padding-bottom: 12px;
+ font-size: 17px;
+ font-family: 'Trebuchet MS', 'Lucida Sans Unicode', 'Lucida Grande', 'Lucida Sans', Arial, sans-serif;
  }
  .section_about a {
  color: #386df4;
+ text-decoration-color: #386df4;
+ text-decoration-style: solid;
  }
+ #a_here {
+ text-decoration: none;
  }
  /* about */

  /* submit */
  .section_submit {
  display: none;
  margin-left: 130px;
  margin-right: 130px;
  padding-top: 10px;
+ padding-bottom: 10px;
+ padding-left: 20px;
+ padding-right: 20px;
  border: 1px solid rgb(228, 228, 228);
  }
  .section_submit h2 {
+ font-size: 22px;
  font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
  display: flex;
  justify-content: center;
+ font-weight: 500;
  }
  .section_submit ul {
  list-style: circle;
  padding-left: 20px;
  }
+ .section_submit p{
+ margin-top: 12px;
+ }
+ .section_submit ul, p {
  font-family:'Gill Sans', 'Gill Sans MT', Calibri, 'Trebuchet MS', sans-serif;
+ line-height: 1.8em
  }
+ .span_ {
+ background-color: #f3f3f3;
+ font-family:'Courier New', Courier, monospace;
  }
+ .section_submit a {
+ color: #386df4;
+ text-decoration-color: #0909f8;
+ text-decoration: solid;
  }
  /* submit */

 
@@ -308,30 +296,25 @@
  /* more */
  .section_more {
  display: none;
  margin-left: 130px;
  margin-right: 130px;
  padding-top: 10px;
+ padding-bottom: 10px;
  padding-left: 20px;
  padding-right: 20px;
  border: 1px solid rgb(228, 228, 228);
  }
  .section_more h2 {
+ font-size: 22px;
  font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
+ font-weight: 500;
  }
  .section_more p {
  font-family:'Gill Sans', 'Gill Sans MT', Calibri, 'Trebuchet MS', sans-serif;
  margin-top: 10px;
+ margin-bottom: 10px;
  }
  .section_more ul {
  list-style: circle;
  padding-left: 25px;
  font-family:'Lucida Sans', 'Lucida Sans Regular', 'Lucida Grande', 'Lucida Sans Unicode', Geneva, Verdana, sans-serif;

@@ -345,7 +328,3 @@
  }
  /* more */