wzxii committed on
Commit
cf11127
•
1 Parent(s): 52e2576

Update index.html

Files changed (1)
  1. index.html +234 -19
index.html CHANGED
@@ -1,19 +1,234 @@
- <!doctype html>
- <html>
- <head>
- <meta charset="utf-8" />
- <meta name="viewport" content="width=device-width" />
- <title>My static Space</title>
- <link rel="stylesheet" href="style.css" />
- </head>
- <body>
- <div class="card">
- <h1>Welcome to your static Space!</h1>
- <p>You can modify this app directly by editing <i>index.html</i> in the Files and versions tab.</p>
- <p>
- Also don't forget to check the
- <a href="https://huggingface.co/docs/hub/spaces" target="_blank">Spaces documentation</a>.
- </p>
- </div>
- </body>
- </html>
+ <!DOCTYPE html>
+ <html lang="en">
+
+ <head>
+ <meta charset="UTF-8">
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
+ <title>Memorization or Generation of Big Code Model Leaderboard</title>
+ <link rel="stylesheet" href="style.css">
+ <script src="echarts.min.js"></script>
+ </head>
+
+ <body>
+
+ <section class="section_title">
+ <h1>
+ ⭐ <span style="color: rgb(223, 194, 25);">Memorization</span> or
+ <span style="color: rgb(223, 194, 25);">Generation</span>
+ of Big
+ <span style="color: rgb(223, 194, 25);">Code</span>
+ Model
+ <span style="color: rgb(223, 194, 25);">Leaderboard</span>
+ </h1>
+
+ <div class="section_title__imgs">
+ <a href="https://github.com/YihongDong/CDD-TED4LLMs" id="a_github" target="_blank">
+ <img src="https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white">
+ </a>
+ <a href="https://arxiv.org/abs/2402.15938" id="a_arxiv" target="_blank">
+ <img src="https://img.shields.io/badge/PAPER-ACL'24-ad64d4.svg?style=for-the-badge">
+ </a>
+ </div>
+
+ <div class="section_title__p">
+ <p>Inspired by the <a href="https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard" target="_blank">🤗 Open LLM Leaderboard</a> and
+ <a href="https://huggingface.co/spaces/optimum/llm-perf-leaderboard" target="_blank">🤗 Open LLM-Perf Leaderboard 🏋️</a>,
+ we compare the performance of base code generation models on the
+ <a href="https://huggingface.co/datasets/openai_humaneval" target="_blank">HumanEval</a> and
+ <a href="https://huggingface.co/datasets/dz1/CodeScore-HumanEval-ET" target="_blank">HumanEval-ET</a> benchmarks. We also measure the Memorization-Generalization Index and
+ provide information about the models.
+ We only compare open pre-trained code models that people can use as base models for
+ their own training.
+ </p>
+ </div>
+ </section>
+
+ <section class="section_button">
+ <button id="btn_evalTable">🔍 Evaluation Table</button>
+ <button id="btn_plot">📊 Performance Plot</button>
+ <button id="btn_about">📝 About</button>
+ <button id="btn_submit">🚀 Submit results</button>
+ </section>
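button.js, loaded at the end of the page, is not part of this commit. A minimal sketch of the tab-style toggling these four buttons suggest, assuming they simply show their matching section (ids `sec_evalTable`, `sec_plot`, `sec_about`, `sec_submit` defined below) and hide the others, could look like this:

```javascript
// Hypothetical sketch of what button.js might do: show the clicked tab's section, hide the rest.
// The real button.js is not included in this commit; the ids match the sections defined in this page.
const TABS = {
  btn_evalTable: "sec_evalTable",
  btn_plot: "sec_plot",
  btn_about: "sec_about",
  btn_submit: "sec_submit",
};

Object.entries(TABS).forEach(([buttonId, sectionId]) => {
  document.getElementById(buttonId).addEventListener("click", () => {
    Object.values(TABS).forEach((id) => {
      // Display only the section belonging to the clicked button.
      document.getElementById(id).style.display = id === sectionId ? "block" : "none";
    });
  });
});
```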
+
+ <section class="section_evalTable" id="sec_evalTable">
+ <div class="section_evalTable__table">
+ <table id="evalTable">
+ <colgroup>
+ <col style="width: 8%">
+ <col style="width: 22%">
+ <col style="width: 22%">
+ <col style="width: 12%">
+ <col style="width: 12%">
+ <col style="width: 12%">
+ <col style="width: 12%">
+ </colgroup>
+
+ <thead>
+ <tr>
+ <th rowspan="2">Benchmark</th>
+ <th rowspan="2">Model
+ <button class="button_sort" data-direction="desc" data-type="name"></button>
+ </th>
+ <th data-direction="desc" rowspan="2" data-type="MGI">MGI,
+ <br/>Memorization-Generalization Index
+ <br/>(Ori: Avg. Peak)
+ <button class="button_sort" data-direction="desc" data-type="MGI"></button>
+ </th>
+ <th colspan="2">Pass@1 (temp=0)</th>
+ <th colspan="2">Pass@1 (temp=0.8)</th>
+ </tr>
+ <tr>
+ <th>HumanEval
+ <button class="button_sort" data-direction="desc" data-type="temp0_HumanEval"></button>
+ </th>
+ <th>HumanEval-ET
+ <button class="button_sort" data-direction="desc" data-type="temp0_HumanEval_ET"></button>
+ </th>
+ <th>HumanEval
+ <button class="button_sort" data-direction="desc" data-type="temp0_8_HumanEval"></button>
+ </th>
+ <th>HumanEval-ET
+ <button class="button_sort" data-direction="desc" data-type="temp0_8_HumanEval_ET"></button>
+ </th>
+ </tr>
+ </thead>
+
+ <tbody>
+
+ </tbody>
+ </table>
+ <script src="table.js"></script>
+ </div>
+
+ <div class="section_evalTable__notes">
+ <p><strong>Notes</strong></p>
+ <ul>
+ <li>MGI stands for Memorization-Generalization Index, originally referred to as the Contamination Ratio.</li>
+ <li>The scores of instruction-tuned models might be significantly higher on humaneval-python than on other
+ languages.
+ We use the instruction format of
+ <a href="https://huggingface.co/datasets/openai_humaneval" target="_blank">HumanEval</a> and
+ <a href="https://huggingface.co/datasets/dz1/CodeScore-HumanEval-ET" target="_blank">HumanEval-ET</a>.</li>
+ <li>For more details, check the 📝 About section.</li>
+ </ul>
+ </div>
+ </section>
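table.js, which fills and sorts the table, is also not shown in this commit. Going only by the `.button_sort` buttons and their `data-type`/`data-direction` attributes above, a minimal sketch of the sorting behavior might look as follows; the `SORT_COLUMNS` name and the data-type-to-column mapping are assumptions for illustration, not the actual implementation:

```javascript
// Hypothetical sorting logic for #evalTable; the data-type -> column index map below is assumed.
const SORT_COLUMNS = {
  name: 1,
  MGI: 2,
  temp0_HumanEval: 3,
  temp0_HumanEval_ET: 4,
  temp0_8_HumanEval: 5,
  temp0_8_HumanEval_ET: 6,
};

document.querySelectorAll(".button_sort").forEach((button) => {
  button.addEventListener("click", () => {
    const column = SORT_COLUMNS[button.dataset.type];
    const direction = button.dataset.direction === "desc" ? -1 : 1;
    const tbody = document.querySelector("#evalTable tbody");
    const rows = Array.from(tbody.querySelectorAll("tr"));

    rows.sort((a, b) => {
      const textA = a.cells[column].textContent.trim();
      const textB = b.cells[column].textContent.trim();
      const numA = parseFloat(textA);
      const numB = parseFloat(textB);
      // The model-name column is not numeric, so fall back to string comparison there.
      if (Number.isNaN(numA) || Number.isNaN(numB)) {
        return direction * textA.localeCompare(textB);
      }
      return direction * (numA - numB);
    });

    // Flip the stored direction so the next click sorts the other way, then re-append rows in order.
    button.dataset.direction = direction === -1 ? "asc" : "desc";
    rows.forEach((row) => tbody.appendChild(row));
  });
});
```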
+
+ <section class="section_plot" id="sec_plot">
+ <div style="display: flex;">
+ <div class="section_plot__div" id="sec_plot__div1">
+ <div class="section_plot__btnGroup" id="sec_plot__btnGroup1">
+ <button id="btn_temp0_HumanEval"></button>
+ <span id="span_temp0_HumanEval">HumanEval</span>
+ <button id="btn_temp0_HumanEval_ET"></button>
+ <span id="span_temp0_HumanEval_ET">HumanEval-ET</span>
+ </div>
+ <div id="sec_plot__chart1" style="width:736.5px; height:600px;"></div>
+ </div>
+
+ <div class="section_plot__div" id="sec_plot__div2">
+ <div class="section_plot__btnGroup" id="sec_plot__btnGroup2">
+ <button id="btn_temp0_8_HumanEval"></button>
+ <span id="span_temp0_8_HumanEval">HumanEval</span>
+ <button id="btn_temp0_8_HumanEval_ET"></button>
+ <span id="span_temp0_8_HumanEval_ET">HumanEval-ET</span>
+ </div>
+ <div id="sec_plot__chart2" style="width:736.5px; height:600px;"></div>
+ </div>
+ </div>
+ <script src="chart.js"></script>
+ </section>
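chart.js, which draws the two performance plots, is likewise not part of this commit. Since echarts.min.js is loaded in the head, one panel could be rendered roughly as below; the axis names and the scatter series shape are assumptions about what chart.js plots (pass@1 against MGI), not its actual code:

```javascript
// Hypothetical ECharts setup for the left panel; the real data loading lives in chart.js.
const chart1 = echarts.init(document.getElementById("sec_plot__chart1"));
chart1.setOption({
  title: { text: "Pass@1 (temp=0)" },
  tooltip: { trigger: "item" },
  xAxis: { type: "value", name: "MGI" },
  yAxis: { type: "value", name: "Pass@1 (%)" },
  series: [
    {
      type: "scatter",
      // Each point would be [MGI, pass@1] for one model; left empty here because
      // the leaderboard data is loaded at runtime and is not reproduced in this sketch.
      data: [],
    },
  ],
});
```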
+
+
+ <section class="section_about" id="sec_about">
+ <h2>Context</h2>
+ <div>
+ <p>The growing number of code models released by the community necessitates a comprehensive evaluation to
+ reliably benchmark their capabilities.
+ Similar to the 🤗 Open LLM Leaderboard, we selected two common benchmarks for evaluating Code LLMs on
+ multiple programming languages:</p>
+ <ul>
+ <li>HumanEval - a benchmark for measuring functional correctness when synthesizing programs from
+ docstrings. It consists of 164 Python programming problems.</li>
+ <li>MultiPL-E - a translation of HumanEval into 18 programming languages.</li>
+ <li>Throughput Measurement - in addition to these benchmarks, we also measure model throughput at
+ batch sizes of 1 and 50 to compare inference speed.</li>
+ </ul>
+ <h3>Benchmark & Prompts</h3>
+ <ul>
+ <li>HumanEval-Python reports the pass@1 on HumanEval; the rest comes from the MultiPL-E benchmark.</li>
+ <li>For all languages, we use the original benchmark prompts for all models except HumanEval-Python,
+ where we separate base from instruction models.
+ We use the original code completion prompts for HumanEval for all base models, but for instruction
+ models,
+ we use the instruction version of HumanEval in HumanEvalSynthesize, delimited by the tokens/text
+ recommended by the authors of each model
+ (we also use a max generation length of 2048 instead of 512).</li>
+ </ul>
+ <p>The figure below shows an example of the OctoCoder vs. base HumanEval prompt; you can find the other
+ prompts here.</p>
+ </div>
+ <div>
+ <ul>
+ <li>An exception to this is the Phind models. They seem to follow base prompts better than the
+ instruction versions.
+ Therefore, following the authors' recommendation, we use base HumanEval prompts without stripping them of
+ the last newline.</li>
+ <li>Also note that for WizardCoder-Python-34B-V1.0 & WizardCoder-Python-13B-V1.0 (CodeLLaMa based),
+ we use the HumanEval-Python instruction prompt that the original authors used with their postprocessing
+ (instead of HumanEvalSynthesize);
+ the code is available <a href="https://github.com/bigcode-project/bigcode-evaluation-harness/pull/133" target="_blank">here</a>.</li>
+ </ul>
+ <h3>Evaluation Parameters</h3>
+ <ul>
+ <li>All models were evaluated with the bigcode-evaluation-harness with top-p=0.95, temperature=0.2,
+ max_length_generation 512, and n_samples=50.</li>
+ </ul>
+ <h3>Throughput and Memory Usage</h3>
+ <ul>
+ <li>Throughputs and peak memory usage are measured using Optimum-Benchmark, which powers the Open LLM-Perf
+ Leaderboard (a throughput of 0 corresponds to OOM).</li>
+ </ul>
+ <h3>Scoring and Rankings</h3>
+ <ul>
+ <li>Average score is the average pass@1 over all languages. For Win Rate, we find the model's rank for each
+ language, compute num_models - (rank - 1), and then average this result over all languages (see the sketch
+ after this list).</li>
+ </ul>
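As a worked illustration of the Win Rate described in the list above, the following hypothetical helper (not leaderboard code; the `scores` layout is assumed) applies the formula per language and averages the results:

```javascript
// Win Rate as described above: per language, score num_models - (rank - 1), then average.
// `scores` maps model -> language -> pass@1; rank 1 is the best pass@1 for that language.
function winRate(scores, model) {
  const models = Object.keys(scores);
  const languages = Object.keys(scores[model]);
  const perLanguage = languages.map((lang) => {
    const ranked = [...models].sort((a, b) => scores[b][lang] - scores[a][lang]);
    const rank = ranked.indexOf(model) + 1;
    return models.length - (rank - 1);
  });
  return perLanguage.reduce((sum, v) => sum + v, 0) / perLanguage.length;
}
```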
+ <h3>Miscellaneous</h3>
+ <ul>
+ <li>The #Languages column represents the number of programming languages included during pretraining.
+ UNK means the number of languages is unknown.</li>
+ </ul>
+ </div>
+ </section>
+
+ <section class="section_submit" id="sec_submit">
+ <h2>How to submit models/results to the leaderboard?</h2>
+ <div>
+ <p>We welcome the community to submit evaluation results of new models. These results will be added as
+ non-verified; the authors are, however, required to upload their generations in case other members want to
+ check them.</p>
+ <h3>1 - Running Evaluation</h3>
+ <p>We wrote a detailed guide for running the evaluation on your model. You can find it in
+ bigcode-evaluation-harness/leaderboard. This will generate a json file summarizing the results, in
+ addition to the raw generations and metric files.</p>
+ <h3>2 - Submitting Results 🚀</h3>
+ <p>To submit your results, create a Pull Request in the community tab to add them under the folder
+ community_results in this repository:</p>
+ <ul>
+ <li>Create a folder called ORG_MODELNAME_USERNAME, for example bigcode_starcoder_loubnabnl.</li>
+ <li>Put your json file with grouped scores from the guide, together with the generations folder and the
+ metrics folder, in it.</li>
+ </ul>
+ <p>The title of the PR should be [Community Submission] Model: org/model, Username: your_username. Replace
+ org and model with those corresponding to the model you evaluated.</p>
+ </div>
+ </section>
+
+
+
+ <footer>
+ </footer>
+
+ <script src="button.js"></script>
+ </body>
+
+ </html>