Upload 2 files

Files changed:
- index.html +11 -11
- style.css +4 -3
index.html CHANGED
@@ -137,16 +137,16 @@
       <div>
         <p>The growing number of code models released by the community necessitates a comprehensive evaluation to
           reliably benchmark their capabilities.
-          Similar to the
-          multiple programming languages:</p>
+          Similar to the <a href="https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard" target="_blank">🤗 Open LLM Leaderboard</a>,
+          we selected two common benchmarks for evaluating Code LLMs on multiple programming languages:</p>
         <ul>
-          <li>HumanEval
-
-
-          <li
-
+          <li><a href="https://huggingface.co/datasets/openai_humaneval" target="_blank">HumanEval</a>
+            - benchmark for measuring functional correctness for synthesizing programs from docstrings.
+            It consists of 164 Python programming problems.</li>
+          <li><a href="https://github.com/YihongDong/CodeGenEvaluation" target="_blank">HumanEval-ET</a>.</li>
+          <li>MGI - In addition to these benchmarks, we also measure the Memorization-Generalization Index.</li>
         </ul>
-        <h3>Benchmark & Prompts</h3>
+        <!-- <h3>Benchmark & Prompts</h3>
         <ul>
           <li>HumanEval-Python reports the pass@1 on HumanEval; the rest is from the MultiPL-E benchmark.</li>
           <li>For all languages, we use the original benchmark prompts for all models except HumanEval-Python,
@@ -159,8 +159,8 @@
         </ul>
         <p>The figure below shows an example of the OctoCoder vs Base HumanEval prompt; you can find the other prompts
           here.</p>
-        </div>
-        <div>
+        </div> -->
+        <!-- <div>
         <p>- An exception to this is the Phind models. They seem to follow the base prompts better than the
           instruction versions.
           Therefore, following the authors' recommendation we use base HumanEval prompts without stripping them of
@@ -189,7 +189,7 @@
         <li>#Languages column represents the number of programming languages included during the pretraining.
           UNK means the number of languages is unknown.</li>
         </ul>
-        </div>
+        </div> -->
       </section>

       <section class="section_submit" id="sec_submit">
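A note for context: the (now commented-out) Benchmark & Prompts block above refers to pass@1 on HumanEval. pass@1 is conventionally reported with the unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021); the diff does not show how this Space computes it, so the Python sketch below is only illustrative, and the function name and sample counts are assumptions rather than code from this repository.

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated per problem
    c: number of those samples that pass the unit tests
    k: the k in pass@k (k=1 for the pass@1 reported on the leaderboard)
    """
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), expanded as a numerically stable product
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Hypothetical example: 20 samples for one problem, 5 of them pass -> pass@1 = 0.25
print(pass_at_k(n=20, c=5, k=1))

Averaging this value over all 164 HumanEval problems gives the benchmark score; with a single greedy sample per problem (n = k = 1) it reduces to the fraction of problems whose completion passes all tests.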
style.css CHANGED
@@ -251,9 +251,10 @@
 .section_about h3 {
   font-size: 18px;
 }
-.section_about
-
-
+.section_about a {
+  color: #386df4;
+  text-decoration-color: #0909f8;
+  text-decoration-style: dashed;
 }
 .section_about div {
   margin-top: 10px;