|
# `EvalPlus(📃) => 📈`
|
|
|
<p align="center"> |
|
<a href="https://evalplus.github.io/leaderboard.html"><img src="https://img.shields.io/badge/%F0%9F%8F%86-leaderboard-8A2BE2"></a> |
|
<a href="https://openreview.net/forum?id=1qvx610Cu7"><img src="https://img.shields.io/badge/Paper-NeurIPS'23-a55fed.svg"></a> |
|
<a href="https://huggingface.co/evalplus/"><img src="https://img.shields.io/badge/π€%20Hugging%20Face-evalplus-%23ff8811.svg"></a> |
|
<a href="https://pypi.org/project/evalplus/"><img src="https://img.shields.io/pypi/v/evalplus?color=g"></a> |
|
<a href="https://pepy.tech/project/evalplus"><img src="https://static.pepy.tech/badge/evalplus"></a> |
|
<a href="https://hub.docker.com/r/ganler/evalplus" title="Docker"><img src="https://img.shields.io/docker/image-size/ganler/evalplus"></a> |
|
<a href="https://github.com/evalplus/evalplus/blob/master/LICENSE"><img src="https://img.shields.io/pypi/l/evalplus"></a> |
|
</p> |
|
|
|
|
|
<p align="center"> |
|
<a href="#-quick-start">π₯Quick Start</a> β’ |
|
<a href="#-llm-generated-code">π»LLM code</a> β’ |
|
<a href="#-useful-tools">π¨Tools</a> β’ |
|
<a href="#-citation">πCitation</a> β’ |
|
<a href="#-acknowledgement">πAcknowledgement</a> |
|
</p> |
|
|
|
> [!Important] |
|
> <div align="center"> |
|
> <b> |
|
> 📢 Who is the best LLM coder? Take a look at <a href="https://evalplus.github.io/leaderboard.html">the EvalPlus leaderboard 🏆</a>! 📢
|
> </b> |
|
> <br> |
|
> <b> |
|
> 🤗 Request for independent model evaluation is <a href="https://github.com/evalplus/evalplus/issues/new/choose">open</a>!
|
> </b> |
|
> </div> |
|
|
|
## About |
|
|
|
> [!Warning] |
|
> <div align="center"> |
|
> <b> |
|
> 🚨 Evaluating LLM-generated code over datasets with "3 test-cases" is **NOT** enough! 🚨
|
> </b> |
|
> </div> |
|
|
|
EvalPlus is a rigorous evaluation framework for LLM4Code, with: |
|
|
|
* ✨ **HumanEval+**: 80x more tests than the original HumanEval!

* ✨ **MBPP+**: 35x more tests than the original MBPP!

* ✨ **Evaluation framework**: our packages/images/tools can easily and safely evaluate LLMs on the above benchmarks.
|
|
|
Why EvalPlus? What does using EvalPlus datasets bring to you? |
|
|
|
* ✨ **Reliable ranking**: See [our leaderboard](https://evalplus.github.io/leaderboard.html) for the latest LLM ranking before and after rigorous evaluation.

* ✨ **Code robustness**: Compare the scores before (e.g., HumanEval) and after (e.g., HumanEval+) applying EvalPlus! The drop indicates how robust the LLM-generated code is: a smaller drop means more robustness, while a larger drop means the code tends to be more fragile.

* ✨ **Pre-generated samples**: EvalPlus accelerates LLM4Code research by open-sourcing [LLM-generated samples](#-llm-generated-code) for various models -- no need to re-run the expensive benchmarks!
|
|
|
Want to know more details? Read our [**NeurIPS'23 paper**](https://openreview.net/forum?id=1qvx610Cu7) as well as our [**Google Slides**](https://docs.google.com/presentation/d/1eTxzUQG9uHaU13BGhrqm4wH5NmMZiM3nI0ezKlODxKs)!
|
|
|
## 🔥 Quick Start
|
|
|
To get started, please first set up the environment:
|
|
|
```bash |
|
pip install evalplus --upgrade |
|
``` |
|
|
|
<details><summary>⏬ Install nightly version <i>:: click to expand ::</i></summary>
|
<div> |
|
|
|
```bash |
|
pip install "git+https://github.com/evalplus/evalplus.git" --upgrade |
|
``` |
|
|
|
</div> |
|
</details> |
|
|
|
<details><summary>⏬ Using EvalPlus as a local repo? <i>:: click to expand ::</i></summary>
|
<div> |
|
|
|
```bash |
|
git clone https://github.com/evalplus/evalplus.git |
|
cd evalplus |
|
export PYTHONPATH=$PYTHONPATH:$(pwd) |
|
pip install -r requirements.txt |
|
``` |
|
|
|
</div> |
|
</details> |
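
As an optional sanity check, you can verify that the package imports and the dataset loads; a minimal sketch (the first call downloads the dataset, so network access is assumed):

```python
# Optional sanity check: load HumanEval+ and count its tasks.
from evalplus.data import get_human_eval_plus

problems = get_human_eval_plus()
print(len(problems))  # HumanEval has 164 tasks, matching the "164it" in the evaluation output below
```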
|
|
|
|
|
### Code generation |
|
|
|
Implement the `GEN_SOLUTION` function by calling the LLM to produce the complete solution (including the code) and save the samples to `samples.jsonl`:
|
|
|
```python |
|
from evalplus.data import get_[human_eval|mbpp]_plus, write_jsonl |
|
|
|
samples = [ |
|
dict(task_id=task_id, solution=GEN_SOLUTION(problem["prompt"])) |
|
for task_id, problem in get_[human_eval|mbpp]_plus().items() |
|
] |
|
write_jsonl("samples.jsonl", samples) |
|
``` |
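
For illustration, here is one possible `GEN_SOLUTION` sketch using the OpenAI chat API; the model name and the prompt wrapping are assumptions for demonstration, not part of EvalPlus:

```python
# A hypothetical GEN_SOLUTION: query a chat model and return its raw response.
# Note: chat models often wrap code in markdown fences or add natural-language text;
# see the sanitizer and checker tools below for post-processing.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def GEN_SOLUTION(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",  # assumption: any capable code model can be used here
        messages=[{"role": "user", "content": f"Complete the following Python function:\n\n{prompt}"}],
        temperature=0.0,  # greedy-style decoding for a single pass@1 sample
    )
    return response.choices[0].message.content
```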
|
|
|
<details><summary>🤔 Structure of `problem`? <i>:: click to expand ::</i></summary>
|
<div> |
|
|
|
* `task_id` is the identifier string for the task

* `entry_point` is the name of the function

* `prompt` is the function signature with docstring

* `canonical_solution` is the ground-truth implementation (re-implemented to fix bugs in HumanEval)

* `base_input` is the test inputs from the original HumanEval

* `plus_input` is the additional test inputs brought by EvalPlus
|
|
|
</div> |
|
</details> |
|
|
|
> [!Note] |
|
> |
|
> **Expected Schema of `samples.jsonl`** |
|
> |
|
> 1. `task_id`: Task ID; the task IDs are the keys of `get_[human_eval|mbpp]_plus()`
|
> 2. `solution` (optional): Self-contained solution (usually including the prompt) |
|
> * Example: `{"task_id": "HumanEval/?", "solution": "def f():\n return 1"}` |
|
> 3. `completion` (optional): Function body without prompt |
|
> * Example: `{"task_id": "HumanEval/?", "completion": " return 1"}` |
|
> |
|
> Only one of `solution` and `completion` is required. If both are provided, `solution` will be used. |
|
> We also accept solutions in the form of a directory, i.e., `--samples ${SAMPLE_DIR}` where `${SAMPLE_DIR}` is organized as `${SAMPLE_DIR}/${TASK_ID}/${SAMPLE_ID}.py` (`${TASK_ID} = task_id.replace("/", "_")`).
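
For illustration, a minimal sketch of writing samples in that directory layout (the `samples` directory name and the `0.py` sample ID are arbitrary choices, assuming one sample per task):

```python
# Hypothetical conversion of the `samples` list above into the directory layout
# ${SAMPLE_DIR}/${TASK_ID}/${SAMPLE_ID}.py accepted by `--samples`.
import os

SAMPLE_DIR = "samples"  # any directory name works
for sample in samples:  # `samples` as built in the code-generation snippet above
    task_dir = os.path.join(SAMPLE_DIR, sample["task_id"].replace("/", "_"))
    os.makedirs(task_dir, exist_ok=True)
    with open(os.path.join(task_dir, "0.py"), "w") as f:  # one sample per task
        f.write(sample["solution"])
```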
|
|
|
### Code evaluation |
|
|
|
We strongly recommend using a sandbox such as [Docker](https://docs.docker.com/get-docker/):
|
|
|
```bash |
|
docker run -v $(pwd):/app ganler/evalplus:latest --dataset [humaneval|mbpp] --samples samples.jsonl |
|
``` |
|
|
|
...Or if you want to try it locally regardless of the risks ⚠️:
|
|
|
```bash |
|
evalplus.evaluate --dataset [humaneval|mbpp] --samples samples.jsonl |
|
``` |
|
|
|
> [!Warning] |
|
> |
|
> Are you using a very slow machine?
|
> |
|
> LLM solutions are regarded as **failed** on timeout (or OOM, etc.).
|
> Specifically, we set the timeout $T=\max(T_{base}, T_{gt}\times k)$, where: |
|
> |
|
> - $T_{base}$ is the minimal timeout (configurable by `--min-time-limit`; defaults to 1s);

> - $T_{gt}$ is the runtime of the ground-truth solutions (obtained via profiling);

> - $k$ is a configurable factor `--gt-time-limit-factor` (defaults to 4);
|
> |
|
> If your machine is too slow and you are getting high-variance results, try a larger $k$ and $T_{base}$.
|
> |
|
> Additionally, you are **NOT** encouraged to overstress your test bed while running the evaluation.

> For example, using `--parallel 64` on a 4-core machine or running other heavy workloads during evaluation is a bad idea...
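
As a concrete illustration of the timeout rule with the default flags (the ground-truth runtimes below are made-up numbers):

```python
# T = max(T_base, T_gt * k) with defaults --min-time-limit 1 and --gt-time-limit-factor 4
T_base, k = 1.0, 4.0

T_gt = 0.05                   # a fast ground-truth test (seconds, hypothetical)
print(max(T_base, T_gt * k))  # 1.0 -> the minimal time limit dominates

T_gt = 2.0                    # a slow ground-truth test (hypothetical)
print(max(T_base, T_gt * k))  # 8.0 -> the profiled runtime dominates
```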
|
|
|
<details><summary>🤔 Evaluate with local GitHub repo? <i>:: click to expand ::</i></summary>
|
<div> |
|
|
|
```bash |
|
export PYTHONPATH=$PYTHONPATH:$(pwd) |
|
python evalplus/evaluate.py --dataset humaneval --samples samples.jsonl |
|
``` |
|
|
|
</div> |
|
</details> |
|
|
|
<details><summary>⌨️ More command-line flags <i>:: click to expand ::</i></summary>
|
<div> |
|
|
|
* `--parallel`: level of parallelism (by default, half of the cores)

* `--base-only` (store_true): only run base HumanEval tests
|
* `--i-just-wanna-run`: force a re-run |
|
|
|
</div> |
|
</details> |
|
|
|
The output should look like the following (GPT-4 greedy decoding example):
|
|
|
``` |
|
Computing expected output... |
|
Expected outputs computed in 15.18s |
|
Reading samples... |
|
164it [00:04, 37.79it/s] |
|
Evaluating samples... |
|
100%|██████████████████████████████████████████| 164/164 [00:03<00:00, 44.75it/s]
|
Base |
|
{'pass@1': 0.8841463414634146} |
|
Base + Extra |
|
{'pass@1': 0.768} |
|
``` |
|
|
|
- `Base` is the `pass@k` for the original HumanEval

- `Base + Extra` is the `pass@k` for our **HumanEval+** (with extra tests)

- The "k" values include `[1, 10, 100]`; only k values `<=` the sample size are used

- A cache file named like `samples_eval_results.jsonl` will be created. Remove it to re-run the evaluation
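
For reference, `pass@k` is conventionally computed with the unbiased estimator from the HumanEval (Codex) paper; the sketch below shows that standard formula for a single task and is not a copy of EvalPlus internals:

```python
# Unbiased pass@k estimator: n = generated samples for a task, c = correct samples.
# pass@k = 1 - C(n-c, k) / C(n, k), computed in a numerically stable way.
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```

With greedy decoding (one sample per task), `pass@1` simply reduces to the fraction of tasks whose single sample passes all tests.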
|
|
|
<details><summary>🤔 How long does it take? <i>:: click to expand ::</i></summary>
|
<div> |
|
|
|
If you do greedy decoding where there is only one sample for each task, the evaluation should take just a few seconds. |
|
When running 200 samples x 164 tasks x ~700+ tests, it can take around 2-10 minutes using `--parallel 64` and `--test-details`.
|
Here are some tips to speed up the evaluation: |
|
|
|
* Use `--parallel $(nproc)` |
|
* Do **NOT** use `--test-details` if you just want to quickly get pass@k: with `--test-details`, all tests (700+ per task on average) are run, whereas without it, testing of a sample stops at its first failing test.
|
* Use our pre-evaluated results (see [LLM-generated code](#-llm-generated-code))
|
* Use HumanEval+ Mini |
|
|
|
</div> |
|
</details> |
|
|
|
> [!Note] |
|
> |
|
> 🚀 **Try out `HumanEvalPlus-Mini`!** It selects a *minimal* set of additional tests with the highest quality, achieving almost the same effectiveness as the full version. Just add the **`--mini`** flag; it can run 23%+ faster! (Even faster if you evaluate all tests without fail-stop using `--test-details`.)
|
> |
|
> ```bash |
|
> docker run -v $(pwd):/app ganler/evalplus:latest --dataset humaneval --samples samples.jsonl --mini |
|
> # ...Or locally ⚠️
|
> # evalplus.evaluate --dataset humaneval --samples samples.jsonl --mini |
|
> ``` |
|
|
|
|
|
## 💻 LLM-generated code
|
|
|
We also share pre-generated code samples from LLMs we have [evaluated](https://evalplus.github.io/leaderboard.html): |
|
|
|
* **HumanEval+**: See the attachment of our [v0.1.0 release](https://github.com/evalplus/evalplus/releases/tag/v0.1.0). |
|
* **MBPP+**: See the attachment of our [v0.2.0 release](https://github.com/evalplus/evalplus/releases/tag/v0.2.0). |
|
|
|
Each set of samples is packaged in a zip file named like `${model_name}_temp_${temperature}.zip`.
|
You can unzip them to a folder named like `${model_name}_temp_${temperature}` and run the evaluation from scratch with: |
|
|
|
```bash |
|
evalplus.evaluate --dataset humaneval --samples ${model_name}_temp_${temperature} |
|
``` |
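
If you prefer to stay in Python, here is a hypothetical sketch of the unzip step (equivalent to running `unzip`; the placeholder values are examples only):

```python
# Extract ${model_name}_temp_${temperature}.zip into a same-named folder,
# then point `evalplus.evaluate --samples` at that folder.
import zipfile

name = "model_temp_0.0"  # hypothetical ${model_name}_temp_${temperature}
with zipfile.ZipFile(f"{name}.zip") as zf:
    zf.extractall(name)
```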
|
|
|
## 🔨 Useful tools
|
|
|
To use these tools, please first clone the repository from GitHub and install the tool dependencies:
|
|
|
```bash |
|
git clone https://github.com/evalplus/evalplus.git |
|
cd evalplus |
|
pip install -r requirements-tools.txt |
|
``` |
|
|
|
### Syntax checker for LLM-generated code |
|
|
|
Check LLM-produced code and answer the following questions: |
|
|
|
1. Is generation complete for all samples / all problems in the dataset?

2. Is the LLM-generated code compilable? (If not, something could be wrong and you should check.)
|
|
|
```shell |
|
# Set PYTHONPATH to run local Python files |
|
export PYTHONPATH=$PYTHONPATH:$(pwd) |
|
|
|
python tools/checker.py --samples samples.jsonl --dataset [humaneval|mbpp] |
|
# --samples can also be a directory organized as: ${SAMPLE_DIR}/${TASK_ID}/{SAMPLE_ID}.py |
|
``` |
|
|
|
### Post code sanitizer |
|
|
|
LLM-generated code may contain syntax errors, but many of them are easy to fix with simple post-processing.

This tool cleans up LLM-generated code and makes it more likely to compile by, e.g., trimming at extra EOF markers and removing garbage non-code tokens.
|
|
|
```shell |
|
# Set PYTHONPATH to run local Python files |
|
export PYTHONPATH=$PYTHONPATH:$(pwd) |
|
|
|
# 💡 If you are storing codes in a jsonl file:
|
python tools/sanitize.py --samples samples.jsonl --dataset [humaneval|mbpp] |
|
# Sanitized code will be produced to `samples-sanitized.jsonl` |
|
|
|
# 💡 If you are storing codes in directories:
|
python tools/sanitize.py --samples /path/to/vicuna-[??]b_temp_[??] --dataset [humaneval|mbpp] |
|
# Sanitized code will be produced to `/path/to/vicuna-[??]b_temp_[??]-sanitized` |
|
``` |
|
|
|
You should then further check the validity of the sanitized code with `tools/checker.py`.

Sometimes (e.g., with chat models) there might be natural-language lines that break compilation.

You can use `--rm-prefix-lines` to remove such lines by their prefix (e.g., `--rm-prefix-lines "Here's"`).
|
|
|
### Render `pass@k` results to `rich` and LaTeX tables |
|
|
|
```shell |
|
python tools/render.py --type /path/to/[model]-[??]b # NOTE: no `_temp_[??]` |
|
``` |
|
|
|
 |
|
|
|
### Perform test input generation from scratch (TBD) |
|
|
|
|
|
### Name convention |
|
|
|
- `evalplus` is the package name. |
|
- `${DATASET}_plus` is the name of a dataset augmented by `evalplus` (i.e., HumanEval+ and MBPP+).
|
|
|
## π Citation |
|
|
|
```bibtex |
|
@inproceedings{evalplus, |
|
title = {Is Your Code Generated by Chat{GPT} Really Correct? Rigorous Evaluation of Large Language Models for Code Generation}, |
|
author = {Liu, Jiawei and Xia, Chunqiu Steven and Wang, Yuyao and Zhang, Lingming}, |
|
booktitle = {Thirty-seventh Conference on Neural Information Processing Systems}, |
|
year = {2023}, |
|
url = {https://openreview.net/forum?id=1qvx610Cu7}, |
|
} |
|
``` |
|
|
|
## π Acknowledgement |
|
|
|
- [HumanEval](https://github.com/openai/human-eval) |
|
- [MBPP](https://github.com/google-research/google-research/tree/master/mbpp) |
|
|