---
title: BabelCode Eval
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
  This metric implements the evaluation harness for datasets translated with the
  BabelCode framework as described in the paper "Measuring The Impact Of
  Programming Language Distribution" (https://arxiv.org/abs/2302.01973).
---
# Metric Card for bc_eval
## Metric Description
This metric implements the evaluation harness for datasets translated with the BabelCode framework as described in the paper "Measuring The Impact Of Programming Language Distribution" (https://arxiv.org/abs/2302.01973).
## How to Use
1. Generate predictions for a BabelCode-supported dataset.
2. Aggregate the predictions by their question.
3. For each question's aggregated predictions, add the `question_info` from the original BabelCode dataset.
4. Run the metric on the `predictions`, `languages`, and `question_dicts`.
5. The metric returns a tuple: the first value is a dictionary of metrics and the second is the list of results for each prediction.
```python
import evaluate
from datasets import load_dataset
import os

os.environ["HF_ALLOW_CODE_EVAL"] = "1"

predictions = []
languages = []
question_infos = []
ds = load_dataset("gabeorlanski/bc-humaneval", split="test")

for row in ds:
    languages.append(row['language'])
    question_infos.append(row['question_info'])

    # Replace this with however you generate and postprocess predictions.
    # `model` is a placeholder; each entry must be a list of program strings.
    predictions.append(model.generate(row['signature_with_docstring']))

metric = evaluate.load("bc_eval")
metrics, results = metric.compute(
    predictions=predictions, languages=languages, question_dicts=question_infos, k=[1]
)
```
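
Step 2 above asks for the predictions to be grouped per question. The following is a minimal sketch of one way to do that, assuming the raw generations are flat records with hypothetical `qid` and `code` fields; the helper `aggregate_by_question` is illustrative and not part of the metric or of BabelCode.

```python
from collections import defaultdict
from typing import Dict, List


def aggregate_by_question(
    generations: List[Dict[str, str]], question_ids: List[str]
) -> List[List[str]]:
    """Group flat generation records into one list of programs per question."""
    grouped = defaultdict(list)
    for record in generations:
        # `qid` and `code` are assumed field names for this sketch.
        grouped[record["qid"]].append(record["code"])
    # Follow the order of `question_ids`; questions with no generations get [].
    return [grouped[qid] for qid in question_ids]


generations = [
    {"qid": "0", "code": "def f(): return 1"},
    {"qid": "1", "code": "def g(): return 2"},
    {"qid": "0", "code": "def f(): return 3"},
]
predictions = aggregate_by_question(generations, ["0", "1"])
print(predictions)
```

The resulting `predictions` has the `List[List[str]]` shape the metric expects, with the outer list aligned to the question order.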
### Inputs
* `predictions` (`List[List[str]]`): The list of predictions to execute for each question.
* `languages` (`List[str]`): The language to use for each question.
* `question_dicts` (`List[Dict]`): The information for each question.
* `k` (`List[int]`): The numbers of code candidates to consider in the evaluation (Default: `[1, 10, 100]`).
* `num_workers` (`int`): The number of workers used to evaluate the candidate programs (Default: 4).
* `language_timeout` (`Dict[str, int]`): Timeouts to use for each language. If not set, the timeout from each question dict is used (Default: None).
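
Because `predictions`, `languages`, and `question_dicts` are parallel lists with one entry per question, it can help to check their alignment before calling the metric. The sketch below is a hypothetical pre-flight check (`check_inputs` is not part of `bc_eval`), written only against the input types listed above.

```python
from typing import Dict, List, Optional


def check_inputs(
    predictions: List[List[str]],
    languages: List[str],
    question_dicts: List[Dict],
    language_timeout: Optional[Dict[str, int]] = None,
) -> None:
    """Raise ValueError if the per-question inputs are misaligned or malformed."""
    if not (len(predictions) == len(languages) == len(question_dicts)):
        raise ValueError(
            "predictions, languages, and question_dicts must have the same length"
        )
    for i, candidates in enumerate(predictions):
        # A bare string here usually means the per-question list was forgotten.
        if isinstance(candidates, str):
            raise ValueError(f"predictions[{i}] must be a list of strings")
    if language_timeout is not None:
        for language, timeout in language_timeout.items():
            if timeout <= 0:
                raise ValueError(f"timeout for {language} must be positive")


# One question, one candidate program, and an illustrative timeout override.
check_inputs([["def f(): pass"]], ["Python"], [{}], language_timeout={"Python": 5})
```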
### Output Values
The `bc_eval` metric outputs two things:
* `metrics`: a dictionary with the pass rates for each `k` value defined in the arguments and the mean percentage of tests passed per question. The keys are formatted as `{LANGUAGE NAME}/{METRIC NAME}`, for example `Python/pass@1`.
* `results`: a list of dictionaries with the results from each individual prediction.
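
Since the `metrics` keys combine the language and the metric name, a small helper can regroup them per language. This is a sketch that assumes only the `{LANGUAGE NAME}/{METRIC NAME}` key format described above; `group_by_language` is a hypothetical name.

```python
from typing import Dict


def group_by_language(metrics: Dict[str, float]) -> Dict[str, Dict[str, float]]:
    """Turn {"Python/pass@1": 1.0} into {"Python": {"pass@1": 1.0}}."""
    grouped: Dict[str, Dict[str, float]] = {}
    for key, value in metrics.items():
        # Split on the first "/" only, in case a metric name contains one.
        language, metric_name = key.split("/", 1)
        grouped.setdefault(language, {})[metric_name] = value
    return grouped


metrics = {"Python/pass@1": 1.0, "Python/mean_pct_pass": 1.0}
print(group_by_language(metrics))
```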
#### Values from Popular Papers
[PaLM-2](https://arxiv.org/pdf/2305.10403.pdf) performance on BC-HumanEval (`pass@1` with greedy decoding):

| Language   | PaLM 2-S* | PaLM 540B | PaLM-Coder-540B |
|------------|-----------|-----------|-----------------|
| C#         | 24.22     | 20.5      | **26.09**       |
| C++        | **34.16** | 21.74     | 24.22           |
| Go         | 19.25     | 13.66     | **21.12**       |
| Haskell    | **8.7**   | 1.86      | 1.86            |
| Java       | **31.06** | 20.5      | 25.47           |
| JavaScript | **32.3**  | 23.6      | 29.81           |
| Julia      | **16.77** | 2.48      | 4.35            |
| Lua        | **26.09** | 19.25     | 24.84           |
| PHP        | **26.09** | 18.63     | 25.47           |
| Python     | **34.16** | 17.39     | 26.71           |
| Rust       | **28.57** | 16.15     | 22.98           |
| TypeScript | **32.3**  | 17.39     | 30.43           |
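
With greedy decoding there is one sample per problem, so `pass@1` is simply the fraction of problems solved. When several samples are drawn per problem, pass@k is commonly computed with the unbiased estimator 1 - C(n - c, k) / C(n, k), where n is the number of samples and c the number that pass. The sketch below implements that formula; whether `bc_eval` uses exactly this estimator internally is an assumption, not something stated in this card.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for n samples of which c are correct."""
    if k > n:
        raise ValueError("k cannot exceed the number of samples n")
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers only: 10 samples for one problem, 5 of them correct.
print(pass_at_k(n=10, c=5, k=1))
```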
### Examples
The examples below cover a prediction that passes, one that fails tests, one that times out, and predictions that raise errors.
#### Passing Example
```python
import evaluate
from datasets import load_dataset
import os

os.environ["HF_ALLOW_CODE_EVAL"] = "1"

ds = load_dataset("gabeorlanski/bc-humaneval", split="test")
example = ds[0]
metric = evaluate.load("bc_eval")

languages = ["Python"]
question_infos = [example["question_info"]]
# One question with a single, correct candidate program.
predictions = [["""def has_close_elements(numbers: List[float], threshold: float) -> bool:
    for idx, elem in enumerate(numbers):
        for idx2, elem2 in enumerate(numbers):
            if idx != idx2:
                distance = abs(elem - elem2)
                if distance < threshold:
                    return True
    return False"""
]]

metrics, results = metric.compute(
    predictions=predictions, languages=languages, question_dicts=question_infos, k=[1]
)
```
`metrics` is:
```
{"Python/pass@1": 1.0, "Python/mean_pct_pass": 1.0}
```
`results` is:
```
[{"qid": 0, "idx": "0", "file_path": ".../tmpqt_p3dwn/0", "results": [{"return_code": 0, "runtime": 0.076369, "stdout": "TEST-0...PASSED\r\nTEST-1...PASSED\r\nTEST-2...PASSED\r\nTEST-3...PASSED\r\nTEST-4...PASSED\r\nTEST-5...PASSED\r\nTEST-6...PASSED\r\n", "stderr": "", "timed_out": false}], "failed": false, "timed_out": false, "test_cases": {"0": "PASSED", "1": "PASSED", "2": "PASSED", "3": "PASSED", "4": "PASSED", "5": "PASSED", "6": "PASSED"}, "outcome": "PASSED"}]
```
#### Fails Test Example
```python
import evaluate
from datasets import load_dataset
import os

os.environ["HF_ALLOW_CODE_EVAL"] = "1"

ds = load_dataset(
    "gabeorlanski/bc-humaneval", "Python", split="test"
)
example = ds[0]
metric = evaluate.load("bc_eval")

languages = ["Python"]
question_infos = [example["question_info"]]
# Deliberately wrong: the distance is not wrapped in abs().
predictions = [["""def has_close_elements(numbers: List[float], threshold: float) -> bool:
    for idx, elem in enumerate(numbers):
        for idx2, elem2 in enumerate(numbers):
            if idx != idx2:
                distance = elem - elem2
                if distance < threshold:
                    return True
    return False"""
]]

metrics, results = metric.compute(
    predictions=predictions, languages=languages, question_dicts=question_infos, k=[1]
)
```
`metrics` is:
```
{"Python/pass@1": 0.0, "Python/mean_pct_pass": 0.5714285714285714}
```
`results` is:
```
[{"qid": 0, "idx": "0", "file_path": "/tmp7u587vk5/0", "results": [{"return_code": 0, "runtime": 0.08255, "stdout": "TEST-0...PASSED\r\nTEST-1...FAILED\r\nTEST-2...PASSED\r\nTEST-3...FAILED\r\nTEST-4...PASSED\r\nTEST-5...PASSED\r\nTEST-6...FAILED\r\n", "stderr": "", "timed_out": false}], "failed": false, "timed_out": false, "test_cases": {"0": "PASSED", "1": "FAILED", "2": "PASSED", "3": "FAILED", "4": "PASSED", "5": "PASSED", "6": "FAILED"}, "outcome": "FAILED"}]
```
Note that the outcome of each individual test is reported in the `test_cases` field of each entry in `results`.
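
The `Python/mean_pct_pass` value above can be reproduced by hand from the per-test outcomes: 4 of the 7 tests passed, and 4/7 is 0.5714285714285714. The sketch below does that arithmetic for a single prediction, assuming the fraction is the number of `PASSED` tests divided by the total number of tests; `pct_passed` is a hypothetical helper, not part of the metric.

```python
from typing import Dict


def pct_passed(test_cases: Dict[str, str]) -> float:
    """Fraction of tests whose outcome is exactly "PASSED"."""
    if not test_cases:
        return 0.0
    passed = sum(1 for outcome in test_cases.values() if outcome == "PASSED")
    return passed / len(test_cases)


# Per-test outcomes copied from the failing example's `results`.
test_cases = {
    "0": "PASSED", "1": "FAILED", "2": "PASSED", "3": "FAILED",
    "4": "PASSED", "5": "PASSED", "6": "FAILED",
}
print(pct_passed(test_cases))  # 4 of 7 tests passed
```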
#### Timeout Example
```python
import evaluate
from datasets import load_dataset
import os

os.environ["HF_ALLOW_CODE_EVAL"] = "1"

ds = load_dataset(
    "gabeorlanski/bc-humaneval", "Python", split="test"
)
example = ds[0]
metric = evaluate.load("bc_eval")

languages = ["Python"]
question_infos = [example["question_info"]]
# The candidate sleeps long enough to exceed the timeout.
predictions = [["""import time
def has_close_elements(numbers: List[float], threshold: float) -> bool:
    time.sleep(100)
"""
]]

metrics, results = metric.compute(
    predictions=predictions, languages=languages, question_dicts=question_infos, k=[1]
)
```
`metrics` is:
```
{"Python/pass@1": 0.0, "Python/mean_pct_pass": 0.0}
```
`results` is:
```
[{"qid": 0, "idx": "0", "file_path": "/tmp_rz6bhb9/0", "results": [{"return_code": -1, "runtime": 10, "stdout": null, "stderr": null, "timed_out": true}], "failed": false, "timed_out": true, "test_cases": {"0": "MISSING", "1": "MISSING", "2": "MISSING", "3": "MISSING", "4": "MISSING", "5": "MISSING", "6": "MISSING"}, "outcome": "TIMED_OUT"}]
```
#### Error Example
```python
import evaluate
from datasets import load_dataset
import os

os.environ["HF_ALLOW_CODE_EVAL"] = "1"

ds = load_dataset(
    "gabeorlanski/bc-humaneval", "Python", split="test"
)
example = ds[0]
metric = evaluate.load("bc_eval")

languages = ["Python"]
question_infos = [example["question_info"]]
# Two candidates for one question: the first raises ValueError,
# the second defines the wrong function, so the tests hit NameError.
predictions = [["""import time
def has_close_elements(numbers: List[float], threshold: float) -> bool:
    raise ValueError()
""",
"""def add(a, b):
    return a+b"""
]]

metrics, results = metric.compute(
    predictions=predictions, languages=languages, question_dicts=question_infos, k=[1]
)
```
`metrics` is:
```
{"Python/pass@1": 0.0, "Python/mean_pct_pass": 0.0}
```
`results` is:
```
[{"qid": 0, "idx": "0", "file_path": "/tmpjdn51aaa/0", "results": [{"return_code": 0, "runtime": 0.102855, "stdout": "TEST-0...ValueError\r\nTEST-1...ValueError\r\nTEST-2...ValueError\r\nTEST-3...ValueError\r\nTEST-4...ValueError\r\nTEST-5...ValueError\r\nTEST-6...ValueError\r\n", "stderr": "", "timed_out": false}], "failed": false, "timed_out": false, "test_cases": {"0": "ValueError", "1": "ValueError", "2": "ValueError", "3": "ValueError", "4": "ValueError", "5": "ValueError", "6": "ValueError"}, "outcome": "HAD_ERROR"},
{"qid": 0, "idx": "1", "file_path": "/tmpjdn51aaa/1", "results": [{"return_code": 0, "runtime": 0.094347, "stdout": "TEST-0...NameError\r\nTEST-1...NameError\r\nTEST-2...NameError\r\nTEST-3...NameError\r\nTEST-4...NameError\r\nTEST-5...NameError\r\nTEST-6...NameError\r\n", "stderr": "", "timed_out": false}], "failed": false, "timed_out": false, "test_cases": {"0": "NameError", "1": "NameError", "2": "NameError", "3": "NameError", "4": "NameError", "5": "NameError", "6": "NameError"}, "outcome": "HAD_ERROR"}]
```
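
Each entry in `results` carries an `outcome` field (`PASSED`, `FAILED`, `TIMED_OUT`, and `HAD_ERROR` appear in the examples above), which makes it easy to summarize a run. The following sketch tallies outcomes with `collections.Counter`; `count_outcomes` is a hypothetical helper, and the input is an abbreviated copy of the error example's two entries.

```python
from collections import Counter
from typing import Dict, List


def count_outcomes(results: List[Dict]) -> Counter:
    """Count how many predictions ended with each `outcome` value."""
    return Counter(entry["outcome"] for entry in results)


# Abbreviated copies of the two entries from the error example.
results = [
    {"qid": 0, "idx": "0", "outcome": "HAD_ERROR"},
    {"qid": 0, "idx": "1", "outcome": "HAD_ERROR"},
]
print(count_outcomes(results))
```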
## Limitations and Bias
This metric requires that the dataset be BabelCode compatible.
## Citation
```
@article{orlanski2023measuring,
  title={Measuring The Impact Of Programming Language Distribution},
  author={Orlanski, Gabriel and Xiao, Kefan and Garcia, Xavier and Hui, Jeffrey and Howland, Joshua and Malmaud, Jonathan and Austin, Jacob and Singh, Rishabh and Catasta, Michele},
  journal={arXiv preprint arXiv:2302.01973},
  year={2023}
}
```