---
title: BabelCode Eval
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
  This metric implements the evaluation harness for datasets translated with the
  BabelCode framework as described in the paper "Measuring The Impact Of
  Programming Language Distribution" (https://arxiv.org/abs/2302.01973).
---
# Metric Card for bc_eval
## Metric Description
This metric implements the evaluation harness for datasets translated with the BabelCode framework as described in the paper "Measuring The Impact Of Programming Language Distribution" (https://arxiv.org/abs/2302.01973).
## How to Use
1. Generate predictions for a BabelCode-supported dataset.
2. Aggregate the predictions by their question.
3. For each question's aggregated predictions, add the `question_info` from the original BabelCode dataset.
4. Run the metric on the `predictions`, `languages`, and `question_infos` (passed as the `question_dicts` argument).
5. The metric returns a tuple: the first element is a dictionary of metrics and the second is a list with the results for each prediction.
```Python
import evaluate
from datasets import load_dataset
import os
os.environ["HF_ALLOW_CODE_EVAL"] = "1"
predictions = []
languages = []
question_infos = []
ds = load_dataset("gabeorlanski/bc-humaneval", split="test")
for row in ds:
    languages.append(row['language'])
    question_infos.append(row['question_info'])
    # Replace this with however you generate and postprocess predictions.
    # `model` is a placeholder; each entry appended here must be a list of
    # candidate programs (strings) for the question.
    predictions.append(model.generate(row['signature_with_docstring']))
metric = evaluate.load("gabeorlanski/bc_eval")
metrics, results = metric.compute(
    predictions=predictions, languages=languages, question_dicts=question_infos, k=[1]
)
```
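Step 2 (aggregating predictions by question) can be sketched as follows. This is a minimal illustration that assumes the generations arrive as a flat list of records. The record keys `language`, `qid`, and `prediction` and the helper `group_by_question` are hypothetical names chosen for this example; they are not part of `bc_eval` or the dataset.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_by_question(samples: List[Dict[str, str]]) -> Dict[Tuple[str, str], List[str]]:
    """Group flat samples into one list of predictions per (language, qid)."""
    grouped: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for sample in samples:
        grouped[(sample["language"], sample["qid"])].append(sample["prediction"])
    return dict(grouped)


# Hypothetical flat model outputs: two samples for question "0", one for "1".
samples = [
    {"language": "Python", "qid": "0", "prediction": "def f(): return 1"},
    {"language": "Python", "qid": "1", "prediction": "def g(): return 2"},
    {"language": "Python", "qid": "0", "prediction": "def f(): return 3"},
]
grouped = group_by_question(samples)
# One inner list per question, in first-seen order of the (language, qid) keys.
predictions = list(grouped.values())
```

The `languages` and `question_dicts` lists passed to the metric then need to follow the same question order as `predictions`.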
### Inputs
* `predictions` (`List[List[str]]`): The list of predictions to execute for each question.
* `languages` (`List[str]`): The language to use for each question.
* `question_dicts` (`List[Dict]`): The information for each question.
* `k` (`List[int]`): The numbers of code candidates to consider in the evaluation (Default: [1, 10, 100]).
* `num_workers` (`int`): The number of workers used to evaluate the candidate programs (Default: 4).
* `language_timeout` (`Dict[str,int]`): The timeout to use for each language. If not set, the timeout from the question dict is used (Default: None).
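For intuition about the `k` argument, pass@k is commonly computed with the unbiased estimator introduced for HumanEval. The sketch below assumes, without confirmation from this card, that `bc_eval` follows the same formula. `pass_at_k` is an illustrative helper and not part of the metric's API.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: number of samples generated for a question
    c: number of those samples that passed all tests
    k: number of candidates considered
    """
    if k > n:
        raise ValueError("k cannot exceed the number of samples n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

With a single greedy sample per question (`n = 1`, `k = 1`), the estimate is simply 1.0 if the sample passes and 0.0 otherwise.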
### Output Values
The `bc_eval` metric outputs two things:
* `metrics`: a dictionary with the pass rates for each k value defined in the arguments and the mean percent of tests passed per question. The keys are formatted as `{LANGUAGE NAME}/{METRIC NAME}`.
* `results`: a list of dictionaries with the results from each individual prediction.
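Because the keys of `metrics` follow the `{LANGUAGE NAME}/{METRIC NAME}` pattern, they can be regrouped per language. The helper `split_by_language` below is a hypothetical convenience function, not something `bc_eval` provides, and the sample dictionary is copied from the passing example in this card.

```python
from collections import defaultdict
from typing import Dict


def split_by_language(metrics: Dict[str, float]) -> Dict[str, Dict[str, float]]:
    """Turn {"Lang/metric": value} into {"Lang": {"metric": value}}."""
    by_language: Dict[str, Dict[str, float]] = defaultdict(dict)
    for key, value in metrics.items():
        # Split on the first "/" only, which separates language from metric name.
        language, metric_name = key.split("/", 1)
        by_language[language][metric_name] = value
    return dict(by_language)


metrics = {"Python/pass@1": 1.0, "Python/mean_pct_pass": 1.0}
per_language = split_by_language(metrics)
```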
#### Values from Popular Papers
[PaLM-2](https://arxiv.org/pdf/2305.10403.pdf) Performance on BC-HumanEval (`pass@1` with greedy decoding):
| Language | PaLM 2-S* | PaLM 540B | PaLM-Coder-540B |
|------------|-----------|-----------|-----------------|
| C# | 24.22 | 20.5 | **26.09** |
| C++ | **34.16** | 21.74 | 24.22 |
| Go | 19.25 | 13.66 | **21.12** |
| Haskell | **8.7** | 1.86 | 1.86 |
| Java | **31.06** | 20.5 | 25.47 |
| JavaScript | **32.3** | 23.6 | 29.81 |
| Julia | **16.77** | 2.48 | 4.35 |
| Lua | **26.09** | 19.25 | 24.84 |
| PHP | **26.09** | 18.63 | 25.47 |
| Python | **34.16** | 17.39 | 26.71 |
| Rust | **28.57** | 16.15 | 22.98 |
| TypeScript | **32.3** | 17.39 | 30.43 |
### Examples
Full examples follow for a prediction that passes, one that fails tests, one that times out, and two that raise errors.
#### Passing Example
```Python
import evaluate
from datasets import load_dataset
import os
os.environ["HF_ALLOW_CODE_EVAL"] = "1"
ds = load_dataset("gabeorlanski/bc-humaneval", split="test")
example = ds[0]
metric = evaluate.load("gabeorlanski/bc_eval")
languages = ["Python"]
question_infos = [example["question_info"]]
predictions = [["""def has_close_elements(numbers: List[float], threshold: float) -> bool:
    for idx, elem in enumerate(numbers):
        for idx2, elem2 in enumerate(numbers):
            if idx != idx2:
                distance = abs(elem - elem2)
                if distance < threshold:
                    return True
    return False"""
]]
metrics, results = metric.compute(
    predictions=predictions, languages=languages, question_dicts=question_infos, k=[1]
)
```
`metrics` is:
```
{"Python/pass@1": 1.0, "Python/mean_pct_pass": 1.0}
```
`results` is:
```
[
    {
        "qid": 0,
        "idx": "0",
        "file_path": ".../tmpqt_p3dwn/0",
        "results": [
            {
                "return_code": 0,
                "runtime": 0.076369,
                "stdout": "TEST-0...PASSED\r\nTEST-1...PASSED\r\nTEST-2...PASSED\r\nTEST-3...PASSED\r\nTEST-4...PASSED\r\nTEST-5...PASSED\r\nTEST-6...PASSED\r\n",
                "stderr": "",
                "timed_out": false
            }
        ],
        "failed": false,
        "timed_out": false,
        "test_cases": {
            "0": "PASSED",
            "1": "PASSED",
            "2": "PASSED",
            "3": "PASSED",
            "4": "PASSED",
            "5": "PASSED",
            "6": "PASSED"
        },
        "outcome": "PASSED"
    }
]
```
#### Fails Test Example
```python
import evaluate
from datasets import load_dataset
import os
os.environ["HF_ALLOW_CODE_EVAL"] = "1"
ds = load_dataset(
    "gabeorlanski/bc-humaneval", "Python", split="test"
)
example = ds[0]
metric = evaluate.load("gabeorlanski/bc_eval")
languages = ["Python"]
question_infos = [example["question_info"]]
predictions = [["""def has_close_elements(numbers: List[float], threshold: float) -> bool:
    for idx, elem in enumerate(numbers):
        for idx2, elem2 in enumerate(numbers):
            if idx != idx2:
                distance = elem - elem2
                if distance < threshold:
                    return True
    return False"""
]]
metrics, results = metric.compute(
    predictions=predictions, languages=languages, question_dicts=question_infos, k=[1]
)
```
`metrics` is:
```
{"Python/pass@1": 0.0, "Python/mean_pct_pass": 0.5714285714285714}
```
`results` is:
```
[{"qid": 0, "idx": "0", "file_path": "/tmp7u587vk5/0", "results": [{"return_code": 0, "runtime": 0.08255, "stdout": "TEST-0...PASSED\r\nTEST-1...FAILED\r\nTEST-2...PASSED\r\nTEST-3...FAILED\r\nTEST-4...PASSED\r\nTEST-5...PASSED\r\nTEST-6...FAILED\r\n", "stderr": "", "timed_out": false}], "failed": false, "timed_out": false, "test_cases": {"0": "PASSED", "1": "FAILED", "2": "PASSED", "3": "FAILED", "4": "PASSED", "5": "PASSED", "6": "FAILED"}, "outcome": "FAILED"}]
```
Note that the outcome of each individual test case is reported in the `test_cases` field of the corresponding entry in `results`.
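The per-test outcomes can be summarized directly from a `results` entry. The sketch below uses an abbreviated copy of the entry above (only `test_cases` and `outcome` are kept) and a hypothetical helper, `fraction_passed`. That `bc_eval` derives `mean_pct_pass` in exactly this way is an assumption, although the value agrees for this single-prediction example.

```python
from typing import Dict


def fraction_passed(result: Dict) -> float:
    """Fraction of test cases whose recorded outcome is "PASSED"."""
    test_cases = result["test_cases"]
    passed = sum(1 for outcome in test_cases.values() if outcome == "PASSED")
    return passed / len(test_cases)


# Abbreviated entry from the failing example above.
result = {
    "test_cases": {
        "0": "PASSED", "1": "FAILED", "2": "PASSED", "3": "FAILED",
        "4": "PASSED", "5": "PASSED", "6": "FAILED",
    },
    "outcome": "FAILED",
}
print(fraction_passed(result))  # 4 of 7 tests passed
```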
#### Timeout Example
```python
import evaluate
from datasets import load_dataset
import os
os.environ["HF_ALLOW_CODE_EVAL"] = "1"
ds = load_dataset(
    "gabeorlanski/bc-humaneval", "Python", split="test"
)
example = ds[0]
metric = evaluate.load("gabeorlanski/bc_eval")
languages = ["Python"]
question_infos = [example["question_info"]]
predictions = [["""import time
def has_close_elements(numbers: List[float], threshold: float) -> bool:
    time.sleep(100)
"""
]]
metrics, results = metric.compute(
    predictions=predictions, languages=languages, question_dicts=question_infos, k=[1]
)
```
`metrics` is:
```
{"Python/pass@1": 0.0, "Python/mean_pct_pass": 0.0}
```
`results` is:
```
[{"qid": 0, "idx": "0", "file_path": "/tmp_rz6bhb9/0", "results": [{"return_code": -1, "runtime": 10, "stdout": null, "stderr": null, "timed_out": true}], "failed": false, "timed_out": true, "test_cases": {"0": "MISSING", "1": "MISSING", "2": "MISSING", "3": "MISSING", "4": "MISSING", "5": "MISSING", "6": "MISSING"}, "outcome": "TIMED_OUT"}]
```
#### Error Example
```python
import evaluate
from datasets import load_dataset
import os
os.environ["HF_ALLOW_CODE_EVAL"] = "1"
ds = load_dataset(
    "gabeorlanski/bc-humaneval", "Python", split="test"
)
example = ds[0]
metric = evaluate.load("gabeorlanski/bc_eval")
languages = ["Python"]
question_infos = [example["question_info"]]
predictions = [["""import time
def has_close_elements(numbers: List[float], threshold: float) -> bool:
    raise ValueError()
""",
"""def add(a, b):
    return a+b"""
]]
metrics, results = metric.compute(
    predictions=predictions, languages=languages, question_dicts=question_infos, k=[1]
)
```
`metrics` is:
```
{"Python/pass@1": 0.0, "Python/mean_pct_pass": 0.0}
```
`results` is:
```
[{"qid": 0, "idx": "0", "file_path": "/tmpjdn51aaa/0", "results": [{"return_code": 0, "runtime": 0.102855, "stdout": "TEST-0...ValueError\r\nTEST-1...ValueError\r\nTEST-2...ValueError\r\nTEST-3...ValueError\r\nTEST-4...ValueError\r\nTEST-5...ValueError\r\nTEST-6...ValueError\r\n", "stderr": "", "timed_out": false}], "failed": false, "timed_out": false, "test_cases": {"0": "ValueError", "1": "ValueError", "2": "ValueError", "3": "ValueError", "4": "ValueError", "5": "ValueError", "6": "ValueError"}, "outcome": "HAD_ERROR"},
{"qid": 0, "idx": "1", "file_path": "/tmpjdn51aaa/1", "results": [{"return_code": 0, "runtime": 0.094347, "stdout": "TEST-0...NameError\r\nTEST-1...NameError\r\nTEST-2...NameError\r\nTEST-3...NameError\r\nTEST-4...NameError\r\nTEST-5...NameError\r\nTEST-6...NameError\r\n", "stderr": "", "timed_out": false}], "failed": false, "timed_out": false, "test_cases": {"0": "NameError", "1": "NameError", "2": "NameError", "3": "NameError", "4": "NameError", "5": "NameError", "6": "NameError"}, "outcome": "HAD_ERROR"}]
```
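When many predictions are evaluated, it can help to tally the `outcome` field across `results`. The following is a minimal sketch using abbreviated copies of the two entries above. `count_outcomes` is a hypothetical helper, and only the outcome values shown in this card (`PASSED`, `FAILED`, `TIMED_OUT`, `HAD_ERROR`) are known from the examples.

```python
from collections import Counter
from typing import Dict, List


def count_outcomes(results: List[Dict]) -> Counter:
    """Count how many predictions ended with each outcome label."""
    return Counter(result["outcome"] for result in results)


# Abbreviated entries from the error example above.
results = [
    {"qid": 0, "idx": "0", "outcome": "HAD_ERROR"},
    {"qid": 0, "idx": "1", "outcome": "HAD_ERROR"},
]
outcome_counts = count_outcomes(results)
```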
## Limitations and Bias
This metric requires that the dataset be BabelCode compatible.
## Citation
```
@article{orlanski2023measuring,
  title={Measuring The Impact Of Programming Language Distribution},
  author={Orlanski, Gabriel and Xiao, Kefan and Garcia, Xavier and Hui, Jeffrey and Howland, Joshua and Malmaud, Jonathan and Austin, Jacob and Singh, Rishabh and Catasta, Michele},
  journal={arXiv preprint arXiv:2302.01973},
  year={2023}
}
```