title: BabelCode Eval
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
  This metric implements the evaluation harness for datasets translated with the
  BabelCode framework as described in the paper "Measuring The Impact Of
  Programming Language Distribution" (https://arxiv.org/abs/2302.01973).
Metric Card for bc_eval
Metric Description
This metric implements the evaluation harness for datasets translated with the BabelCode framework as described in the paper "Measuring The Impact Of Programming Language Distribution" (https://arxiv.org/abs/2302.01973).
How to Use
- Generate predictions for BabelCode supported datasets.
- Aggregate the predictions by their question.
- With the aggregated predictions for each question, add the question_info from the original BabelCode dataset.
- Run the metric on the predictions, languages, and question_infos.
- The result of the metric is a tuple where the first value is a metric dict and the second value is the results for each prediction.
import evaluate
from datasets import load_dataset
import os
os.environ["HF_ALLOW_CODE_EVAL"] = "1"
predictions = []
languages = []
question_infos = []
ds = load_dataset("gabeorlanski/bc-humaneval", split="test")
for row in ds:
    languages.append(row['language'])
    question_infos.append(row['question_info'])
    # Replace this with however you generate and postprocess predictions.
    # Each entry must be a list of candidate strings for that question (List[List[str]]).
    predictions.append(model.generate(row['signature_with_docstring']))
metric = evaluate.load("bc_eval")
metrics, results = metric.compute(
predictions=predictions, languages=languages, question_dicts=question_infos, k=[1]
)
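The aggregation step in the list above (grouping predictions by question so they line up with question_info) can be sketched as follows. This is a minimal sketch, assuming each raw generation is stored as a record with hypothetical "qid" and "prediction" keys; the helper name aggregate_predictions is also hypothetical and is not part of bc_eval.

```python
from collections import defaultdict
from typing import Dict, List


def aggregate_predictions(
    records: List[Dict[str, str]], question_ids: List[str]
) -> List[List[str]]:
    """Group flat generation records into one list of predictions per question.

    `records` is assumed to hold dicts with hypothetical "qid" and "prediction"
    keys. The output is ordered like `question_ids`, so that it lines up with
    the `languages` and `question_dicts` lists passed to the metric.
    """
    grouped = defaultdict(list)
    for record in records:
        grouped[record["qid"]].append(record["prediction"])
    # Questions without any generation get an empty list instead of a KeyError.
    return [grouped.get(qid, []) for qid in question_ids]


records = [
    {"qid": "q1", "prediction": "def f(): return 1"},
    {"qid": "q2", "prediction": "def g(): return 2"},
    {"qid": "q1", "prediction": "def f(): return 0"},
]
predictions = aggregate_predictions(records, ["q1", "q2"])
print(predictions)
```

Ordering the output by an explicit list of question ids keeps the three inputs aligned even when generations arrive out of order.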
Inputs
- predictions (List[List[str]]): The list of predictions for each question to execute.
- languages (List[str]): The language to use for each question.
- question_dicts (List[Dict]): The information for each question.
- k (List[int]): Number of code candidates to consider in the evaluation (Default: [1, 10, 100]).
- num_workers (int): Number of workers used to evaluate the candidate programs (Default: 4).
- language_timeout (Dict[str,int]): Timeouts to use for each language. If it is not set, will default to the one in the question dict (Default: None).
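To make the role of k concrete: pass@k is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of predictions for a question and c is the number that pass all tests. The sketch below assumes bc_eval follows that convention, which this card does not state; pass_at_k is a hypothetical helper and not part of the metric's API.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single question.

    n: total predictions sampled for the question
    c: number of those predictions that passed every test
    k: number of candidates considered
    """
    if k > n:
        raise ValueError("k cannot exceed the number of predictions n")
    # If fewer than k predictions failed, every k-subset contains a passing one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=1, c=1, k=1))
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

Under this assumption, a k value larger than the number of predictions supplied per question cannot be estimated, so the values in k should not exceed the length of each inner predictions list.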
Output Values
The bc_eval metric outputs two things:
- metrics: a dictionary with the pass rates for each k value defined in the arguments and the mean percent of tests passed per question. The keys are formatted as {LANGUAGE NAME}/{METRIC NAME}.
- results: a list of dictionaries with the results from each individual prediction.
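Because the keys follow the {LANGUAGE NAME}/{METRIC NAME} pattern, the flat metrics dictionary can be regrouped per language. The following is a minimal sketch; group_metrics_by_language is a hypothetical helper, and the sample values are those reported in the Passing Example of this card.

```python
from collections import defaultdict
from typing import Dict


def group_metrics_by_language(metrics: Dict[str, float]) -> Dict[str, Dict[str, float]]:
    """Split "{LANGUAGE NAME}/{METRIC NAME}" keys into a nested dictionary."""
    grouped = defaultdict(dict)
    for key, value in metrics.items():
        # Split on the first "/" only, so a metric name containing "/" stays intact.
        language, metric_name = key.split("/", 1)
        grouped[language][metric_name] = value
    return dict(grouped)


metrics = {"Python/pass@1": 1.0, "Python/mean_pct_pass": 1.0}
by_language = group_metrics_by_language(metrics)
print(by_language["Python"]["pass@1"])
```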
Values from Popular Papers
PaLM-2 Performance on BC-HumanEval (pass@1 with greedy decoding):
Language | PaLM 2-S* | PaLM 540B | PaLM-Coder-540B |
---|---|---|---|
C# | 24.22 | 20.5 | 26.09 |
C++ | 34.16 | 21.74 | 24.22 |
Go | 19.25 | 13.66 | 21.12 |
Haskell | 8.7 | 1.86 | 1.86 |
Java | 31.06 | 20.5 | 25.47 |
JavaScript | 32.3 | 23.6 | 29.81 |
Julia | 16.77 | 2.48 | 4.35 |
Lua | 26.09 | 19.25 | 24.84 |
PHP | 26.09 | 18.63 | 25.47 |
Python | 34.16 | 17.39 | 26.71 |
Rust | 28.57 | 16.15 | 22.98 |
TypeScript | 32.3 | 17.39 | 30.43 |
Examples
Full example with inputs that fail tests, time out, have an error, and pass.
Passing Example
import evaluate
from datasets import load_dataset
import os
os.environ["HF_ALLOW_CODE_EVAL"] = "1"
ds = load_dataset("gabeorlanski/bc-humaneval", split="test")
example = ds[0]
metric = evaluate.load("bc_eval")
languages = ["Python"]
question_infos = [example["question_info"]]
predictions = [["""def has_close_elements(numbers: List[float], threshold: float) -> bool:
    for idx, elem in enumerate(numbers):
        for idx2, elem2 in enumerate(numbers):
            if idx != idx2:
                distance = abs(elem - elem2)
                if distance < threshold:
                    return True
    return False"""
]]
metrics, results = metric.compute(
predictions=predictions, languages=languages, question_dicts=question_infos, k=[1]
)
metrics is:
{"Python/pass@1": 1.0, "Python/mean_pct_pass": 1.0}
results is:
[{"qid": 0, "idx": "0", "file_path": ".../tmpqt_p3dwn/0", "results": [{"return_code": 0, "runtime": 0.076369, "stdout": "TEST-0...PASSED\r\nTEST-1...PASSED\r\nTEST-2...PASSED\r\nTEST-3...PASSED\r\nTEST-4...PASSED\r\nTEST-5...PASSED\r\nTEST-6...PASSED\r\n", "stderr": "", "timed_out": false}], "failed": false, "timed_out": false, "test_cases": {"0": "PASSED", "1": "PASSED", "2": "PASSED", "3": "PASSED", "4": "PASSED", "5": "PASSED", "6": "PASSED"}, "outcome": "PASSED"}]
Fails Test Example
import evaluate
from datasets import load_dataset
import os
os.environ["HF_ALLOW_CODE_EVAL"] = "1"
ds = load_dataset(
"gabeorlanski/bc-humaneval", "Python", split="test"
)
example = ds[0]
metric = evaluate.load("bc_eval")
languages = ["Python"]
question_infos = [example["question_info"]]
predictions = [["""def has_close_elements(numbers: List[float], threshold: float) -> bool:
    for idx, elem in enumerate(numbers):
        for idx2, elem2 in enumerate(numbers):
            if idx != idx2:
                distance = elem - elem2
                if distance < threshold:
                    return True
    return False"""
]]
metrics, results = metric.compute(
predictions=predictions, languages=languages, question_dicts=question_infos, k=[1]
)
metrics is:
{"Python/pass@1": 0.0, "Python/mean_pct_pass": 0.5714285714285714}
results is:
[{"qid": 0, "idx": "0", "file_path": "/tmp7u587vk5/0", "results": [{"return_code": 0, "runtime": 0.08255, "stdout": "TEST-0...PASSED\r\nTEST-1...FAILED\r\nTEST-2...PASSED\r\nTEST-3...FAILED\r\nTEST-4...PASSED\r\nTEST-5...PASSED\r\nTEST-6...FAILED\r\n", "stderr": "", "timed_out": false}], "failed": false, "timed_out": false, "test_cases": {"0": "PASSED", "1": "FAILED", "2": "PASSED", "3": "FAILED", "4": "PASSED", "5": "PASSED", "6": "FAILED"}, "outcome": "FAILED"}]
Note that the individual test results are located in the test_cases field of each entry in results.
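As a rough illustration of how the per-test results relate to mean_pct_pass, the sketch below recomputes the fraction of passed tests from the test_cases field of the output above. The helper name pct_tests_passed is hypothetical, and treating the metric as the per-question fraction of "PASSED" entries is an assumption that happens to match this example (4 of 7 tests passed).

```python
from typing import Dict


def pct_tests_passed(test_cases: Dict[str, str]) -> float:
    """Fraction of test cases whose recorded outcome is "PASSED"."""
    if not test_cases:
        return 0.0
    passed = sum(1 for outcome in test_cases.values() if outcome == "PASSED")
    return passed / len(test_cases)


# test_cases copied from the failing prediction's entry in `results`.
test_cases = {
    "0": "PASSED", "1": "FAILED", "2": "PASSED", "3": "FAILED",
    "4": "PASSED", "5": "PASSED", "6": "FAILED",
}
print(pct_tests_passed(test_cases))
```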
Timeout Example
import evaluate
from datasets import load_dataset
import os
os.environ["HF_ALLOW_CODE_EVAL"] = "1"
ds = load_dataset(
"gabeorlanski/bc-humaneval", "Python", split="test"
)
example = ds[0]
metric = evaluate.load("bc_eval")
languages = ["Python"]
question_infos = [example["question_info"]]
predictions = [["""import time
def has_close_elements(numbers: List[float], threshold: float) -> bool:
    time.sleep(100)
"""
]]
metrics, results = metric.compute(
predictions=predictions, languages=languages, question_dicts=question_infos, k=[1]
)
metrics is:
{"Python/pass@1": 0.0, "Python/mean_pct_pass": 0.0}
results is:
[{"qid": 0, "idx": "0", "file_path": "/tmp_rz6bhb9/0", "results": [{"return_code": -1, "runtime": 10, "stdout": null, "stderr": null, "timed_out": true}], "failed": false, "timed_out": true, "test_cases": {"0": "MISSING", "1": "MISSING", "2": "MISSING", "3": "MISSING", "4": "MISSING", "5": "MISSING", "6": "MISSING"}, "outcome": "TIMED_OUT"}]
Error Example
import evaluate
from datasets import load_dataset
import os
os.environ["HF_ALLOW_CODE_EVAL"] = "1"
ds = load_dataset(
"gabeorlanski/bc-humaneval", "Python", split="test"
)
example = ds[0]
metric = evaluate.load("bc_eval")
languages = ["Python"]
question_infos = [example["question_info"]]
predictions = [["""import time
def has_close_elements(numbers: List[float], threshold: float) -> bool:
    raise ValueError()
""",
"""def add(a, b):
    return a+b"""
]]
metrics, results = metric.compute(
predictions=predictions, languages=languages, question_dicts=question_infos, k=[1]
)
metrics is:
{"Python/pass@1": 0.0, "Python/mean_pct_pass": 0.0}
results is:
[{"qid": 0, "idx": "0", "file_path": "/tmpjdn51aaa/0", "results": [{"return_code": 0, "runtime": 0.102855, "stdout": "TEST-0...ValueError\r\nTEST-1...ValueError\r\nTEST-2...ValueError\r\nTEST-3...ValueError\r\nTEST-4...ValueError\r\nTEST-5...ValueError\r\nTEST-6...ValueError\r\n", "stderr": "", "timed_out": false}], "failed": false, "timed_out": false, "test_cases": {"0": "ValueError", "1": "ValueError", "2": "ValueError", "3": "ValueError", "4": "ValueError", "5": "ValueError", "6": "ValueError"}, "outcome": "HAD_ERROR"},
{"qid": 0, "idx": "1", "file_path": "/tmpjdn51aaa/1", "results": [{"return_code": 0, "runtime": 0.094347, "stdout": "TEST-0...NameError\r\nTEST-1...NameError\r\nTEST-2...NameError\r\nTEST-3...NameError\r\nTEST-4...NameError\r\nTEST-5...NameError\r\nTEST-6...NameError\r\n", "stderr": "", "timed_out": false}], "failed": false, "timed_out": false, "test_cases": {"0": "NameError", "1": "NameError", "2": "NameError", "3": "NameError", "4": "NameError", "5": "NameError", "6": "NameError"}, "outcome": "HAD_ERROR"}]
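When many predictions are evaluated, it can help to tally the outcome field across results. The following is a minimal sketch that relies only on the "outcome" key visible in the outputs above; count_outcomes is a hypothetical helper, and the sample entries are illustrative, trimmed stand-ins rather than real metric output.

```python
from collections import Counter
from typing import Dict, List


def count_outcomes(results: List[Dict]) -> Dict[str, int]:
    """Count how many predictions ended with each outcome label."""
    return dict(Counter(entry["outcome"] for entry in results))


# Illustrative entries: only the keys needed for the tally are kept.
results = [
    {"qid": 0, "idx": "0", "outcome": "PASSED"},
    {"qid": 0, "idx": "1", "outcome": "HAD_ERROR"},
    {"qid": 1, "idx": "0", "outcome": "TIMED_OUT"},
    {"qid": 1, "idx": "1", "outcome": "HAD_ERROR"},
]
print(count_outcomes(results))
```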
Limitations and Bias
This metric requires that the dataset be BabelCode compatible.
Citation
@article{orlanski2023measuring,
title={Measuring The Impact Of Programming Language Distribution},
author={Orlanski, Gabriel and Xiao, Kefan and Garcia, Xavier and Hui, Jeffrey and Howland, Joshua and Malmaud, Jonathan and Austin, Jacob and Singh, Rishabh and Catasta, Michele},
journal={arXiv preprint arXiv:2302.01973},
year={2023}
}