---
title: BabelCode Eval
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
  This metric implements the evaluation harness for datasets translated with the
  BabelCode framework as described in the paper "Measuring The Impact Of
  Programming Language Distribution" (https://arxiv.org/abs/2302.01973).
---

# Metric Card for bc_eval

## Metric Description
This metric implements the evaluation harness for datasets translated with the BabelCode framework, as described in the paper "Measuring The Impact Of Programming Language Distribution" (https://arxiv.org/abs/2302.01973).

## How to Use
1. Generate predictions for a BabelCode-supported dataset.
2. Aggregate the predictions by their question.
3. With the aggregated predictions for each question, add the `question_info` from the original BabelCode dataset.
4. Run the metric on the `predictions`, `languages`, and `question_infos` (passed as the `question_dicts` argument).
5. The metric returns a tuple: the first value is a dictionary of metrics and the second is a list with the results for each prediction.

```python
import evaluate
from datasets import load_dataset
import os

# Executing model-generated code must be explicitly enabled.
os.environ["HF_ALLOW_CODE_EVAL"] = "1"

predictions = []
languages = []
question_infos = []
ds = load_dataset("gabeorlanski/bc-humaneval", split="test")

for row in ds:
    languages.append(row['language'])
    question_infos.append(row['question_info'])

    # Replace this with however you generate and postprocess predictions.
    # `model` is a placeholder that is not defined here; each entry appended
    # to `predictions` should be the list of candidate programs for the question.
    predictions.append(model.generate(row['signature_with_docstring']))

metric = evaluate.load("bc_eval")
metrics, results = metric.compute(
    predictions=predictions, languages=languages, question_dicts=question_infos, k=[1]
)
```

### Inputs
* `predictions`(`List[List[str]]`): The list of predictions for each question to execute.
* `languages`(`List[str]`): The language to use for each question.
* `question_dicts`(`List[Dict]`): The information for each question.
* `k`(`List[int]`): Number of code candidates to consider in the evaluation (Default: `[1, 10, 100]`).
* `num_workers`(`int`): Number of workers used to evaluate the candidate programs (Default: 4).
* `language_timeout`(`Dict[str,int]`): Timeout to use for each language. If it is not set, the timeout from the question dict is used (Default: None).

### Output Values

The `bc_eval` metric outputs two things:

* `metrics`: a dictionary with the pass rate for each `k` value defined in the arguments and the mean share of tests passed per question (`mean_pct_pass`, reported as a fraction between 0 and 1). The keys are formatted as `{LANGUAGE NAME}/{METRIC NAME}`.
* `results`: a list of dictionaries with the results from each individual prediction.

#### Values from Popular Papers
[PaLM-2](https://arxiv.org/pdf/2305.10403.pdf) performance on BC-HumanEval (`pass@1` with greedy decoding):

| Language   | PaLM 2-S* | PaLM 540B | PaLM-Coder-540B |
|------------|-----------|-----------|-----------------|
| C#         | 24.22     | 20.5      | **26.09**       |
| C++        | **34.16** | 21.74     | 24.22           |
| Go         | 19.25     | 13.66     | **21.12**       |
| Haskell    | **8.7**   | 1.86      | 1.86            |
| Java       | **31.06** | 20.5      | 25.47           |
| JavaScript | **32.3**  | 23.6      | 29.81           |
| Julia      | **16.77** | 2.48      | 4.35            |
| Lua        | **26.09** | 19.25     | 24.84           |
| PHP        | **26.09** | 18.63     | 25.47           |
| Python     | **34.16** | 17.39     | 26.71           |
| Rust       | **28.57** | 16.15     | 22.98           |
| TypeScript | **32.3**  | 17.39     | 30.43           |

### Examples
Full examples with predictions that pass, fail tests, time out, and raise an error.
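
#### Post-processing Sketch

Because the `metrics` keys follow the `{LANGUAGE NAME}/{METRIC NAME}` pattern and each entry of `results` carries an `outcome` field (as in the example outputs in this card), the two return values can be regrouped with a few lines of plain Python. The following is a minimal sketch under those assumptions, not part of the metric itself: the helper names `group_metrics_by_language` and `count_outcomes` are hypothetical, and the sample values are hand-written to mimic the shapes shown in this card rather than produced by a real run.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def group_metrics_by_language(metrics: Dict[str, float]) -> Dict[str, Dict[str, float]]:
    """Turn keys such as "Python/pass@1" into {"Python": {"pass@1": ...}}."""
    grouped: Dict[str, Dict[str, float]] = defaultdict(dict)
    for key, value in metrics.items():
        # Split on the first "/" only, so the metric name is kept whole.
        language, metric_name = key.split("/", 1)
        grouped[language][metric_name] = value
    return dict(grouped)


def count_outcomes(results: List[dict]) -> Counter:
    """Count how many predictions ended in each value of `outcome`."""
    return Counter(result["outcome"] for result in results)


# Hand-written sample values (assumed shapes, not real metric output).
sample_metrics = {"Python/pass@1": 0.0, "Python/mean_pct_pass": 0.5714285714285714}
sample_results = [
    {"qid": 0, "idx": "0", "outcome": "FAILED"},
    {"qid": 0, "idx": "1", "outcome": "HAD_ERROR"},
]

by_language = group_metrics_by_language(sample_metrics)
outcome_counts = count_outcomes(sample_results)
print(by_language)
print(outcome_counts)
```

Grouping by language first makes it easier to build a per-language table like the one above when a single run covers several languages.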
#### Passing Example
```python
import evaluate
from datasets import load_dataset
import os
os.environ["HF_ALLOW_CODE_EVAL"] = "1"

# Load the Python configuration so that the example matches `languages` below.
ds = load_dataset(
    "gabeorlanski/bc-humaneval", "Python", split="test"
)
example = ds[0]
metric = evaluate.load("bc_eval")
languages = ["Python"]
question_infos = [example["question_info"]]
predictions = [["""def has_close_elements(numbers: List[float], threshold: float) -> bool:
    for idx, elem in enumerate(numbers):
        for idx2, elem2 in enumerate(numbers):
            if idx != idx2:
                distance = abs(elem - elem2)
                if distance < threshold:
                    return True

    return False"""
]]

metrics, results = metric.compute(
    predictions=predictions, languages=languages, question_dicts=question_infos, k=[1]
)
```
`metrics` is:
```
{"Python/pass@1": 1.0, "Python/mean_pct_pass": 1.0}
```
`results` is:
```
[{"qid": 0, "idx": "0", "file_path": ".../tmpqt_p3dwn/0", "results": [{"return_code": 0, "runtime": 0.076369, "stdout": "TEST-0...PASSED\r\nTEST-1...PASSED\r\nTEST-2...PASSED\r\nTEST-3...PASSED\r\nTEST-4...PASSED\r\nTEST-5...PASSED\r\nTEST-6...PASSED\r\n", "stderr": "", "timed_out": false}], "failed": false, "timed_out": false, "test_cases": {"0": "PASSED", "1": "PASSED", "2": "PASSED", "3": "PASSED", "4": "PASSED", "5": "PASSED", "6": "PASSED"}, "outcome": "PASSED"}]
```

#### Fails Test Example

```python
import evaluate
from datasets import load_dataset
import os
os.environ["HF_ALLOW_CODE_EVAL"] = "1"

ds = load_dataset(
    "gabeorlanski/bc-humaneval", "Python", split="test"
)
example = ds[0]
metric = evaluate.load("bc_eval")
languages = ["Python"]
question_infos = [example["question_info"]]
# The missing abs() makes negative differences count as "close", so some tests fail.
predictions = [["""def has_close_elements(numbers: List[float], threshold: float) -> bool:
    for idx, elem in enumerate(numbers):
        for idx2, elem2 in enumerate(numbers):
            if idx != idx2:
                distance = elem - elem2
                if distance < threshold:
                    return True

    return False"""
]]

metrics, results = metric.compute(
    predictions=predictions, languages=languages, question_dicts=question_infos, k=[1]
)
```

`metrics` is:
```
{"Python/pass@1": 0.0, "Python/mean_pct_pass": 0.5714285714285714}
```
`results` is:
```
[{"qid": 0, "idx": "0", "file_path": "/tmp7u587vk5/0", "results": [{"return_code": 0, "runtime": 0.08255, "stdout": "TEST-0...PASSED\r\nTEST-1...FAILED\r\nTEST-2...PASSED\r\nTEST-3...FAILED\r\nTEST-4...PASSED\r\nTEST-5...PASSED\r\nTEST-6...FAILED\r\n", "stderr": "", "timed_out": false}], "failed": false, "timed_out": false, "test_cases": {"0": "PASSED", "1": "FAILED", "2": "PASSED", "3": "FAILED", "4": "PASSED", "5": "PASSED", "6": "FAILED"}, "outcome": "FAILED"}]
```

Note that the individual test results are located in the `test_cases` field of each entry in `results`.

#### Timeout Example

```python
import evaluate
from datasets import load_dataset
import os
os.environ["HF_ALLOW_CODE_EVAL"] = "1"

ds = load_dataset(
    "gabeorlanski/bc-humaneval", "Python", split="test"
)
example = ds[0]
metric = evaluate.load("bc_eval")
languages = ["Python"]
question_infos = [example["question_info"]]
# The prediction sleeps for 100 seconds, so the run is cut off by the timeout.
predictions = [["""import time
def has_close_elements(numbers: List[float], threshold: float) -> bool:
    time.sleep(100)
"""
]]

metrics, results = metric.compute(
    predictions=predictions, languages=languages, question_dicts=question_infos, k=[1]
)
```

`metrics` is:
```
{"Python/pass@1": 0.0, "Python/mean_pct_pass": 0.0}
```
`results` is:
```
[{"qid": 0, "idx": "0", "file_path": "/tmp_rz6bhb9/0", "results": [{"return_code": -1, "runtime": 10, "stdout": null, "stderr": null, "timed_out": true}], "failed": false, "timed_out": true, "test_cases": {"0": "MISSING", "1": "MISSING", "2": "MISSING", "3": "MISSING", "4": "MISSING", "5": "MISSING", "6": "MISSING"}, "outcome": "TIMED_OUT"}]
```

#### Error Example

```python
import evaluate
from datasets import load_dataset
import os
os.environ["HF_ALLOW_CODE_EVAL"] = "1"

ds = load_dataset(
    "gabeorlanski/bc-humaneval", "Python", split="test"
)
example = ds[0]
metric = evaluate.load("bc_eval")
languages = ["Python"]
question_infos = [example["question_info"]]
# Two predictions for the same question: the first raises ValueError, the
# second defines a different function, so the tests hit a NameError.
predictions = [["""import time
def has_close_elements(numbers: List[float], threshold: float) -> bool:
    raise ValueError()
""",
"""def add(a, b):
    return a+b"""
]]

metrics, results = metric.compute(
    predictions=predictions, languages=languages, question_dicts=question_infos, k=[1]
)
```

`metrics` is:
```
{"Python/pass@1": 0.0, "Python/mean_pct_pass": 0.0}
```
`results` is:
```
[{"qid": 0, "idx": "0", "file_path": "/tmpjdn51aaa/0", "results": [{"return_code": 0, "runtime": 0.102855, "stdout": "TEST-0...ValueError\r\nTEST-1...ValueError\r\nTEST-2...ValueError\r\nTEST-3...ValueError\r\nTEST-4...ValueError\r\nTEST-5...ValueError\r\nTEST-6...ValueError\r\n", "stderr": "", "timed_out": false}], "failed": false, "timed_out": false, "test_cases": {"0": "ValueError", "1": "ValueError", "2": "ValueError", "3": "ValueError", "4": "ValueError", "5": "ValueError", "6": "ValueError"}, "outcome": "HAD_ERROR"},
{"qid": 0, "idx": "1", "file_path": "/tmpjdn51aaa/1", "results": [{"return_code": 0, "runtime": 0.094347, "stdout": "TEST-0...NameError\r\nTEST-1...NameError\r\nTEST-2...NameError\r\nTEST-3...NameError\r\nTEST-4...NameError\r\nTEST-5...NameError\r\nTEST-6...NameError\r\n", "stderr": "", "timed_out": false}], "failed": false, "timed_out": false, "test_cases": {"0": "NameError", "1": "NameError", "2": "NameError", "3": "NameError", "4": "NameError", "5": "NameError", "6": "NameError"}, "outcome": "HAD_ERROR"}]
```

## Limitations and Bias
This metric requires that the dataset be BabelCode compatible.

## Citation
```
@article{orlanski2023measuring,
  title={Measuring The Impact Of Programming Language Distribution},
  author={Orlanski, Gabriel and Xiao, Kefan and Garcia, Xavier and Hui, Jeffrey and Howland, Joshua and Malmaud, Jonathan and Austin, Jacob and Singh, Rishabh and Catasta, Michele},
  journal={arXiv preprint arXiv:2302.01973},
  year={2023}
}
```