Spaces: Runtime error
gabeorlanski committed
Commit • 1359055
1 Parent(s): 9c8145f

BC eval

Browse files:
- README.md +242 -5
- app.py +5 -0
- bc_eval.py +335 -0
- execution.py +145 -0
- requirements.txt +1 -0
README.md
CHANGED
@@ -1,12 +1,249 @@
 ---
-title:
-colorFrom: pink
+title: BabelCode Eval
+colorFrom: blue
 colorTo: red
 sdk: gradio
-sdk_version: 3.
+sdk_version: 3.19.1
 app_file: app.py
 pinned: false
+tags:
+- evaluate
+- metric
+description: >-
+  This metric implements the evaluation harness for datasets translated with the
+  BabelCode framework as described in the paper "Measuring The Impact Of
+  Programming Language Distribution" (https://arxiv.org/abs/2302.01973).
 ---

# Metric Card for bc_eval

## Metric Description
This metric implements the evaluation harness for datasets translated with the BabelCode framework, as described in the paper "Measuring The Impact Of Programming Language Distribution" (https://arxiv.org/abs/2302.01973).

## How to Use
1. Generate predictions for a BabelCode-supported dataset.
2. Aggregate the predictions by their question.
3. For each question's aggregated predictions, add the `question_info` from the original BabelCode dataset.
4. Run the metric on the `predictions`, `languages`, and `question_infos`.
5. The metric returns a tuple: the first value is a dict of aggregate metrics and the second is the per-prediction results, as in the snippet below.

```python
import evaluate
from datasets import load_dataset
import os

os.environ["HF_ALLOW_CODE_EVAL"] = "1"

predictions = []
languages = []
question_infos = []
ds = load_dataset("gabeorlanski/bc-humaneval", split="test")

for row in ds:
    languages.append(row['language'])
    question_infos.append(row['question_info'])

    # Replace this with however you generate and postprocess predictions.
    predictions.append(model.generate(row['signature_with_docstring']))


metric = evaluate.load("bc_eval")
metrics, results = metric.compute(
    predictions=predictions, languages=languages, question_dicts=question_infos, k=[1]
)
```

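If you draw several samples per question (for example, to report `pass@10`), `predictions` must stay a list of lists with one inner list per question. Below is a minimal sketch of that aggregation, continuing from the snippet above and using a hypothetical `generate_samples` helper in place of your own sampling loop:

```python
# Hypothetical aggregation: generate_samples is a placeholder for however you
# sample candidate programs; it should return a list of strings per question.
predictions, languages, question_infos = [], [], []
for row in ds:
    languages.append(row["language"])
    question_infos.append(row["question_info"])
    predictions.append(generate_samples(row["signature_with_docstring"], n=10))

metrics, results = metric.compute(
    predictions=predictions, languages=languages, question_dicts=question_infos, k=[1, 10]
)
```
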
### Inputs
* `predictions` (`List[List[str]]`): The list of predictions for each question to execute.
* `languages` (`List[str]`): The language to use for each question.
* `question_dicts` (`List[Dict]`): The information for each question.
* `k` (`List[int]`): Number of code candidates to consider in the evaluation (Default: `[1, 10, 100]`).
* `num_workers` (`int`): Number of workers used to evaluate the candidate programs (Default: 4).
* `language_timeout` (`Dict[str, int]`): Timeouts to use for each language. If not set, defaults to the timeout in the question dict (Default: None).

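For reference, the harness in `bc_eval.py` reads the fields below from each question dict. The values shown are illustrative placeholders, not entries from a real BabelCode dataset:

```python
# Hypothetical, abbreviated question_info; real values come from the BabelCode dataset.
question_info = {
    "entry_fn_name": "add",          # substituted for PLACEHOLDER_FN_NAME in test_code
    "entry_cls_name": "Solution",    # substituted for PLACEHOLDER_CLS_NAME in test_code
    "test_code": "...",              # test harness containing PLACEHOLDER_CODE_BODY
    "test_list": "...",              # raw test cases
    "test_case_ids": ["0", "1"],     # ids expected in the TEST-<id>...<result> stdout lines
    "extension": "py",               # file extension for the written prediction
    "commands": [["python", "__FILENAME__"]],  # __FILENAME__ is replaced with the file name
    "timeouts": [10],                # per-command timeouts in seconds
}
```
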
### Output Values

The `bc_eval` metric outputs two things:

* `metrics`: a dictionary with the pass rates for each `k` value defined in the arguments and the mean percent of tests passed per question. The keys are formatted as `{LANGUAGE NAME}/{METRIC NAME}`.

* `results`: a list of dictionaries with the results from each individual prediction.

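For reference, the pass@k values are computed per language with the unbiased estimator implemented in `estimate_pass_at_k` in `bc_eval.py` (the same estimator used for HumanEval), where $n$ is the number of predictions for a question and $c$ is the number of those predictions that pass every test case:

$$
\text{pass@}k = \mathop{\mathbb{E}}_{\text{questions}} \left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right]
$$
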
#### Values from Popular Papers
[PaLM-2](https://arxiv.org/pdf/2305.10403.pdf) performance on BC-HumanEval (`pass@1` with greedy decoding):

| Language   | PaLM 2-S* | PaLM 540B | PaLM-Coder-540B |
|------------|-----------|-----------|-----------------|
| C#         | 24.22     | 20.5      | **26.09**       |
| C++        | **34.16** | 21.74     | 24.22           |
| Go         | 19.25     | 13.66     | **21.12**       |
| Haskell    | **8.7**   | 1.86      | 1.86            |
| Java       | **31.06** | 20.5      | 25.47           |
| JavaScript | **32.3**  | 23.6      | 29.81           |
| Julia      | **16.77** | 2.48      | 4.35            |
| Lua        | **26.09** | 19.25     | 24.84           |
| PHP        | **26.09** | 18.63     | 25.47           |
| Python     | **34.16** | 17.39     | 26.71           |
| Rust       | **28.57** | 16.15     | 22.98           |
| TypeScript | **32.3**  | 17.39     | 30.43           |

### Examples
Below are full examples with inputs that pass, fail tests, time out, and have an error.

#### Passing Example
```python
import evaluate
from datasets import load_dataset
import os

os.environ["HF_ALLOW_CODE_EVAL"] = "1"
ds = load_dataset("gabeorlanski/bc-humaneval", split="test")
example = ds[0]
metric = evaluate.load("bc_eval")
languages = ["Python"]
question_infos = [example["question_info"]]
predictions = [["""def has_close_elements(numbers: List[float], threshold: float) -> bool:
    for idx, elem in enumerate(numbers):
        for idx2, elem2 in enumerate(numbers):
            if idx != idx2:
                distance = abs(elem - elem2)
                if distance < threshold:
                    return True

    return False"""
]]
metrics, results = metric.compute(
    predictions=predictions, languages=languages, question_dicts=question_infos, k=[1]
)
```
`metrics` is:
```
{"Python/pass@1": 1.0, "Python/mean_pct_pass": 1.0}
```
`results` is:
```
[{"qid": 0, "idx": "0", "file_path": ".../tmpqt_p3dwn/0", "results": [{"return_code": 0, "runtime": 0.076369, "stdout": "TEST-0...PASSED\r\nTEST-1...PASSED\r\nTEST-2...PASSED\r\nTEST-3...PASSED\r\nTEST-4...PASSED\r\nTEST-5...PASSED\r\nTEST-6...PASSED\r\n", "stderr": "", "timed_out": false}], "failed": false, "timed_out": false, "test_cases": {"0": "PASSED", "1": "PASSED", "2": "PASSED", "3": "PASSED", "4": "PASSED", "5": "PASSED", "6": "PASSED"}, "outcome": "PASSED"}]
```

#### Fails Test Example

```python
import evaluate
from datasets import load_dataset
import os

os.environ["HF_ALLOW_CODE_EVAL"] = "1"
ds = load_dataset(
    "gabeorlanski/bc-humaneval", "Python", split="test"
)
example = ds[0]
metric = evaluate.load("bc_eval")
languages = ["Python"]
question_infos = [example["question_info"]]
predictions = [["""def has_close_elements(numbers: List[float], threshold: float) -> bool:
    for idx, elem in enumerate(numbers):
        for idx2, elem2 in enumerate(numbers):
            if idx != idx2:
                distance = elem - elem2
                if distance < threshold:
                    return True

    return False"""
]]
metrics, results = metric.compute(
    predictions=predictions, languages=languages, question_dicts=question_infos, k=[1]
)
```

`metrics` is:
```
{"Python/pass@1": 0.0, "Python/mean_pct_pass": 0.5714285714285714}
```
`results` is:
```
[{"qid": 0, "idx": "0", "file_path": "/tmp7u587vk5/0", "results": [{"return_code": 0, "runtime": 0.08255, "stdout": "TEST-0...PASSED\r\nTEST-1...FAILED\r\nTEST-2...PASSED\r\nTEST-3...FAILED\r\nTEST-4...PASSED\r\nTEST-5...PASSED\r\nTEST-6...FAILED\r\n", "stderr": "", "timed_out": false}], "failed": false, "timed_out": false, "test_cases": {"0": "PASSED", "1": "FAILED", "2": "PASSED", "3": "FAILED", "4": "PASSED", "5": "PASSED", "6": "FAILED"}, "outcome": "FAILED"}]
```

Note that the individual test-case results are located in `results`.

#### Timeout Example

```python
import evaluate
from datasets import load_dataset
import os

os.environ["HF_ALLOW_CODE_EVAL"] = "1"
ds = load_dataset(
    "gabeorlanski/bc-humaneval", "Python", split="test"
)
example = ds[0]
metric = evaluate.load("bc_eval")
languages = ["Python"]
question_infos = [example["question_info"]]
predictions = [["""import time
def has_close_elements(numbers: List[float], threshold: float) -> bool:
    time.sleep(100)
"""
]]
metrics, results = metric.compute(
    predictions=predictions, languages=languages, question_dicts=question_infos, k=[1]
)
```

`metrics` is:
```
{"Python/pass@1": 0.0, "Python/mean_pct_pass": 0.0}
```
`results` is:
```
[{"qid": 0, "idx": "0", "file_path": "/tmp_rz6bhb9/0", "results": [{"return_code": -1, "runtime": 10, "stdout": null, "stderr": null, "timed_out": true}], "failed": false, "timed_out": true, "test_cases": {"0": "MISSING", "1": "MISSING", "2": "MISSING", "3": "MISSING", "4": "MISSING", "5": "MISSING", "6": "MISSING"}, "outcome": "TIMED_OUT"}]
```

#### Error Example

```python
import evaluate
from datasets import load_dataset
import os

os.environ["HF_ALLOW_CODE_EVAL"] = "1"
ds = load_dataset(
    "gabeorlanski/bc-humaneval", "Python", split="test"
)
example = ds[0]
metric = evaluate.load("bc_eval")
languages = ["Python"]
question_infos = [example["question_info"]]
predictions = [["""import time
def has_close_elements(numbers: List[float], threshold: float) -> bool:
    raise ValueError()
""",
"""def add(a, b):
    return a+b"""
]]
metrics, results = metric.compute(
    predictions=predictions, languages=languages, question_dicts=question_infos, k=[1]
)
```

`metrics` is:
```
{"Python/pass@1": 0.0, "Python/mean_pct_pass": 0.0}
```
`results` is:
```
[{"qid": 0, "idx": "0", "file_path": "/tmpjdn51aaa/0", "results": [{"return_code": 0, "runtime": 0.102855, "stdout": "TEST-0...ValueError\r\nTEST-1...ValueError\r\nTEST-2...ValueError\r\nTEST-3...ValueError\r\nTEST-4...ValueError\r\nTEST-5...ValueError\r\nTEST-6...ValueError\r\n", "stderr": "", "timed_out": false}], "failed": false, "timed_out": false, "test_cases": {"0": "ValueError", "1": "ValueError", "2": "ValueError", "3": "ValueError", "4": "ValueError", "5": "ValueError", "6": "ValueError"}, "outcome": "HAD_ERROR"},
{"qid": 0, "idx": "1", "file_path": "/tmpjdn51aaa/1", "results": [{"return_code": 0, "runtime": 0.094347, "stdout": "TEST-0...NameError\r\nTEST-1...NameError\r\nTEST-2...NameError\r\nTEST-3...NameError\r\nTEST-4...NameError\r\nTEST-5...NameError\r\nTEST-6...NameError\r\n", "stderr": "", "timed_out": false}], "failed": false, "timed_out": false, "test_cases": {"0": "NameError", "1": "NameError", "2": "NameError", "3": "NameError", "4": "NameError", "5": "NameError", "6": "NameError"}, "outcome": "HAD_ERROR"}]
```

## Limitations and Bias
This metric requires that the dataset be BabelCode compatible.

## Citation
```
@article{orlanski2023measuring,
  title={Measuring The Impact Of Programming Language Distribution},
  author={Orlanski, Gabriel and Xiao, Kefan and Garcia, Xavier and Hui, Jeffrey and Howland, Joshua and Malmaud, Jonathan and Austin, Jacob and Singh, Rishabh and Catasta, Michele},
  journal={arXiv preprint arXiv:2302.01973},
  year={2023}
}
```

app.py
ADDED
@@ -0,0 +1,5 @@
import evaluate
from evaluate.utils import launch_gradio_widget

module = evaluate.load("gabeorlanski/bc_eval")
launch_gradio_widget(module)
bc_eval.py
ADDED
@@ -0,0 +1,335 @@
import dataclasses
import itertools
import os
import re
import tempfile
from collections import defaultdict
from pathlib import Path

import datasets
import evaluate
import numpy as np
from tqdm import tqdm

from .execution import execute_predictions

# Parses lines of the form "TEST-<id>...<result>" from an executed program's stdout.
STDOUT_PARSE_REGEX = re.compile(r"^TEST-(.+)\.\.\.(.+)$", flags=re.MULTILINE)

_CITATION = """\
@article{orlanski2023measuring,
  title={Measuring The Impact Of Programming Language Distribution},
  author={Orlanski, Gabriel and Xiao, Kefan and Garcia, Xavier and Hui, Jeffrey and Howland, Joshua and Malmaud, Jonathan and Austin, Jacob and Singh, Rishabh and Catasta, Michele},
  journal={arXiv preprint arXiv:2302.01973},
  year={2023}
}
"""

_DESCRIPTION = """\
This metric implements the evaluation harness for datasets translated with the BabelCode framework as described in the paper "Measuring The Impact Of Programming Language Distribution" (https://arxiv.org/abs/2302.01973).
"""


_KWARGS_DESCRIPTION = """
Calculates how many predictions per question pass a set of tests for the given problem.

Args:
    predictions: The list of predictions for each question to execute.
    languages: The language to use for each question.
    question_dicts: The information for each question.
    k: number of code candidates to consider in the evaluation (Default: [1, 10, 100])
    num_workers: number of workers used to evaluate the candidate programs (Default: 4).
    language_timeout: Timeouts to use for each language. If it is not set, will default to the one in the question dict (Default: None).
Returns:
    pass_at_k: dict with pass rates for each k
    results: dict with granular results of each unittest
Examples:
    >>> bc_eval = evaluate.load("bc_eval")
    >>> predictions = [["def add(a,b):\n\treturn a+b", "def add(a,b):\n\treturn a-b"]]
    >>> languages = ["Python"]
    >>> question_dicts = [{"test_code": "...", "entry_fn_name": "add", "entry_cls_name": "Solution", "test_case_ids": ["0", "1"], "test_list": "..."}]
    >>> pass_at_k, results = bc_eval.compute(predictions=predictions, languages=languages, question_dicts=question_dicts, k=[1, 2])
    >>> print(pass_at_k)
    {'pass@1': 0.5, 'pass@2': 1.0}
"""


_WARNING = """
################################################################################
!!!WARNING!!!
################################################################################
The "bc_eval" metric executes untrusted model-generated code in Python.
Although it is highly unlikely that model-generated code will do something
overtly malicious in response to this test suite, model-generated code may act
destructively due to a lack of model capability or alignment.
Users are strongly encouraged to sandbox this evaluation suite so that it
does not perform destructive actions on their host or network. For more
information on how OpenAI sandboxes its code, see the paper "Evaluating Large
Language Models Trained on Code" (https://arxiv.org/abs/2107.03374).
Once you have read this disclaimer and taken appropriate precautions,
set the environment variable HF_ALLOW_CODE_EVAL="1". Within Python you can do this
with:
>>> import os
>>> os.environ["HF_ALLOW_CODE_EVAL"] = "1"
################################################################################\
"""

_QUESTION_INFO_KEYS = {
    "entry_fn_name",
    "entry_cls_name",
    "test_code",
    "test_list",
    "test_case_ids",
}


def make_file_and_command(qid, idx, pred, question, working_dir, timeout_override=None):
    """Writes a prediction into the question's test harness and builds the commands to run it."""
    file_name = f"pred.{question['extension']}"
    pred_dir = working_dir.joinpath(idx)
    pred_dir.mkdir(parents=True)
    pred_file = pred_dir.joinpath(file_name)
    with pred_file.open("w") as f:
        code = question["test_code"].replace("PLACEHOLDER_CODE_BODY", pred)
        code = code.replace("PLACEHOLDER_FN_NAME", question["entry_fn_name"])
        code = code.replace("PLACEHOLDER_CLS_NAME", question["entry_cls_name"])
        f.write(code)

    commands = []
    for cmd, t in zip(question["commands"], question["timeouts"]):
        commands.append(
            {
                "timeout": t if timeout_override is None else timeout_override,
                "command": [c if c != "__FILENAME__" else file_name for c in cmd],
            }
        )

    return {"qid": qid, "idx": idx, "commands": commands, "cwd": pred_dir}


def _write_preds(preds, languages, language_timeout, question_dicts, tmp_dir):
    """Writes every prediction to disk and collects the execution commands."""
    commands = []
    question_id_to_dict = {}

    for pred_list, l, q_dict in tqdm(
        zip(preds, languages, question_dicts), desc="Setup", total=len(preds)
    ):
        qid = len(question_id_to_dict)
        q_dict['language'] = l
        question_id_to_dict[qid] = q_dict
        for p in pred_list:
            commands.append(
                make_file_and_command(
                    qid=qid,
                    idx=str(len(commands)),
                    pred=p,
                    question=q_dict,
                    timeout_override=language_timeout.get(l),
                    working_dir=tmp_dir,
                )
            )

    return question_id_to_dict, commands


@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class BabelCodeEval(evaluate.Metric):
    def _info(self):
        # Schema of the question_info dicts (not currently part of the declared features).
        list_keys = ["timeouts", "commands", "test_case_ids"]
        question_info_type = {
            k: datasets.Value(dtype="string")
            for k in _QUESTION_INFO_KEYS
            if k not in list_keys
        }
        question_info_type["test_case_ids"] = datasets.Value("string")
        question_info_type["commands"] = datasets.Sequence(datasets.Value("string"))
        question_info_type["timeouts"] = datasets.Sequence(datasets.Value("int32"))

        return evaluate.MetricInfo(
            # This is the description that will appear on the metrics page.
            description=_DESCRIPTION,
            citation=_CITATION,
            inputs_description=_KWARGS_DESCRIPTION,
            # This defines the format of each prediction and reference
            features=datasets.Features(
                {
                    "predictions": datasets.Sequence(datasets.Value("string")),
                    "languages": datasets.Value("string"),
                }
            ),
            homepage="https://github.com/google-research/babelcode",
            codebase_urls=["https://github.com/google-research/babelcode"],
            reference_urls=["https://github.com/google-research/babelcode"],
        )

    def _compute(
        self,
        predictions,
        languages,
        question_dicts,
        k=[1, 10, 100],
        num_workers=4,
        language_timeout=None,
    ):
        """Returns the scores"""

        if os.getenv("HF_ALLOW_CODE_EVAL", 0) != "1":
            raise ValueError(_WARNING)

        language_timeout = language_timeout or {}

        with tempfile.TemporaryDirectory() as tmp_dir:
            working_dir = Path(tmp_dir)
            question_map, pred_commands = _write_preds(
                preds=predictions,
                languages=languages,
                language_timeout=language_timeout,
                question_dicts=question_dicts,
                tmp_dir=working_dir,
            )

            results = execute_predictions(
                pred_commands,
                num_workers=num_workers,
                max_task_per_child=5,
                garbage_collection_freq=500,
            )

        all_results, q_passes, q_pct = _eval_predictions(results, question_map)

        assert len(q_passes) == len(q_pct)
        metrics = {}
        for lang in q_passes:
            metrics.update(
                _calculate_metrics(lang, q_passes[lang], q_pct[lang], k_vals=k)
            )
        return metrics, all_results


def _eval_single_pred(result, test_ids, num_expected_commands):
    """Parses a single execution result and returns (outcome, num_passed, per-test results)."""
    test_case_results = {k: "MISSING" for k in test_ids}
    if len(result["results"]) != num_expected_commands:
        return "HAD_ERROR", 0, test_case_results

    last_result = result["results"][-1]
    if last_result.timed_out:
        return "TIMED_OUT", 0, test_case_results
    elif last_result.return_code != 0:
        return "HAD_ERROR", 0, test_case_results
    elif not last_result.stdout:
        return "HAD_ERROR", 0, test_case_results

    for match in STDOUT_PARSE_REGEX.findall(last_result.stdout):
        idx, test_result = match
        if idx in test_ids:
            if test_case_results[idx] != "MISSING":
                return "UNKNOWN_ERROR", 0, test_case_results
            test_case_results[idx] = test_result.strip()

    did_test_fail = False
    had_error = False
    num_passed = 0
    for r in test_case_results.values():
        if r == "PASSED":
            num_passed += 1
        elif r == "FAILED":
            did_test_fail = True
        else:
            had_error = True

    if had_error:
        return "HAD_ERROR", num_passed, test_case_results
    if did_test_fail:
        return "FAILED", num_passed, test_case_results

    return "PASSED", num_passed, test_case_results


def _eval_predictions(pred_results, question_map):
    """Evaluates every execution result and groups outcomes by language and question."""
    out = []
    question_results = defaultdict(lambda: defaultdict(list))
    question_pct_pass = defaultdict(lambda: defaultdict(list))

    for p in pred_results:
        question = question_map[p["qid"]]
        test_cases = question["test_case_ids"]
        num_expected_commands = len(question["commands"])

        outcome, num_passed, test_case_results = _eval_single_pred(
            p, test_ids=test_cases, num_expected_commands=num_expected_commands
        )

        p["results"] = [dataclasses.asdict(r) for r in p["results"]]
        p["test_cases"] = test_case_results
        p["outcome"] = outcome

        lang = question['language']
        question_results[lang][p["qid"]].append(
            num_passed == len(test_case_results)
        )
        question_pct_pass[lang][p["qid"]].append(
            num_passed / len(test_case_results)
        )

        out.append(p)

    return out, question_results, question_pct_pass


def _calculate_metrics(lang, q_passed, q_pcts, k_vals):
    """Computes pass@k and the mean percent of tests passed for a single language."""
    assert len(q_passed) == len(q_pcts)

    num_samples = np.zeros(len(q_passed))
    num_correct = np.zeros(len(q_passed))
    pcts_passed = np.zeros(len(q_passed))
    for i, (k, v) in enumerate(q_passed.items()):
        num_samples[i] = len(v)
        num_correct[i] = sum(v)
        pcts_passed[i] = np.mean(q_pcts[k])

    out = {
        f'{lang}/pass@{k}': estimate_pass_at_k(num_samples, num_correct, k).mean()
        for k in k_vals
    }
    out[f'{lang}/mean_pct_pass'] = np.mean(pcts_passed)

    return out


def estimate_pass_at_k(num_samples, num_correct, k):
    """Estimates pass@k of each problem and returns them in an array."""

    def estimator(n: int, c: int, k: int) -> float:
        """Calculates 1 - comb(n - c, k) / comb(n, k)."""
        if n - c < k:
            return 1.0
        return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

    if isinstance(num_samples, int):
        num_samples_it = itertools.repeat(num_samples, len(num_correct))
    else:
        assert len(num_samples) == len(num_correct)
        num_samples_it = iter(num_samples)

    return np.array(
        [estimator(int(n), int(c), k) for n, c in zip(num_samples_it, num_correct)]
    )
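As a quick, hand-checked illustration of `estimate_pass_at_k` above (an editorial example, not part of the committed file): with two questions and two samples each, where only the first question has a passing sample, the per-question estimates for `k=1` are 0.5 and 0.0, so the reported `pass@1` is their mean, 0.25.

```python
import numpy as np

# Two questions, two samples each; the first has one correct sample, the second none.
num_samples = np.array([2, 2])
num_correct = np.array([1, 0])

per_question = estimate_pass_at_k(num_samples, num_correct, 1)  # array([0.5, 0. ])
reported_pass_at_1 = per_question.mean()                        # 0.25
```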
execution.py
ADDED
@@ -0,0 +1,145 @@
import datetime
import gc
import multiprocessing as mp
import pathlib
import subprocess
from dataclasses import dataclass
from typing import Dict, List

from tqdm import tqdm


@dataclass
class CommandResult:
    return_code: int
    runtime: float
    stdout: str
    stderr: str
    timed_out: bool


def safe_execute(
    command_to_run: List[str],
    working_dir: pathlib.Path,
    timeout: int = 10,
) -> CommandResult:
    """Executes a command safely.

    Args:
        command_to_run: The command to run.
        working_dir: The working directory to run it in.
        timeout: Timeout in seconds.

    Returns:
        The result of executing the command.
    """
    timed_out = False
    return_code = -1
    runtime = timeout
    stderr = None
    stdout = None
    start_time = datetime.datetime.now()
    execution_process = subprocess.Popen(
        command_to_run,
        cwd=str(working_dir),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    try:
        outputs = execution_process.communicate(timeout=timeout)

        stdout, stderr = outputs
        stdout = stdout.decode('utf-8')
        stderr = stderr.decode('utf-8')
        runtime = (datetime.datetime.now() - start_time).total_seconds()
        return_code = execution_process.returncode
    except subprocess.TimeoutExpired:
        timed_out = True
        runtime = timeout
    finally:
        execution_process.kill()

    return CommandResult(
        return_code=return_code,
        runtime=runtime,
        stderr=stderr,
        stdout=stdout,
        timed_out=timed_out,
    )


def execute_code(sample: Dict):
    """Execute a file of code.

    Args:
        sample: The sample to run.

    Returns:
        The execution result.
    """
    file_path = sample["cwd"]
    working_dir_for_execution = (
        file_path.parent if file_path.is_file() else file_path
    )
    working_dir_for_execution = working_dir_for_execution.resolve().absolute()
    timed_out = False
    failed = False
    results = []
    for command in sample['commands']:
        res = safe_execute(
            command['command'],
            working_dir=working_dir_for_execution,
            timeout=command['timeout'],
        )
        results.append(res)
        # Stop at the first command that times out or returns a non-zero exit code.
        if res.timed_out:
            timed_out = True
            break
        if res.return_code != 0:
            failed = True
            break
    return {
        "qid": sample['qid'],
        "idx": sample["idx"],
        "file_path": str(file_path.absolute().resolve()),
        "results": results,
        "failed": failed,
        "timed_out": timed_out,
    }


def execute_predictions(
    predictions: List[Dict],
    num_workers: int = 1,
    max_task_per_child: int = 1,
    garbage_collection_freq: int = 500,
):
    """Execute a list of predictions in a specific language.

    Args:
        predictions: List of predictions.
        num_workers: The number of workers to use.
        max_task_per_child: The maximum tasks run per child before it is killed.
        garbage_collection_freq: How often to run garbage collection.

    Returns:
        The array of raw execution results.
    """

    # Consume the pool's results here so we can show a progress bar as well.
    num_to_complete = len(predictions)
    num_completed = 0
    results = []
    with mp.Pool(num_workers, maxtasksperchild=max_task_per_child) as pool:
        for result in tqdm(
            pool.imap_unordered(execute_code, predictions),
            total=num_to_complete,
            desc="Executing",
        ):
            num_completed += 1

            results.append(result)

            if num_completed % garbage_collection_freq == 0:
                gc.collect()
        # Cleanup pool
        pool.close()
        pool.terminate()
    return results
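For reference, `execute_predictions` consumes the dicts built by `make_file_and_command` in `bc_eval.py`. Below is a minimal sketch of calling it directly, assuming the prediction file already exists on disk (the path and command are hypothetical):

```python
from pathlib import Path

from execution import execute_predictions  # assumes execution.py is importable

pred = {
    "qid": 0,
    "idx": "0",
    "cwd": Path("/tmp/preds/0"),  # hypothetical directory containing pred.py
    "commands": [{"timeout": 10, "command": ["python", "pred.py"]}],
}

# Returns a list of dicts with qid, idx, file_path, the per-command results,
# and the failed / timed_out flags seen in the README examples.
results = execute_predictions([pred], num_workers=1)
```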
requirements.txt
ADDED
@@ -0,0 +1 @@
git+https://github.com/huggingface/evaluate@main