guydav committed
Commit 2fd78bc · 1 Parent(s): 4fcd593

Filled out the model card

Files changed (2):
  1. README.md +121 -17
  2. restrictedpython_code_eval.py +2 -2
README.md CHANGED
@@ -1,11 +1,11 @@
  ---
  title: RestrictedPython Code Eval
  datasets:
- -
  tags:
  - evaluate
  - metric
- description: "TODO: add a description here"
  sdk: gradio
  sdk_version: 3.19.1
  app_file: app.py
@@ -14,37 +14,141 @@ pinned: false

  # Metric Card for RestrictedPython Code Eval

- ***Module Card Instructions:*** *Fill out the following subsections. Feel free to take a look at existing metric cards if you'd like examples.*
-
  ## Metric Description
- *Give a brief overview of this metric, including what task(s) it is usually used for, if any.*

  ## How to Use
- *Give general statement of how to use the metric*

- *Provide simplest possible example for using the metric*

  ### Inputs
- *List all input arguments in the format below*
- - **input_field** *(type): Definition of input, with explanation if necessary. State any default value(s).*

  ### Output Values

- *Explain what this metric outputs and provide an example of what the metric output looks like. Modules should return a dictionary with one or multiple key-value pairs, e.g. {"bleu" : 6.02}*

- *State the range of possible values that the metric's output can take, as well as what in that range is considered good. For example: "This metric can take on any value between 0 and 100, inclusive. Higher scores are better."*

- #### Values from Popular Papers
- *Give examples, preferrably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*

  ### Examples
- *Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*

  ## Limitations and Bias
- *Note any known limitations or biases that the metric has, with links and references if possible.*

  ## Citation
- *Cite the source where this metric was introduced.*

  ## Further References
- *Add any useful further references.*

  ---
  title: RestrictedPython Code Eval
  datasets:
+ - N/A (eval module only)
  tags:
  - evaluate
  - metric
+ description: "Same logic as the built-in `code_eval`, but compiling and running the code using `RestrictedPython`"
  sdk: gradio
  sdk_version: 3.19.1
  app_file: app.py

  # Metric Card for RestrictedPython Code Eval

  ## Metric Description
+ A code-based evaluation metric, with the same logic as [`code_eval`](https://huggingface.co/spaces/evaluate-metric/code_eval).

  ## How to Use
+ ```python
+ from evaluate import load
+ code_eval = load("RestrictedPython_code_eval")
+ test_cases = ["assert add(2,3)==5"]
+ candidates = [["def add(a,b): return a*b", "def add(a, b): return a+b"]]
+ pass_at_k, results = code_eval.compute(references=test_cases, predictions=candidates, k=[1, 2], use_safe_builtins=True)
+ ```

+ N.B.
+ This metric exists to run untrusted model-generated code. Users are strongly encouraged not to do so outside of a robust security sandbox. Once you have taken the necessary precautions, set the `HF_ALLOW_CODE_EVAL` environment variable before running the metric, and use it at your own risk:
+ ```python
+ import os
+ os.environ["HF_ALLOW_CODE_EVAL"] = "1"
+ ```

  ### Inputs
+ The following arguments are inherited from the built-in `code_eval`:
+
+ - **`predictions`** (`List[List[str]]`): a list of candidates to evaluate. Each candidate should be a list of strings with several code candidates to solve the problem.
+
+ - **`references`** (`List[str]`): a list with a test for each prediction. Each test should evaluate the correctness of a code candidate.
+
+ - **`k`** (`List[int]`): number of code candidates to consider in the evaluation. The default value is `[1, 10, 100]`.
+
+ - **`num_workers`** (`int`): the number of workers used to evaluate the candidate programs. The default value is `4`.
+
+ - **`timeout`** (`float`): the maximum time a candidate program may run before it is considered a timeout. The default value is `3.0` (i.e. 3 seconds).
+
+ In addition, this metric supports three additional arguments, specifying which sets of default builtins should be made available:
+
+ - **`use_safe_builtins`** (`bool`): whether or not to allow the usage of [`RestrictedPython.safe_builtins`](https://github.com/zopefoundation/RestrictedPython/blob/c31c133844ac2308f5cc930e934a7227a2a6a77b/src/RestrictedPython/Guards.py#L23). Defaults to `True`.
+
+ - **`use_limited_builtins`** (`bool`): whether or not to allow the usage of [`RestrictedPython.limited_builtins`](https://github.com/zopefoundation/RestrictedPython/blob/c31c133844ac2308f5cc930e934a7227a2a6a77b/src/RestrictedPython/Limits.py#L14), which provides limited implementations of `range`, `list`, and `tuple`. Defaults to `True`.
+
+ - **`use_utility_builtins`** (`bool`): whether or not to allow the usage of [`RestrictedPython.utility_builtins`](https://github.com/zopefoundation/RestrictedPython/blob/c31c133844ac2308f5cc930e934a7227a2a6a77b/src/RestrictedPython/Utilities.py#L19), which includes the `string`, `math`, and `random` modules as well as `set`, among others. Defaults to `True`.
+
+ Since these additional arguments are optional, this metric can be used as a drop-in replacement for `code_eval`; the sketch below spells out all three flags explicitly.
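+
+ A minimal usage sketch: the flag values shown are the documented defaults, so omitting them gives the same behaviour, and the expected output assumes the single candidate passes its test:
+
+ ```python
+ import os
+ os.environ["HF_ALLOW_CODE_EVAL"] = "1"  # see the warning above
+
+ from evaluate import load
+
+ code_eval = load("RestrictedPython_code_eval")
+ test_cases = ["assert add(2,3)==5"]
+ candidates = [["def add(a, b): return a+b"]]
+
+ # Explicitly toggling the RestrictedPython-specific arguments; setting
+ # use_utility_builtins=False would hide `math`, `random`, etc. from candidates.
+ pass_at_k, results = code_eval.compute(
+     references=test_cases,
+     predictions=candidates,
+     k=[1],
+     use_safe_builtins=True,
+     use_limited_builtins=True,
+     use_utility_builtins=True,
+ )
+ print(pass_at_k)  # expected: {'pass@1': 1.0}
+ ```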

  ### Output Values

+ Identical to `code_eval`, this metric outputs two things:
+
+ `pass_at_k`: a dictionary with the pass rates for each k value defined in the arguments.
+
+ `results`: a dictionary with granular results of each unit test.
+
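+ For orientation, the two return values have roughly the following shape. This is an illustrative sketch that follows the built-in `code_eval`; treat the exact field names as an assumption rather than a guaranteed schema:
+
+ ```python
+ # Sketch of the assumed output shape, not produced by running the metric.
+ pass_at_k = {"pass@1": 0.5, "pass@2": 1.0}
+ results = {
+     0: [  # keyed by task index, one (completion_id, details) pair per candidate
+         (0, {"task_id": 0, "passed": False, "result": "failed: ...", "completion_id": 0}),
+         (1, {"task_id": 0, "passed": True, "result": "passed", "completion_id": 1}),
+     ],
+ }
+ ```
+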
+ #### Values from Popular Papers
+ The [original CODEX paper](https://arxiv.org/pdf/2107.03374.pdf) reported that the CODEX-12B model had a pass@k score of 28.8% at `k=1`, 46.8% at `k=10` and 72.3% at `k=100`. However, since the CODEX model is not open source, it is hard to verify these numbers.

  ### Examples
+ Adapted from the `code_eval` metric card:
+
+ Full match at `k=1`:
+
+ ```python
+ from evaluate import load
+ code_eval = load("RestrictedPython_code_eval")
+ test_cases = ["assert add(2,3)==5"]
+ candidates = [["def add(a, b): return a+b"]]
+ pass_at_k, results = code_eval.compute(references=test_cases, predictions=candidates, k=[1])
+ print(pass_at_k)
+ {'pass@1': 1.0}
+ ```
+
+ No match at `k=1`:
+
+ ```python
+ from evaluate import load
+ code_eval = load("RestrictedPython_code_eval")
+ test_cases = ["assert add(2,3)==5"]
+ candidates = [["def add(a,b): return a*b"]]
+ pass_at_k, results = code_eval.compute(references=test_cases, predictions=candidates, k=[1])
+ print(pass_at_k)
+ {'pass@1': 0.0}
+ ```
+
+ Partial match at `k=1`, full match at `k=2`:
+
+ ```python
+ from evaluate import load
+ code_eval = load("RestrictedPython_code_eval")
+ test_cases = ["assert add(2,3)==5"]
+ candidates = [["def add(a, b): return a+b", "def add(a,b): return a*b"]]
+ pass_at_k, results = code_eval.compute(references=test_cases, predictions=candidates, k=[1, 2])
+ print(pass_at_k)
+ {'pass@1': 0.5, 'pass@2': 1.0}
+ ```

  ## Limitations and Bias
+ From the original `code_eval` metric card:
+
+ As per the warning included in the metric code itself:
+ > This program exists to execute untrusted model-generated code. Although it is highly unlikely that model-generated code will do something overtly malicious in response to this test suite, model-generated code may act destructively due to a lack of model capability or alignment. Users are strongly encouraged to sandbox this evaluation suite so that it does not perform destructive actions on their host or network. For more information on how OpenAI sandboxes its code, see the accompanying paper. Once you have read this disclaimer and taken appropriate precautions, uncomment the following line and proceed at your own risk:
+
+ More information about the limitations of the code can be found on the [Human Eval GitHub repository](https://github.com/openai/human-eval).
+
+ Additionally, this metric does not currently allow for custom `RestrictedPython` policies, so any code that depends on non-default libraries or packages may fail for that reason; see the sketch below for an example.
+
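+ A sketch of that failure mode, using a hypothetical `mean_of` task (not part of any benchmark). The candidate is logically correct, but the default builtins are assumed not to provide `__import__`, so the `import` is expected to fail inside the restricted environment:
+
+ ```python
+ from evaluate import load
+
+ code_eval = load("RestrictedPython_code_eval")
+ test_cases = ["assert mean_of([1, 2, 3]) == 2"]
+ # `import statistics` relies on builtins outside the default policies, so this
+ # candidate is expected to error out and score 0.0 despite being correct.
+ candidates = [["import statistics\ndef mean_of(xs): return statistics.mean(xs)"]]
+ pass_at_k, results = code_eval.compute(references=test_cases, predictions=candidates, k=[1])
+ print(pass_at_k)  # expected: {'pass@1': 0.0}
+ ```
+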
+ **TODO**: Add a `use_custom_builtins` argument that allows users to specify their own `RestrictedPython` policy. See the RestrictedPython [documentation](https://restrictedpython.readthedocs.io/en/latest/usage/policy.html#implementing-a-policy) for additional details.
+
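+ For context, this is roughly what a custom policy looks like on the RestrictedPython side. The argument above does not exist yet; the snippet below is a sketch of the library's documented usage (a builtins mapping passed to the restricted globals), not part of this metric's API:
+
+ ```python
+ from RestrictedPython import compile_restricted, safe_builtins
+
+ source = "result = sum(values)"
+ byte_code = compile_restricted(source, filename="<candidate>", mode="exec")
+
+ # A custom policy is essentially a builtins mapping: start from safe_builtins
+ # and expose extra names (here, `sum`) to the restricted code.
+ restricted_globals = {"__builtins__": {**safe_builtins, "sum": sum}}
+ local_vars = {"values": [1, 2, 3]}
+ exec(byte_code, restricted_globals, local_vars)
+ print(local_vars["result"])  # 6
+ ```
+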

  ## Citation
+ Based on the original `code_eval` metric, which cites:
+ ```bibtex
+ @misc{chen2021evaluating,
+ title={Evaluating Large Language Models Trained on Code},
+ author={Mark Chen and Jerry Tworek and Heewoo Jun and Qiming Yuan \
+ and Henrique Ponde de Oliveira Pinto and Jared Kaplan and Harri Edwards \
+ and Yuri Burda and Nicholas Joseph and Greg Brockman and Alex Ray \
+ and Raul Puri and Gretchen Krueger and Michael Petrov and Heidy Khlaaf \
+ and Girish Sastry and Pamela Mishkin and Brooke Chan and Scott Gray \
+ and Nick Ryder and Mikhail Pavlov and Alethea Power and Lukasz Kaiser \
+ and Mohammad Bavarian and Clemens Winter and Philippe Tillet \
+ and Felipe Petroski Such and Dave Cummings and Matthias Plappert \
+ and Fotios Chantzis and Elizabeth Barnes and Ariel Herbert-Voss \
+ and William Hebgen Guss and Alex Nichol and Alex Paino and Nikolas Tezak \
+ and Jie Tang and Igor Babuschkin and Suchir Balaji and Shantanu Jain \
+ and William Saunders and Christopher Hesse and Andrew N. Carr \
+ and Jan Leike and Josh Achiam and Vedant Misra and Evan Morikawa \
+ and Alec Radford and Matthew Knight and Miles Brundage and Mira Murati \
+ and Katie Mayer and Peter Welinder and Bob McGrew and Dario Amodei \
+ and Sam McCandlish and Ilya Sutskever and Wojciech Zaremba},
+ year={2021},
+ eprint={2107.03374},
+ archivePrefix={arXiv},
+ primaryClass={cs.LG}
+ }
+ ```

  ## Further References
+ - The original `code_eval` metric: https://huggingface.co/spaces/evaluate-metric/code_eval
+ - RestrictedPython: https://restrictedpython.readthedocs.io/en/latest/index.html
restrictedpython_code_eval.py CHANGED
@@ -47,7 +47,7 @@ year={2020}

  # TODO: Add description of the module here
  _DESCRIPTION = """\
- This module tries to extend the built in `code_eval` module to use restricted python.
  """


@@ -69,7 +69,7 @@ Returns:
  pass_at_k: dict with pass rates for each k
  results: dict with granular results of each unittest
  Examples:
- >>> code_eval = evaluate.load("code_eval")
  >>> test_cases = ["assert add(2,3)==5"]
  >>> candidates = [["def add(a,b): return a*b", "def add(a, b): return a+b"]]
  >>> pass_at_k, results = code_eval.compute(references=test_cases, predictions=candidates, k=[1, 2])


  # TODO: Add description of the module here
  _DESCRIPTION = """\
+ This module implements the same logic as the baseline `code_eval` module but using RestrictedPython.
  """


  pass_at_k: dict with pass rates for each k
  results: dict with granular results of each unittest
  Examples:
+ >>> code_eval = evaluate.load("RestrictedPython_code_eval")
  >>> test_cases = ["assert add(2,3)==5"]
  >>> candidates = [["def add(a,b): return a*b", "def add(a, b): return a+b"]]
  >>> pass_at_k, results = code_eval.compute(references=test_cases, predictions=candidates, k=[1, 2])