arthurvqin committed on
Commit
175817c
1 Parent(s): 44c4d29

Add my new, shiny module.

Files changed (3)
  1. README.md +66 -21
  2. pr_auc.py +65 -57
  3. requirements.txt +2 -1
README.md CHANGED
@@ -1,50 +1,95 @@
  ---
  title: PR AUC
- datasets:
- -
- tags:
- - evaluate
- - metric
- description: "TODO: add a description here"
  sdk: gradio
  sdk_version: 3.19.1
  app_file: app.py
  pinned: false
  ---

  # Metric Card for PR AUC
-
- ***Module Card Instructions:*** *Fill out the following subsections. Feel free to take a look at existing metric cards if you'd like examples.*
-
  ## Metric Description
- *Give a brief overview of this metric, including what task(s) it is usually used for, if any.*

  ## How to Use
  *Give general statement of how to use the metric*
-
- *Provide simplest possible example for using the metric*

  ### Inputs
- *List all input arguments in the format below*
  - **input_field** *(type): Definition of input, with explanation if necessary. State any default value(s).*
-
  ### Output Values

- *Explain what this metric outputs and provide an example of what the metric output looks like. Modules should return a dictionary with one or multiple key-value pairs, e.g. {"bleu" : 6.02}*

- *State the range of possible values that the metric's output can take, as well as what in that range is considered good. For example: "This metric can take on any value between 0 and 100, inclusive. Higher scores are better."*

  #### Values from Popular Papers
- *Give examples, preferrably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*

  ### Examples
- *Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*

  ## Limitations and Bias
- *Note any known limitations or biases that the metric has, with links and references if possible.*

  ## Citation
- *Cite the source where this metric was introduced.*

  ## Further References
- *Add any useful further references.*
  ---
  title: PR AUC
+ emoji: 🤗
+ colorFrom: blue
+ colorTo: red
  sdk: gradio
  sdk_version: 3.19.1
  app_file: app.py
  pinned: false
+ tags:
+ - evaluate
+ - metric
+ description: "This metric computes the area under the curve (AUC) for the Precision-Recall Curve (PR). It summarizes a precision-recall curve as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight."
  ---

  # Metric Card for PR AUC
  ## Metric Description
+ This metric computes the area under the curve (AUC) for the Precision-Recall Curve (PR). It summarizes a precision-recall curve as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight.
+
+ You should use this metric:
+ - when your data is heavily imbalanced, as discussed extensively in the article by Takaya Saito and Marc Rehmsmeier. The intuition is the following: since PR AUC focuses mainly on the positive class (PPV and TPR), it cares less about the frequent negative class.
+ - when you care more about the positive class than the negative class. If you care more about the positive class, and hence PPV and TPR, you should use the precision-recall curve and PR AUC (average precision).
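To make the "weighted mean of precisions" description concrete, here is a short illustrative sketch (not part of the module) that recovers the score from scikit-learn's `precision_recall_curve` and checks it against `average_precision_score`:

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

refs = np.array([0, 0, 1, 1])
pred_scores = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, _ = precision_recall_curve(refs, pred_scores)
# recall decreases towards 0 along the returned arrays, so the recall
# increments are -np.diff(recall); each increment is weighted by the
# precision achieved at that threshold.
ap_by_hand = -np.sum(np.diff(recall) * precision[:-1])

print(round(ap_by_hand, 2))                                   # 0.83
print(round(average_precision_score(refs, pred_scores), 2))   # 0.83, same quantity
```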

  ## How to Use
  *Give general statement of how to use the metric*
+ This metric requires references and prediction scores:
+ ```python
+ >>> average_precision_score = evaluate.load("pr_auc")
+ >>> refs = np.array([0, 0, 1, 1])
+ >>> pred_scores = np.array([0.1, 0.4, 0.35, 0.8])
+ >>> results = average_precision_score.compute(references=refs, prediction_scores=pred_scores)
+ >>> print(round(results['average_precision'], 2))
+ 0.83
+ ```
+ The default implementation of this metric is binary. For the multiclass use case, see the examples below.

  ### Inputs
  - **input_field** *(type): Definition of input, with explanation if necessary. State any default value(s).*
+ Args:
+ - **`references`** (array-like of shape (n_samples,) or (n_samples, n_classes)): True binary labels or binary label indicators.
+ - **`prediction_scores`** (array-like of shape (n_samples,) or (n_samples, n_classes)): Model predictions. Target scores, which can either be probability estimates of the positive class, confidence values, or non-thresholded measures of decisions (as returned by decision_function on some classifiers).
+ - **`average`** (`str`): Type of averaging performed; ignored in the binary use case. Defaults to `'macro'`. Options are (see the sketch after this list):
+ - `'micro'`: Calculate metrics globally by considering each element of the label indicator matrix as a label.
+ - `'macro'`: Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
+ - `'weighted'`: Calculate metrics for each label, and find their average, weighted by support (i.e. the number of true instances for each label).
+ - `'samples'`: Calculate metrics for each instance, and find their average. Only works with the multilabel use case.
+ - `None`: No average is calculated, and scores for each class are returned. Only works with the multilabel use case.
+ - **`pos_label`** (`int`, `float`, `bool` or `str`): The label of the positive class. Only applied to binary y_true. For multilabel-indicator y_true, pos_label is fixed to 1. Defaults to `1`.
+ - **`sample_weight`** (array-like of shape (n_samples,)): Sample weights. Defaults to None.
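Since `_compute` forwards `average`, `pos_label`, and `sample_weight` directly to `sklearn.metrics.average_precision_score`, the effect of `average` can be illustrated with scikit-learn alone. The one-hot data below is an illustrative sketch, not taken from the module:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Multilabel-indicator (one-hot) references -- the format scikit-learn
# accepts for the non-binary case.
refs = np.array([[1, 0, 0],
                 [1, 0, 0],
                 [0, 1, 0],
                 [0, 1, 0],
                 [0, 0, 1],
                 [0, 0, 1]])
pred_scores = np.array([[0.7, 0.2, 0.1],
                        [0.4, 0.3, 0.3],
                        [0.1, 0.8, 0.1],
                        [0.2, 0.3, 0.5],
                        [0.4, 0.4, 0.2],
                        [0.1, 0.2, 0.7]])

per_class = average_precision_score(refs, pred_scores, average=None)
macro = average_precision_score(refs, pred_scores, average="macro")
print(per_class)  # one score per class, roughly [0.83, 1.0, 1.0]
print(macro)      # their unweighted mean, roughly 0.94
```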
  ### Output Values
+ This metric returns a dict containing the `average_precision` score. The score is a `float`.

+ The output therefore generally takes the following format:
+ ```python
+ {'average_precision': 0.778}
+ ```

+ PR AUC scores can take on any value between `0` and `1`, inclusive. Higher scores are better.

  #### Values from Popular Papers

  ### Examples
+ Example 1, the **binary** use case:
+ ```python
+ >>> average_precision_score = evaluate.load("pr_auc")
+ >>> refs = np.array([0, 0, 1, 1])
+ >>> pred_scores = np.array([0.1, 0.4, 0.35, 0.8])
+ >>> results = average_precision_score.compute(references=refs, prediction_scores=pred_scores)
+ >>> print(round(results['average_precision'], 2))
+ 0.83
+ ```
+
+ Example 2, the **multiclass** use case:
+ ```python
+ >>> average_precision_score = evaluate.load("pr_auc")
+ >>> refs = np.array([0, 0, 1, 1, 2, 2])
+ >>> pred_scores = np.array([[0.7, 0.2, 0.1],
+ ... [0.4, 0.3, 0.3],
+ ... [0.1, 0.8, 0.1],
+ ... [0.2, 0.3, 0.5],
+ ... [0.4, 0.4, 0.2],
+ ... [0.1, 0.2, 0.7]])
+ >>> results = average_precision_score.compute(references=refs, prediction_scores=pred_scores)
+ >>> print(round(results['average_precision'], 2))
+ 0.77
+ ```

  ## Limitations and Bias
+

  ## Citation
+

  ## Further References
+ This implementation is a wrapper around the [scikit-learn implementation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html). Much of the documentation here was adapted from the scikit-learn documentation as well.
pr_auc.py CHANGED
@@ -1,4 +1,4 @@
- # Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
  #
  # Licensed under the Apache License, Version 2.0 (the "License");
  # you may not use this file except in compliance with the License.
@@ -11,85 +11,93 @@
  # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  # See the License for the specific language governing permissions and
  # limitations under the License.
- """TODO: Add a description here."""

- import evaluate
  import datasets


- # TODO: Add BibTeX citation
- _CITATION = """\
- @InProceedings{huggingface:module,
- title = {A great new module},
- authors={huggingface, Inc.},
- year={2020}
- }
- """

- # TODO: Add description of the module here
- _DESCRIPTION = """\
- This new module is designed to solve this great ML task and is crafted with a lot of care.
- """

- # TODO: Add description of the arguments of the module here
  _KWARGS_DESCRIPTION = """
- Calculates how good are predictions given some references, using certain scores
  Args:
- predictions: list of predictions to score. Each predictions
- should be a string with tokens separated by spaces.
- references: list of reference for each prediction. Each
- reference should be a string with tokens separated by spaces.
  Returns:
- accuracy: description of the first score,
- another_score: description of the second score,
- Examples:
- Examples should be written in doctest format, and should illustrate how
- to use the function.

- >>> my_new_module = evaluate.load("my_new_module")
- >>> results = my_new_module.compute(references=[0, 1], predictions=[0, 1])
- >>> print(results)
- {'accuracy': 1.0}
  """

- # TODO: Define external resources urls if needed
- BAD_WORDS_URL = "http://url/to/external/resource/bad_words.txt"


  @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
  class PRAUC(evaluate.Metric):
- """TODO: Short description of my evaluation module."""
-
  def _info(self):
- # TODO: Specifies the evaluate.EvaluationModuleInfo object
  return evaluate.MetricInfo(
- # This is the description that will appear on the modules page.
- module_type="metric",
  description=_DESCRIPTION,
  citation=_CITATION,
  inputs_description=_KWARGS_DESCRIPTION,
- # This defines the format of each prediction and reference
- features=datasets.Features({
- 'predictions': datasets.Value('int64'),
- 'references': datasets.Value('int64'),
- }),
- # Homepage of the module for documentation
- homepage="http://module.homepage",
- # Additional links to the codebase or references
- codebase_urls=["http://github.com/path/to/codebase/of/new_module"],
- reference_urls=["http://path.to.reference.url/new_module"]
  )

- def _download_and_prepare(self, dl_manager):
- """Optional: download external resources useful to compute the scores"""
- # TODO: Download external resources if needed
- pass
-
- def _compute(self, predictions, references):
- """Returns the scores"""
- # TODO: Compute the different scores of the module
- accuracy = sum(i == j for i, j in zip(predictions, references)) / len(predictions)
  return {
- "accuracy": accuracy,
  }
+ # Copyright 2023 The HuggingFace Datasets Authors and the current dataset script contributor.
  #
  # Licensed under the Apache License, Version 2.0 (the "License");
  # you may not use this file except in compliance with the License.
  # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  # See the License for the specific language governing permissions and
  # limitations under the License.
+ """PR AUC (average precision) metric."""

  import datasets
+ from sklearn.metrics import average_precision_score

+ import evaluate


+ _DESCRIPTION = """
+ This metric computes the area under the curve (AUC) for the Precision-Recall Curve (PR). It summarizes a precision-recall curve as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight.

+ You should use this metric:
+ - when your data is heavily imbalanced, as discussed extensively in the article by Takaya Saito and Marc Rehmsmeier. The intuition is the following: since PR AUC focuses mainly on the positive class (PPV and TPR), it cares less about the frequent negative class.
+ - when you care more about the positive class than the negative class. If you care more about the positive class, and hence PPV and TPR, you should use the precision-recall curve and PR AUC (average precision).
+ """

  _KWARGS_DESCRIPTION = """
  Args:
+ - references (array-like of shape (n_samples,) or (n_samples, n_classes)): True binary labels or binary label indicators.
+ - prediction_scores (array-like of shape (n_samples,) or (n_samples, n_classes)): Model predictions. Target scores, which can either be probability estimates of the positive class, confidence values, or non-thresholded measures of decisions (as returned by decision_function on some classifiers).
+ - average (`str`): Type of averaging performed; ignored in the binary use case. Defaults to 'macro'. Options are:
+ - `'micro'`: Calculate metrics globally by considering each element of the label indicator matrix as a label.
+ - `'macro'`: Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
+ - `'weighted'`: Calculate metrics for each label, and find their average, weighted by support (i.e. the number of true instances for each label).
+ - `'samples'`: Calculate metrics for each instance, and find their average. Only works with the multilabel use case.
+ - `None`: No average is calculated, and scores for each class are returned. Only works with the multilabel use case.
+ - pos_label (int, float, bool or str): The label of the positive class. Only applied to binary y_true. For multilabel-indicator y_true, pos_label is fixed to 1. Defaults to 1.
+ - sample_weight (array-like of shape (n_samples,)): Sample weights. Defaults to None.
  Returns:
+ average_precision (`float` or array-like of shape (n_classes,)): The average precision score. A single `float` unless `average` is `None`, in which case per-class scores are returned.
+ Examples:
+ Example 1:
+ >>> average_precision_score = evaluate.load("pr_auc")
+ >>> refs = np.array([0, 0, 1, 1])
+ >>> pred_scores = np.array([0.1, 0.4, 0.35, 0.8])
+ >>> results = average_precision_score.compute(references=refs, prediction_scores=pred_scores)
+ >>> print(round(results['average_precision'], 2))
+ 0.83

+ Example 2:
+ >>> average_precision_score = evaluate.load("pr_auc")
+ >>> refs = np.array([0, 0, 1, 1, 2, 2])
+ >>> pred_scores = np.array([[0.7, 0.2, 0.1],
+ ... [0.4, 0.3, 0.3],
+ ... [0.1, 0.8, 0.1],
+ ... [0.2, 0.3, 0.5],
+ ... [0.4, 0.4, 0.2],
+ ... [0.1, 0.2, 0.7]])
+ >>> results = average_precision_score.compute(references=refs, prediction_scores=pred_scores)
+ >>> print(round(results['average_precision'], 2))
+ 0.77
  """

+ _CITATION = """
+ """


  @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
  class PRAUC(evaluate.Metric):
  def _info(self):
  return evaluate.MetricInfo(
  description=_DESCRIPTION,
  citation=_CITATION,
  inputs_description=_KWARGS_DESCRIPTION,
+ features=datasets.Features(
+ {
+ "prediction_scores": datasets.Sequence(datasets.Value("float")),
+ "references": datasets.Value("int32"),
+ }
+ ),
+ reference_urls=["https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html"],
  )

+ def _compute(
+ self,
+ references,
+ prediction_scores,
+ average="macro",
+ sample_weight=None,
+ pos_label=1,
+ ):
  return {
+ "average_precision": average_precision_score(
+ references,
+ prediction_scores,
+ average=average,
+ sample_weight=sample_weight,
+ pos_label=pos_label,
+ )
  }
requirements.txt CHANGED
@@ -1 +1,2 @@
- git+https://github.com/huggingface/evaluate@main
+ git+https://github.com/huggingface/evaluate@main
+ scikit-learn