title: PR AUC
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
tags:
  - evaluate
  - metric
description: >-
  This metric computes the area under the curve (AUC) for the Precision-Recall
  Curve (PR). It summarizes a precision-recall curve as the weighted mean of
  precisions achieved at each threshold, with the increase in recall from the
  previous threshold used as the weight.

Metric Card for PR AUC

Metric Description

This metric computes the area under the curve (AUC) for the Precision-Recall Curve (PR). It summarizes a precision-recall curve as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight.

You should use this metric:

  • when your data is heavily imbalanced. This case is discussed extensively in the article by Takaya Saito and Marc Rehmsmeier; the intuition is that since PR AUC focuses mainly on the positive class (PPV and TPR), it cares less about the frequent negative class.
  • when you care more about the positive class than the negative class. If the positive class, and hence PPV and TPR, is what matters to you, go with the precision-recall curve and PR AUC (average precision); see the sketch below.
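As a rough illustration of the first point (the labels and scores below are made up), a ranking that looks comfortable under ROC AUC on an imbalanced dataset can still have mediocre precision on the rare positive class, which PR AUC makes visible. The sketch calls scikit-learn directly, which this module wraps:

>>> import numpy as np
>>> from sklearn.metrics import average_precision_score, roc_auc_score
>>> refs = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])          # 8 negatives, 2 positives
>>> pred_scores = np.array([0.8, 0.7, 0.5, 0.4, 0.3, 0.2, 0.15, 0.1, 0.9, 0.6])
>>> print(roc_auc_score(refs, pred_scores))                   # looks comfortable
0.875
>>> print(average_precision_score(refs, pred_scores))         # reveals weaker precision on the positives
0.75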

How to Use

This metric requires references and prediction scores:

>>> import numpy as np
>>> import evaluate
>>> average_precision_score = evaluate.load("pr_auc")
>>> refs = np.array([0, 0, 1, 1])
>>> pred_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> results = average_precision_score.compute(references=refs, prediction_scores=pred_scores)
>>> print(round(results['average_precision'], 2))
0.83
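The result above matches the definition given earlier: average precision is the weighted mean of precisions, with the increase in recall from the previous threshold as the weight. The sketch below re-derives that sum by hand from scikit-learn's precision_recall_curve (this is an illustration of the formula, not this module's internal code):

>>> import numpy as np
>>> from sklearn.metrics import precision_recall_curve
>>> refs = np.array([0, 0, 1, 1])
>>> pred_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> precision, recall, _ = precision_recall_curve(refs, pred_scores)
>>> ap = -np.sum(np.diff(recall) * precision[:-1])        # sum of (R_n - R_{n-1}) * P_n
>>> print(round(ap, 2))
0.83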

The default implementation of this metric is binary. For the multiclass use case, see the examples below.

Inputs

  • references (array-like of shape (n_samples,) or (n_samples, n_classes)): True binary labels or binary label indicators.
  • prediction_scores (array-like of shape (n_samples,) or (n_samples, n_classes)): Target scores, which can be probability estimates of the positive class, confidence values, or non-thresholded measures of decisions (as returned by decision_function on some classifiers).
  • average (str): Type of averaging performed on the per-class scores; ignored in the binary use case. Defaults to 'macro'. Options are (see the usage sketch after this list):
    • 'micro': Calculate metrics globally by considering each element of the label indicator matrix as a label.
    • 'macro': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
    • 'weighted': Calculate metrics for each label, and find their average, weighted by support (i.e. the number of true instances for each label).
    • 'samples': Calculate metrics for each instance, and find their average. Only works with the multilabel use case.
    • None: No average is calculated, and the score for each class is returned. Only works with the multilabel use case.
  • pos_label (int, float, bool or str): The label of the positive class. Only applied to binary references; for multilabel-indicator references, pos_label is fixed to 1.
  • sample_weight (array-like of shape (n_samples,)): Sample weights. Defaults to None.
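As a usage sketch for those arguments (the labels and scores below are made up; only the call pattern is the point), multilabel references with an explicit average might look like this:

>>> import numpy as np
>>> import evaluate
>>> pr_auc = evaluate.load("pr_auc")
>>> refs = np.array([[1, 0, 1],
...                  [0, 1, 0],
...                  [1, 1, 0]])                 # binary label indicators, one column per class
>>> pred_scores = np.array([[0.9, 0.2, 0.6],
...                         [0.3, 0.7, 0.2],
...                         [0.8, 0.6, 0.1]])
>>> results = pr_auc.compute(references=refs,
...                          prediction_scores=pred_scores,
...                          average="micro")    # or "macro", "weighted", "samples", None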

Output Values

This metric returns a dict containing the average_precision score. The score is a float.

The output therefore generally takes the following format:

{'average_precision': 0.778}

PR AUC scores can take on any value between 0 and 1, inclusive.
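A useful baseline when reading these values: a no-skill classifier that assigns the same score to every example gets an average precision equal to the positive-class prevalence. A quick check with scikit-learn (the data here is made up for illustration):

>>> import numpy as np
>>> from sklearn.metrics import average_precision_score
>>> refs = np.array([0, 0, 0, 1])                        # 25% positive class
>>> constant_scores = np.array([0.5, 0.5, 0.5, 0.5])     # no-skill: same score for every example
>>> print(average_precision_score(refs, constant_scores))
0.25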

Values from Popular Papers

Examples

Example 1, the binary use case:

>>> import numpy as np
>>> import evaluate
>>> average_precision_score = evaluate.load("pr_auc")
>>> refs = np.array([0, 0, 1, 1])
>>> pred_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> results = average_precision_score.compute(references=refs, prediction_scores=pred_scores)
>>> print(round(results['average_precision'], 2))
0.83

Example 2, the multiclass use case:

>>> average_precision_score = evaluate.load("pr_auc")
>>> refs = np.array([0, 0, 1, 1, 2, 2])
>>> pred_scores = np.array([[0.7, 0.2, 0.1],
...                         [0.4, 0.3, 0.3],
...                         [0.1, 0.8, 0.1],
...                         [0.2, 0.3, 0.5],
...                         [0.4, 0.4, 0.2],
...                         [0.1, 0.2, 0.7]])
>>> results = average_precision_score.compute(references=refs, prediction_scores=pred_scores)
>>> print(round(results['average_precision'], 2))
0.77

Limitations and Bias

Citation

Further References

This implementation is a wrapper around scikit-learn's average_precision_score; much of the documentation here was adapted from the scikit-learn documentation as well.