In this notebook, we will show how to compute the AUC-PR (Area Under the Precision-Recall Curve) metric in sklearn, tf1, and tf2.

The goal is both to explain their APIs and to compare how their results differ under different parameters.


Above is an example of a PR curve. As you can see, for a multi-class classification task there is one PR curve per class, as well as their micro-average PR curve. If you are mostly interested in the performance on a specific class, you can focus on its per-class metric; otherwise, you can focus on the micro/macro average.

How is a per-class PR curve plotted? Basically, we enumerate a threshold $t$ from 0 to 1 (for example, 0, 0.1, …, 1.0). For each example's prediction score (probability) $y_i \in y_\text{pred}$, if $y_i > t$, we consider the example positive; otherwise, it is considered negative. By comparing against the ground truths $y_\text{true}$, we can compute precision and recall at that threshold. Plotting one (precision, recall) pair for each $t$ eventually yields the PR curve.
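The threshold sweep above can be sketched in plain NumPy. This is a minimal illustration with a fixed threshold grid; the actual libraries instead derive thresholds from the unique prediction scores, so treat the helper name and conventions here as assumptions for demonstration only:

```python
import numpy as np

def pr_points(y_true, y_prob, thresholds):
  """Return one (precision, recall) pair per threshold t (illustrative sketch)."""
  points = []
  for t in thresholds:
    y_hat = (y_prob > t).astype(int)  # predict positive when score > t
    tp = np.sum((y_hat == 1) & (y_true == 1))
    fp = np.sum((y_hat == 1) & (y_true == 0))
    fn = np.sum((y_hat == 0) & (y_true == 1))
    # Convention: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    points.append((float(precision), float(recall)))
  return points

y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])
print(pr_points(y_true, y_prob, thresholds=np.linspace(0, 1, 11)))
```

At $t = 0$ every example is predicted positive (precision 0.5, recall 1.0 on this toy data); at $t = 1$ nothing is (recall 0.0).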

By computing the area under the PR curve, we obtain the AUC (Area Under Curve).
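As a minimal sketch of the area computation, here is the trapezoidal rule applied by hand to a few hypothetical (recall, precision) points (the points are made up for illustration):

```python
import numpy as np

# Hypothetical (recall, precision) points on a curve, sorted by increasing recall.
recalls = np.array([0.0, 0.5, 1.0])
precisions = np.array([1.0, 1.0, 0.5])

# Trapezoidal rule: area of each trapezoid between consecutive recall values.
widths = np.diff(recalls)                       # interval widths along recall
heights = (precisions[:-1] + precisions[1:]) / 2  # average precision per interval
auc = float(np.sum(widths * heights))
print(auc)  # 0.875
```

This matches what `sklearn.metrics.auc(recalls, precisions)` computes: it applies the trapezoidal rule to whatever x/y points it is given.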

The AUC computation is affected by several parameters. Here we will focus on one of them:

summation_method: Specifies the Riemann summation method used: ‘trapezoidal’ [default], which applies the trapezoidal rule; ‘careful_interpolation’, a variant of it differing only by a more correct interpolation scheme for PR-AUC (interpolating true/false positives but not the ratio that is precision); ‘minoring’, which applies left summation for increasing intervals and right summation for decreasing intervals; ‘majoring’, which does the opposite. Note that ‘careful_interpolation’ is strictly preferred to ‘trapezoidal’ (to be deprecated soon) as it applies the same method for ROC, and a better one (see Davis & Goadrich 2006 for details) for the PR curve.
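To make the summation variants concrete, here is a NumPy sketch of the under- and over-estimating Riemann sums on the same hypothetical points as above: per interval, ‘minoring’ effectively takes the lower of the two precisions and ‘majoring’ the higher, while the trapezoidal rule takes their average (this is a simplified numeric illustration, not TensorFlow's actual implementation):

```python
import numpy as np

# Hypothetical (recall, precision) points, sorted by increasing recall.
recalls = np.array([0.0, 0.5, 1.0])
precisions = np.array([1.0, 1.0, 0.5])

widths = np.diff(recalls)
lo = np.minimum(precisions[:-1], precisions[1:])    # 'minoring': underestimate
hi = np.maximum(precisions[:-1], precisions[1:])    # 'majoring': overestimate
mid = (precisions[:-1] + precisions[1:]) / 2        # 'trapezoidal': average

print(float(np.sum(widths * lo)))   # 0.75
print(float(np.sum(widths * hi)))   # 1.0
print(float(np.sum(widths * mid)))  # 0.875
```

Note the sandwich: minoring ≤ trapezoidal ≤ majoring on the same points.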

import sklearn.metrics as skmetrics
import tensorflow.keras.metrics as tf2metrics
import tensorflow.compat.v1 as tf1
import numpy as np
import pandas as pd

def sk_auc_pr(y_true, y_prob):
  precisions, recalls, thresholds = skmetrics.precision_recall_curve(y_true, y_prob)
  return skmetrics.auc(recalls, precisions)

def tf1_auc_pr_careful(y_true, y_prob):
  # Build a fresh graph per call so the metric's local variables do not collide.
  with tf1.Graph().as_default():
    metric_val, update_op = tf1.metrics.auc(
        y_true, y_prob, curve="PR", summation_method="careful_interpolation")
    with tf1.Session() as sess:
      sess.run(tf1.local_variables_initializer())
      sess.run(update_op)
      return sess.run(metric_val)

def tf1_auc_pr_trapezoidal(y_true, y_prob):
  with tf1.Graph().as_default():
    metric_val, update_op = tf1.metrics.auc(
        y_true, y_prob, curve="PR", summation_method="trapezoidal")
    with tf1.Session() as sess:
      sess.run(tf1.local_variables_initializer())
      sess.run(update_op)
      return sess.run(metric_val)

def tf2_auc_pr(y_true, y_prob):
  m = tf2metrics.AUC(curve="PR")
  m.update_state(y_true, y_prob)
  return m.result().numpy()

y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])

rows = []

for f in [sk_auc_pr, tf1_auc_pr_careful, tf1_auc_pr_trapezoidal, tf2_auc_pr]:
  rows.append({
      "Method": f.__name__,
      "AUC PR": f(y_true, y_prob)
  })

df_stats = pd.DataFrame(rows)

df_stats[["Method", "AUC PR"]]
   Method                  AUC PR
0  sk_auc_pr               0.791667
1  tf1_auc_pr_careful      0.797267
2  tf1_auc_pr_trapezoidal  0.791666
3  tf2_auc_pr              0.797267

As we can see, sk_auc_pr is essentially identical to tf1_auc_pr_trapezoidal, while tf2_auc_pr matches tf1_auc_pr_careful. In other words, sklearn and tf1.metrics.auc both use the trapezoidal summation method by default, while tf2's AUC defaults to the careful interpolation scheme. The careful interpolation method is the recommended one.