Metrics That Matter: Evaluating AI Like a Biostatistician

Applied Statistics

AI and Clinical Decision-Making

A practical overview of discrimination, calibration, precision-recall tradeoffs, ROC curves, and decision-aware model evaluation.

Published: July 15, 2025

Modified: June 9, 2026

Executive Summary

A predictive model is only as useful as the way it is evaluated (Steyerberg 2019; Harrell 2015).

That sounds obvious, but in practice it is one of the easiest places for analysts to go wrong.

A model can look excellent by one metric and disappointing by another. A classifier can achieve high accuracy while missing most of the cases that matter. A model can rank patients well but still produce poorly calibrated probabilities (Brier 1950; Steyerberg 2019). And a model can appear strong in development while failing in deployment because the chosen evaluation metric never matched the real decision problem.

This is why model evaluation matters so much, especially in clinical settings where model utility depends on more than discrimination alone (Vickers and Elkin 2006; Harrell 2015).

In machine learning, metrics are not just scorekeeping. They define what “good” means.

This post introduces:

accuracy,
precision,
recall,
F1 score,
ROC curves,
AUC,
and related evaluation logic for classification,
with a medical-diagnostics style case study (Hanley and McNeil 1982; Davis and Goadrich 2006).

Along the way, I also connect these ideas to regression-style evaluation and to the challenge of imbalanced outcomes.

Model evaluation matters because a model is good not when it predicts well in the abstract, but when it performs well on the metric that matches the real decision problem.

Model Evaluation Begins with the Decision Context

A metric is never just a number.

It is a statement about what kind of error we care about.

For example:

in spam filtering, false positives may annoy users
in medical diagnostics, false negatives may miss disease
in fraud detection, a high false positive rate may overload reviewers
in triage models, ranking high-risk cases correctly may matter more than overall accuracy

This means model evaluation should not begin with:

which metric is standard?

It should begin with:

what kind of mistakes matter most in this setting?

That is why evaluation is not merely technical. It is tied to domain context, cost, and deployment consequences.

Accuracy Is Easy to Understand and Easy to Misuse

Accuracy is the proportion of predictions that are correct.

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

where:

TP = true positives
TN = true negatives
FP = false positives
FN = false negatives

Accuracy is intuitive and often useful. But it becomes misleading when the classes are imbalanced.

For example, if only 5% of patients have a condition, a model that predicts “no disease” for everyone will still achieve 95% accuracy.

That model is useless clinically.

So accuracy is often a starting point, not a sufficient endpoint.
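The 5% example above can be checked in a few lines. This is a minimal sketch on a hypothetical cohort of 1,000 patients (the cohort size and variable names are illustrative, not from a real dataset):

```r
# Hypothetical cohort: 1,000 patients, 5% of whom have the condition
disease <- c(rep(1, 50), rep(0, 950))

# A "classifier" that predicts no disease for everyone
pred_class <- rep(0, length(disease))

accuracy <- mean(pred_class == disease)
recall <- sum(pred_class == 1 & disease == 1) / sum(disease == 1)

c(accuracy = accuracy, recall = recall)
```

Accuracy comes out at 0.95 while recall is 0: the model never finds a single true case.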

A Medical-Diagnostics Example Makes the Problem Concrete

To illustrate, we will simulate a medical-diagnostics style classification problem in which the positive outcome is the minority class (roughly 31% of patients in the sample below).

This is exactly the kind of setting where naïve reliance on accuracy can become misleading.

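library(dplyr)
library(tibble)
library(ggplot2)

# Note: no seed is set in this chunk, so rerunning it will give slightly
# different counts from the output printed in this post.
n <- 500

# Three simulated patient characteristics
metric_df <- tibble::tibble(
  age = rnorm(n, mean = 58, sd = 13),
  biomarker = rnorm(n, mean = 0, sd = 1),
  comorbidity_score = rnorm(n, mean = 0, sd = 1)
) |>
  dplyr::mutate(
    # Logistic model: linear predictor -> true risk -> observed binary outcome
    linpred = -2.8 + 0.03 * age + 1.1 * biomarker + 0.8 * comorbidity_score,
    prob = 1 / (1 + exp(-linpred)),
    disease = rbinom(n, size = 1, prob = prob)
  )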

metric_df |>
  dplyr::summarise(
    n = dplyr::n(),
    prevalence = mean(disease)
  )

# A tibble: 1 × 2
      n prevalence
  <int>      <dbl>
1   500      0.308

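This creates a binary disease outcome along with the probability used to generate it. In what follows, prob stands in for a model's predicted risk. Because it is the true data-generating probability, it represents a best-case model that is perfectly calibrated by construction; a fitted model would usually do somewhat worse.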

Thresholding Probabilities Produces a Confusion Matrix

For many metrics, probabilities must first be turned into class labels using a threshold.

A common threshold is 0.5, though that is not always appropriate.

metric_df <- metric_df |>
  dplyr::mutate(
    # Label a patient as predicted-positive when the risk is at least 0.5
    pred_class = dplyr::if_else(prob >= 0.5, 1, 0)
  )

cm <- table(Predicted = metric_df$pred_class, Observed = metric_df$disease)
cm

         Observed
Predicted   0   1
        0 309  79
        1  37  75

This is the confusion matrix.

It is one of the most important diagnostic objects in classification because many common metrics are built from its four cells.

Precision and Recall Emphasize Different Kinds of Success

Two especially important metrics are precision and recall.

Precision

Precision asks:

among the cases the model predicted as positive, how many were truly positive?

\[ \text{Precision} = \frac{TP}{TP + FP} \]

Recall

Recall asks:

among the truly positive cases, how many did the model successfully identify?

\[ \text{Recall} = \frac{TP}{TP + FN} \]

These are not the same.

A model can have:

high precision but low recall
or high recall but low precision

That tradeoff matters a lot in healthcare and other high-stakes settings.
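To make the tradeoff concrete, here is a minimal sketch that computes precision and recall at three thresholds on freshly simulated data. The seed, sample size, and the helper name precision_recall_at are illustrative assumptions, not part of the case study above.

```r
set.seed(1)  # arbitrary seed, used only for reproducibility

n <- 2000
risk <- plogis(rnorm(n, mean = -1, sd = 1.5))  # simulated predicted risks
disease <- rbinom(n, size = 1, prob = risk)    # outcomes drawn from those risks

# Hypothetical helper: precision and recall at one threshold
precision_recall_at <- function(threshold) {
  pred <- as.integer(risk >= threshold)
  tp <- sum(pred == 1 & disease == 1)
  fp <- sum(pred == 1 & disease == 0)
  fn <- sum(pred == 0 & disease == 1)
  c(threshold = threshold,
    precision = tp / (tp + fp),
    recall = tp / (tp + fn))
}

pr_by_threshold <- t(sapply(c(0.2, 0.5, 0.8), precision_recall_at))
pr_by_threshold
```

Raising the threshold can only shrink the set of flagged patients, so recall never increases as the threshold rises. Precision typically moves the other way, although it is not guaranteed to be monotone in every sample.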

Computing the Core Classification Metrics in R

We can compute the confusion-matrix-based metrics directly.

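# cm is indexed as cm[predicted, observed]
tp <- cm["1", "1"]  # predicted positive, truly positive
tn <- cm["0", "0"]  # predicted negative, truly negative
fp <- cm["1", "0"]  # predicted positive, truly negative
fn <- cm["0", "1"]  # predicted negative, truly positive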

metric_tbl <- tibble::tibble(
  metric = c("Accuracy", "Precision", "Recall", "Specificity"),
  value = c(
    (tp + tn) / sum(cm),
    tp / (tp + fp),
    tp / (tp + fn),
    tn / (tn + fp)
  )
)

metric_tbl

# A tibble: 4 × 2
  metric      value
  <chr>       <dbl>
1 Accuracy    0.768
2 Precision   0.670
3 Recall      0.487
4 Specificity 0.893

This is often the first useful summary table for a classifier.

But even this is only part of the full picture.

F1 Score Balances Precision and Recall

When analysts want a single metric that combines precision and recall, a common choice is the F1 score.

\[ F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]

This is the harmonic mean of precision and recall.

It is useful when both kinds of performance matter and the analyst wants a single summary that penalizes a model for doing poorly on either dimension.

precision <- tp / (tp + fp)
recall <- tp / (tp + fn)

f1 <- 2 * (precision * recall) / (precision + recall)

tibble::tibble(
  metric = "F1 Score",
  value = f1
)

# A tibble: 1 × 2
  metric   value
  <chr>    <dbl>
1 F1 Score 0.564

F1 is often more informative than accuracy in imbalanced classification problems, though it still does not capture everything.
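As a quick hand check, F1 can also be written directly in terms of confusion-matrix counts by substituting the definitions of precision and recall into the harmonic mean. Using the cells from the confusion matrix shown earlier (TP = 75, FP = 37, FN = 79):

\[ F1 = \frac{2\,TP}{2\,TP + FP + FN} = \frac{2 \cdot 75}{2 \cdot 75 + 37 + 79} = \frac{150}{266} \approx 0.564 \]

For comparison, the arithmetic mean of precision and recall here is about \((0.670 + 0.487)/2 \approx 0.578\). The harmonic mean sits below it because it is pulled toward the weaker of the two values, which is what makes F1 penalize poor performance on either dimension.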

Imbalanced Outcomes Change What “Good” Looks Like

When outcomes are imbalanced, the apparent performance of a model can be deceptive.

This is especially common in:

disease detection
adverse event prediction
mortality models
fraud detection
and rare event monitoring

In these settings, a model that predicts the majority class most of the time can still look good by accuracy alone.

This is why metrics like:

recall,
precision,
F1,
and PR curves

often become more important.

A good analyst should always ask:

how common is the event, and does the evaluation metric account for that reality?

ROC Curves Show Performance Across Thresholds

A major limitation of confusion-matrix metrics is that they depend on a single classification threshold.

A ROC curve avoids that by examining performance across many thresholds.

ROC stands for receiver operating characteristic.

The ROC curve plots:

true positive rate, or recall, on the y-axis
false positive rate on the x-axis

This shows how discrimination changes as the threshold shifts.

required_pkgs <- c("pROC")
missing_pkgs <- required_pkgs[
  !vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
]

if (length(missing_pkgs) > 0) {
  stop("Missing packages: ", paste(missing_pkgs, collapse = ", "))
}

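# Set levels and direction explicitly rather than letting pROC guess:
# 0 = control, 1 = case, and higher predicted risk indicates a case.
roc_obj <- pROC::roc(
  response = metric_df$disease,
  predictor = metric_df$prob,
  levels = c(0, 1),
  direction = "<"
)

# legacy.axes = TRUE puts 1 - specificity (the false positive rate) on the x-axis
plot(roc_obj, legacy.axes = TRUE)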

ROC curves are especially useful for understanding classifier discrimination independently of one fixed cutoff.
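To see what a ROC curve is made of, the points can be built by hand in base R. This is a sketch on freshly simulated data (seed, sample size, and threshold grid are arbitrary choices); in practice a package such as pROC handles this for you.

```r
set.seed(2)  # arbitrary seed, used only for reproducibility

n <- 1000
risk <- plogis(rnorm(n, mean = -1, sd = 1.5))
disease <- rbinom(n, size = 1, prob = risk)

thresholds <- seq(0, 1, by = 0.05)

# One (FPR, TPR) point per threshold
roc_points <- data.frame(
  threshold = thresholds,
  tpr = sapply(thresholds, function(t) mean(risk[disease == 1] >= t)),
  fpr = sapply(thresholds, function(t) mean(risk[disease == 0] >= t))
)

head(roc_points)

plot(roc_points$fpr, roc_points$tpr, type = "b",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)  # reference line for a non-informative classifier
```

At a threshold of 0 every patient is flagged, so both rates equal 1. As the threshold rises, both rates fall, and the curve traces how quickly true positives are lost relative to false positives.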

AUC Summarizes Ranking Performance, Not Calibration

The area under the ROC curve, or AUC, is a summary of discrimination.

It can be interpreted as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case.

pROC::auc(roc_obj)

AUC is helpful because it summarizes threshold-free ranking performance.
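The probabilistic interpretation can be verified directly. The sketch below, on simulated data with an arbitrary seed, compares every positive case with every negative case (counting ties as one half) and checks that the result agrees with the rank-sum (Mann-Whitney) form of the AUC.

```r
set.seed(3)  # arbitrary seed, used only for reproducibility

n <- 500
risk <- plogis(rnorm(n, mean = -1, sd = 1.5))
disease <- rbinom(n, size = 1, prob = risk)

pos <- risk[disease == 1]
neg <- risk[disease == 0]

# Compare every positive with every negative; ties count as one half
pairwise <- outer(pos, neg, FUN = function(p, q) (p > q) + 0.5 * (p == q))
auc_pairwise <- mean(pairwise)

# Same quantity from the rank-sum (Mann-Whitney U) form
r <- rank(c(pos, neg))
n_pos <- length(pos)
n_neg <- length(neg)
auc_rank <- (sum(r[seq_len(n_pos)]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

c(pairwise = auc_pairwise, rank_based = auc_rank)
```

Both calculations depend only on the ordering of the scores, which is why AUC says nothing about whether the probabilities themselves are accurate.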

But it also has important limitations.

A model can have:

a high AUC but poor calibration
a high AUC but weak precision at clinically relevant thresholds
or a respectable AUC while still being practically unhelpful in deployment

So AUC is useful, but not sufficient by itself.

Precision-Recall Curves Are Often Better for Rare Events

When the positive class is uncommon, precision-recall curves can be more informative than ROC curves.

Why?

Because precision directly reflects the burden of false positives among predicted positives, which becomes especially important in low-prevalence settings.

This makes PR curves highly relevant in:

diagnostics
screening
alerting systems
and rare-event detection

required_pkgs <- c("PRROC")
missing_pkgs <- required_pkgs[
  !vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
]

if (length(missing_pkgs) > 0) {
  stop("Missing packages: ", paste(missing_pkgs, collapse = ", "))
}

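# PRROC's argument naming: scores.class0 takes the scores of the positive
# (foreground) class, scores.class1 those of the negative (background) class.
pr_obj <- PRROC::pr.curve(
  scores.class0 = metric_df$prob[metric_df$disease == 1],
  scores.class1 = metric_df$prob[metric_df$disease == 0],
  curve = TRUE
)

plot(pr_obj)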

In medical screening settings, PR curves are often especially valuable because they reflect the tradeoff between catching cases and flooding the system with false alarms.
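The dependence of precision on prevalence follows from Bayes' rule. As a small sketch, the hypothetical helper ppv below holds sensitivity and specificity fixed at 0.90 (illustrative values) and varies only how common the condition is:

```r
# Precision (positive predictive value) from sensitivity, specificity, prevalence
ppv <- function(sensitivity, specificity, prevalence) {
  true_pos <- sensitivity * prevalence
  false_pos <- (1 - specificity) * (1 - prevalence)
  true_pos / (true_pos + false_pos)
}

prevalence <- c(0.30, 0.10, 0.01)

ppv_tbl <- data.frame(
  prevalence = prevalence,
  precision = ppv(sensitivity = 0.90, specificity = 0.90, prevalence = prevalence)
)

ppv_tbl
# precision is about 0.794, 0.500, and 0.083 respectively
```

The same test that looks strong at 30% prevalence produces mostly false alarms at 1% prevalence, even though its ROC curve has not changed at all.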

Threshold Choice Should Reflect Clinical or Operational Costs

A classifier does not come with one universally correct threshold.

The appropriate threshold depends on the consequences of different errors.

For example:

in screening, missing a true case may be much worse than flagging a false alarm
in an intensive-care alert system, too many false positives may cause alarm fatigue
in triage, recall may be prioritized over precision
in confirmatory diagnostics, precision may become more important

This is why classification should often be treated as a decision problem, not just a probability-to-label routine.

A model may be strong, but the chosen threshold may still be inappropriate for the use case.
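One way to make this concrete is to attach explicit costs to the two error types. The sketch below assumes calibrated risks, zero cost for correct decisions, and hypothetical costs in which a missed case is nine times as costly as a false alarm. Under those assumptions, the cost-minimizing rule is to flag a patient when the predicted risk is at least cost_fp / (cost_fp + cost_fn).

```r
set.seed(4)  # arbitrary seed, used only for reproducibility

n <- 20000
risk <- plogis(rnorm(n, mean = -1.5, sd = 1.5))  # calibrated by construction
disease <- rbinom(n, size = 1, prob = risk)

cost_fn <- 9  # hypothetical cost of a missed case
cost_fp <- 1  # hypothetical cost of a false alarm

# Total cost of all errors made at a given threshold
total_cost <- function(threshold) {
  pred <- risk >= threshold
  sum(pred & disease == 0) * cost_fp + sum(!pred & disease == 1) * cost_fn
}

threshold_theory <- cost_fp / (cost_fp + cost_fn)  # 0.1 under these costs

thresholds <- seq(0.02, 0.98, by = 0.02)
costs <- sapply(thresholds, total_cost)
threshold_empirical <- thresholds[which.min(costs)]

c(theory = threshold_theory,
  empirical = threshold_empirical,
  cost_at_theory = total_cost(threshold_theory),
  cost_at_half = total_cost(0.5))
```

With these costs the default 0.5 cutoff is far from optimal. The point is not the specific numbers, which are invented, but that the threshold is an output of the cost structure rather than a property of the model.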

Calibration Matters Too, Not Only Discrimination

A model that ranks cases well is not necessarily well calibrated.

Calibration asks:

when the model predicts a probability of 0.70, do about 70% of those cases truly experience the event?

This matters because many real-world decisions depend on the probability itself, not only the ranking.

A highly discriminative model can still be miscalibrated if its probabilities are systematically too high or too low.

That is especially important in healthcare, where probability estimates may guide escalation, counseling, or treatment decisions.

Discrimination and calibration are related, but they are not the same.
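A small simulation shows how the two can come apart. In this sketch (arbitrary seed; the helper names auc, brier, and calibration_table are illustrative), risks are calibrated by construction and then distorted by cubing them. Cubing preserves the ordering, so the AUC is unchanged, but the probabilities no longer match observed event rates.

```r
set.seed(5)  # arbitrary seed, used only for reproducibility

n <- 5000
risk <- runif(n)                             # calibrated by construction
disease <- rbinom(n, size = 1, prob = risk)
risk_distorted <- risk^3                     # same ranking, wrong scale

# Rank-based AUC (Mann-Whitney form)
auc <- function(score, y) {
  r <- rank(score)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Brier score: mean squared difference between probability and outcome
brier <- function(score, y) mean((score - y)^2)

# Mean predicted risk versus observed event rate within equal-width bins
calibration_table <- function(score, y, bins = 5) {
  bin <- cut(score, breaks = seq(0, 1, length.out = bins + 1),
             include.lowest = TRUE)
  data.frame(
    bin = levels(bin),
    mean_predicted = as.vector(tapply(score, bin, mean)),
    observed_rate = as.vector(tapply(y, bin, mean))
  )
}

c(auc_original = auc(risk, disease),
  auc_distorted = auc(risk_distorted, disease))
c(brier_original = brier(risk, disease),
  brier_distorted = brier(risk_distorted, disease))

calibration_table(risk, disease)
calibration_table(risk_distorted, disease)
```

In the first table, mean predicted risk and observed rate should track each other closely. In the second, the distorted probabilities understate the observed rate in every bin, even though the model discriminates exactly as well as before.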

Regression Models Need Metrics Too

Although this post focuses on classification, the broader lesson applies to regression as well.

Common regression metrics include:

mean squared error (MSE)
root mean squared error (RMSE)
mean absolute error (MAE)
\(R^2\)

These summarize different aspects of predictive error.

For example:

RMSE penalizes large errors heavily
MAE is often more robust to outliers
\(R^2\) reflects explained variance, but not necessarily deployment usefulness

The general principle remains the same: the choice of metric should match the real modeling objective.
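The difference between RMSE and MAE is easiest to see with a single large error. This sketch uses two made-up sets of prediction errors that differ only in their last value:

```r
# Hypothetical prediction errors (illustrative values)
errors_clean <- c(1, -1, 1, -1, 1)
errors_outlier <- c(1, -1, 1, -1, 10)

rmse <- function(e) sqrt(mean(e^2))
mae <- function(e) mean(abs(e))

error_tbl <- data.frame(
  scenario = c("no outlier", "one large error"),
  rmse = c(rmse(errors_clean), rmse(errors_outlier)),
  mae = c(mae(errors_clean), mae(errors_outlier))
)

error_tbl
```

Without the outlier both metrics equal 1. With it, MAE rises to 2.8 while RMSE rises to about 4.56, because squaring gives the one large error much more weight.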

A Small Regression Metric Example Completes the Picture

To keep that broader perspective visible, here is a quick regression example.

set.seed(20260315)

reg_df <- tibble::tibble(
  y_true = rnorm(150, mean = 10, sd = 3)
) |>
  dplyr::mutate(
    y_pred = y_true + rnorm(150, mean = 0, sd = 1.5)
  )

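reg_metrics <- reg_df |>
  dplyr::summarise(
    mse = mean((y_true - y_pred)^2),
    rmse = sqrt(mean((y_true - y_pred)^2)),
    mae = mean(abs(y_true - y_pred)),
    # Squared Pearson correlation: it ignores systematic bias in y_pred and
    # can differ from 1 - SSE/SST computed on the predictions themselves.
    r_squared = cor(y_true, y_pred)^2
  )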

reg_metrics

# A tibble: 1 × 4
    mse  rmse   mae r_squared
  <dbl> <dbl> <dbl>     <dbl>
1  2.17  1.47  1.18     0.830

Even here, no single metric fully captures model usefulness. The right one depends on the error structure that matters most.
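One subtlety is worth a quick illustration: \(R^2\) computed as a squared correlation measures only linear association, so it cannot detect systematic bias. In this toy sketch (made-up values), every prediction is off by exactly 5 units:

```r
# Toy data: predictions are perfectly correlated with the truth but biased by 5
y_true <- c(1, 2, 3, 4, 5)
y_pred <- y_true + 5

# Squared Pearson correlation
r2_correlation <- cor(y_true, y_pred)^2

# Coefficient of determination computed on the predictions: 1 - SSE / SST
r2_prediction <- 1 - sum((y_true - y_pred)^2) / sum((y_true - mean(y_true))^2)

c(r2_correlation = r2_correlation, r2_prediction = r2_prediction)
```

The squared correlation is 1, suggesting a perfect model, while the prediction-based version is -11.5, meaning the predictions are far worse than simply guessing the mean. Which definition is reported should be stated explicitly.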

Fair Model Assessment Requires More Than One Metric

A common mistake is to report only the metric that makes the model look best.

That is not robust evaluation.

In most serious applications, analysts should report multiple metrics because models can behave differently across dimensions such as:

discrimination
calibration
threshold-specific error
sensitivity to imbalance
subgroup performance

This is especially important in healthcare and other high-stakes settings, where a model can appear strong overall while performing poorly for a subgroup or on the specific error type that matters most.
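A pooled metric can hide exactly this kind of gap. The sketch below uses invented counts for two hypothetical subgroups, A and B, to show how overall recall can look acceptable while one subgroup fares much worse:

```r
# Hypothetical true-positive and false-negative counts by subgroup
counts <- data.frame(
  subgroup = c("A", "B"),
  tp = c(90, 10),
  fn = c(10, 10)
)

# Recall within each subgroup
counts$recall <- counts$tp / (counts$tp + counts$fn)

# Recall pooled over both subgroups
overall_recall <- sum(counts$tp) / sum(counts$tp + counts$fn)

counts
overall_recall
```

Pooled recall is about 0.83, driven by the larger subgroup A (0.90), while subgroup B sits at 0.50. Reporting only the pooled number would miss that half of the true cases in B go undetected.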

Good evaluation is therefore multidimensional.

Metrics Should Match Deployment, Not Just Development

One of the most important practical questions is:

how will this model actually be used?

If the model will rank patients for review, AUC may be relevant. If the model will trigger an alert, threshold-based recall and precision matter. If the model will estimate risk for counseling or allocation, calibration becomes central.

This is why analysts should avoid evaluating a model only in the style most convenient for development.

The metrics should reflect the deployment logic.

Otherwise, a model may look impressive in development and still fail operationally.

A Practical Checklist for Applied Work

Before reporting a model’s performance, ask:

Is the outcome balanced or imbalanced?
Does accuracy actually mean anything useful here?
Are precision and recall more relevant than overall correctness?
Have I examined ROC and PR behavior?
Does the metric reflect threshold-free ranking, threshold-based decisions, or calibrated probability estimation?
Would subgroup-specific evaluation reveal hidden weaknesses?
Does the evaluation metric match the real deployment use case?

These questions usually matter more than squeezing out one more decimal point of AUC.

Where This Shows Up in AI/ML

Epic’s sepsis prediction model (formerly the Sepsis Early Warning Tool, now embedded in Deterioration Index) has been reported in multiple external validations to have AUCs in the 0.74–0.83 range. Calibration analyses at several institutions, however, reportedly showed that the model’s stated probabilities were systematically too high: a “60% risk” alert corresponded to observed event rates closer to 20–30%, which misleads clinicians about how aggressively to escalate.

The distinction between discrimination (ranking high-risk patients above low-risk patients) and calibration (outputting accurate absolute probabilities) is the difference between a tool that correctly sorts a triage line and a tool whose numerical output can safely guide treatment decisions.

In mortality modeling based on the Department of Defense Trauma Registry (DoDTR), a model with strong AUC but poor calibration is particularly dangerous because downstream resource allocation decisions, such as blood product prepositioning and surgical team activation, depend on the probability itself, not just the rank order.

Closing: Good Evaluation Means Measuring What Actually Matters

Model evaluation remains one of the most important parts of machine learning because a model is only useful if its performance is judged in the right way.

Accuracy is simple, but often insufficient. Precision and recall clarify different types of classification success. F1 balances them when both matter. ROC curves and AUC summarize discrimination across thresholds. PR curves become especially important when positives are rare.

And beyond all of these lies the broader principle:

the best metric is the one that matches the real decision problem.

That is why evaluating AI like a biostatistician is so valuable. It keeps the focus on consequences, not just scores.

Metrics matter because every performance number quietly encodes a judgment about what kind of model error we are willing to live with.

📚 Go Deeper: Calibration Toolkit

This post is part of the Calibration Toolkit — a companion reference with confusion matrix templates, ROC and PR curve code, calibration plot scaffolds, and threshold selection guidance for clinical prediction models.


Series Callout

This post is part of a broader Applied Statistics for AI and Clinical Decision-Making Series:

Probability fundamentals for machine learning
Random variables and expectation
Common probability distributions
Central Limit Theorem
Law of Large Numbers
Sampling methods for Biostats and ML
Hypothesis testing in the age of AI
Confidence intervals
Maximum likelihood estimation
Bayesian inference
Linear regression
Logistic regression
Generalized linear models
Analysis of variance
Principal component analysis
Cluster analysis
Time series analysis
Survival analysis
Non-parametric methods
Bias-variance tradeoff
Regularization
Cross-validation
Information theory
Optimization techniques
Linear algebra basics
Calculus for ML
Monte Carlo methods
Dimensionality curse and reduction techniques
Model evaluation metrics
Ensemble methods


References

Brier, Glenn W. 1950. “Verification of Forecasts Expressed in Terms of Probability.” Monthly Weather Review 78 (1): 1–3. https://doi.org/10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2.

Davis, Jesse, and Mark Goadrich. 2006. “The Relationship Between Precision-Recall and ROC Curves.” Proceedings of the 23rd International Conference on Machine Learning, 233–40. https://doi.org/10.1145/1143844.1143874.

Hanley, James A., and Barbara J. McNeil. 1982. “The Meaning and Use of the Area Under a Receiver Operating Characteristic (ROC) Curve.” Radiology 143 (1): 29–36. https://doi.org/10.1148/radiology.143.1.7063747.

Harrell, Frank E., Jr. 2015. Regression Modeling Strategies. 2nd ed. Springer.

Steyerberg, Ewout W. 2019. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. 2nd ed. Springer.

Vickers, Andrew J., and Elena B. Elkin. 2006. “Decision Curve Analysis: A Novel Method for Evaluating Prediction Models.” Medical Decision Making 26 (6): 565–74. https://doi.org/10.1177/0272989X06295361.