How to Evaluate Models When the Outcome Is Rare (and Lives Are at Stake)

Trauma Registry and Other Topics

Rare Events

How to evaluate rare-outcome clinical models honestly using calibration, threshold-aware metrics, and decision-relevant performance summaries.

Published: July 1, 2024

Modified: June 9, 2026

Executive Summary

Rare outcomes are where models are most seductive — and most dangerous.

In healthcare, trauma, and safety-critical systems, outcomes like:

- death,
- hemorrhage,
- catastrophic deterioration,
- treatment failure,

are often rare by design.

Standard evaluation metrics were not built for this setting.

This post explains:

- why commonly reported metrics fail,
- how evaluation choices encode ethical tradeoffs,
- and how to evaluate rare-event models honestly using R.

Rare Outcomes Change Everything

When outcomes are rare:

- accuracy becomes meaningless,
- AUROC becomes misleading,
- averages hide catastrophic failure,
- and false negatives dominate harm.

A model that is “95% accurate” can still miss most patients who matter.

Accuracy Is the Most Misleading Metric You Can Report

Consider a condition with 2% prevalence.

A model that predicts “no event” for everyone is:

- 98% accurate,
- completely useless,
- and ethically indefensible.

Accuracy rewards models for ignoring rare events.
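The arithmetic is easy to check in base R. This is a minimal sketch, assuming a hypothetical cohort of 1,000 patients with 20 events (2% prevalence); the object names are illustrative.

```r
# Hypothetical cohort: 1,000 patients, 20 events (2% prevalence).
truth <- c(rep(1, 20), rep(0, 980))

# A "model" that predicts no event for everyone.
pred_class <- rep(0, length(truth))

# Share of correct calls.
accuracy <- mean(pred_class == truth)

# Share of true events that were flagged.
sensitivity <- sum(pred_class == 1 & truth == 1) / sum(truth == 1)

accuracy     # 0.98
sensitivity  # 0
```

The accuracy figure is driven entirely by the 980 patients without the event; not one of the 20 patients with the event is caught.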

Why AUROC Fails You When Events Are Rare

AUROC answers this question:

“Can the model rank positives higher than negatives?”

It does not answer:

- how many events you’ll actually catch,
- how many false alarms you’ll create,
- whether thresholds are usable.

In rare outcomes, AUROC can look excellent while recall is terrible. That is one reason precision–recall summaries are often more informative when prevalence is low (Davis and Goadrich 2006; King and Zeng 2001).

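library(yardstick)

# `truth` must be a factor. yardstick treats the first level as the event
# by default; pass event_level = "second" if the event is the second level.
roc_auc(preds, truth, .pred)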

This number alone is not actionable.
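To see why, here is a deliberately contrived sketch in base R. The data are invented for illustration: every event is ranked above every non-event, so AUROC is perfect, yet most events sit below an assumed alert threshold of 0.15.

```r
# Contrived cohort: 1,000 patients, 20 events (2% prevalence).
truth <- c(rep(0, 980), rep(1, 20))
score <- c(
  seq(0.001, 0.050, length.out = 980),  # non-events: all low predicted risk
  seq(0.060, 0.100, length.out = 18),   # most events: ranked higher, still low
  0.30, 0.35                            # two events with clearly elevated risk
)

# AUROC from ranks (Mann-Whitney form): it depends only on ordering.
n1 <- sum(truth == 1)
n0 <- sum(truth == 0)
auroc <- (sum(rank(score)[truth == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

# Sensitivity at an assumed alert threshold of 0.15.
sens_at_015 <- sum(score > 0.15 & truth == 1) / n1

auroc        # 1
sens_at_015  # 0.1
```

Ranking is perfect, but only 2 of 20 events would trigger an alert at that threshold. AUROC cannot see this because it never looks at the scores themselves, only their order.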

Precision–Recall Curves Are Necessary (but Not Sufficient)

Precision–recall (PR) curves focus on:

positive cases,
tradeoffs clinicians actually face.

pr_auc(preds, truth, .pred)

PR curves improve visibility, but they still:

average over thresholds,
hide operational constraints,
obscure asymmetric harm.

They’re better — not enough.
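One way to look past the averaged summary is to compute individual PR points at thresholds someone might actually use. The sketch below uses base R and a toy dataset; `pr_points` is a hypothetical helper, not a yardstick function.

```r
# Precision and recall at specific thresholds (hypothetical helper).
pr_points <- function(score, truth, thresholds) {
  do.call(rbind, lapply(thresholds, function(t) {
    alert <- score >= t
    tp <- sum(alert & truth == 1)
    data.frame(
      threshold = t,
      precision = if (any(alert)) tp / sum(alert) else NA_real_,
      recall    = tp / sum(truth == 1)
    )
  }))
}

# Toy data: 10 patients, 2 events.
score <- c(0.90, 0.80, 0.40, 0.30, 0.25, 0.22, 0.15, 0.10, 0.05, 0.02)
truth <- c(1, 0, 1, 0, 0, 0, 0, 0, 0, 0)

pts <- pr_points(score, truth, thresholds = c(0.50, 0.35, 0.01))
pts

# Precision when everyone is alerted equals prevalence: the no-skill baseline.
baseline <- mean(truth)
```

At the lowest threshold every patient is alerted, so precision falls to the prevalence (0.2 in this toy example). That is the floor any PR summary should be read against.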

Thresholds Are Ethical Decisions, Not Technical Ones

In rare outcomes:

false negatives often cost lives,
false positives cost time, resources, trust.

Choosing a threshold answers:

“Who are we willing to miss?”

That is not a statistical question. It’s an ethical one.

Evaluate at Decision-Relevant Thresholds

Stop reporting only global metrics.

Evaluate:

sensitivity at fixed false-positive rates,
PPV at clinically feasible alert volumes,
performance under resource constraints.

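library(dplyr)

# Assumes `truth` is logical (TRUE = event) and `.pred` is predicted risk.
preds |>
  mutate(alert = .pred > 0.15) |>
  summarise(
    sensitivity = sum(alert & truth) / sum(truth),  # events caught / all events
    ppv         = sum(alert & truth) / sum(alert),  # events among alerts
    alert_rate  = mean(alert)                       # share of patients alerted
  )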

If a threshold can’t be used, it doesn’t matter how good the curve looks.
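When the binding constraint is workload rather than a probability cutoff, the same idea can be expressed as an alert budget. This is a minimal base R sketch; `metrics_at_alert_budget` and the toy data are hypothetical, and `k` stands for however many alerts the team can realistically act on.

```r
# Sensitivity and PPV when only the k highest-risk patients can be alerted.
# `score` is predicted risk and `truth` is coded 0/1 (illustrative names).
metrics_at_alert_budget <- function(score, truth, k) {
  alerted <- rank(-score, ties.method = "first") <= k  # top-k by predicted risk
  tp <- sum(alerted & truth == 1)
  data.frame(
    alerts      = k,
    sensitivity = tp / sum(truth == 1),
    ppv         = tp / k
  )
}

# Toy data: 10 patients, 2 events.
score <- c(0.90, 0.80, 0.40, 0.30, 0.25, 0.20, 0.15, 0.10, 0.05, 0.02)
truth <- c(1, 0, 1, 0, 0, 0, 0, 0, 0, 0)

metrics_at_alert_budget(score, truth, k = 2)  # sensitivity 0.5, ppv 0.5
metrics_at_alert_budget(score, truth, k = 3)  # sensitivity 1, ppv about 0.67
```

Framing the result by budget makes the tradeoff concrete: one more alert per ten patients is the price of catching the second event.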

Calibration Matters More Than Discrimination

In rare outcomes, miscalibration kills trust.

If a model says “20% risk” and the true risk is 2%, clinicians stop listening.

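library(ggplot2)

# Assumes `truth` is coded 0/1 so the smoother estimates the observed event rate.
ggplot(preds, aes(x = .pred, y = truth)) +
  geom_smooth(method = "loess") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +  # perfect calibration
  labs(x = "Predicted risk", y = "Observed event rate")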

A well-calibrated modest model beats a poorly calibrated “strong” one. In clinical prediction, calibration is often the deciding feature for whether a model can actually support action (Van Calster et al. 2016; Steyerberg 2019).
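A plot is easier to trust alongside a table. The sketch below is a simple binned calibration check in base R; `calibration_table`, the bin edges, and the toy data are assumptions for illustration. The data recreate the "20% predicted, 2% observed" situation described above.

```r
# Binned calibration: mean predicted risk vs observed event rate per bin.
calibration_table <- function(pred, truth, breaks) {
  bin <- cut(pred, breaks = breaks, include.lowest = TRUE)
  data.frame(
    bin            = levels(bin),
    n              = as.vector(table(bin)),
    mean_predicted = as.vector(tapply(pred, bin, mean)),
    observed_rate  = as.vector(tapply(truth, bin, mean))
  )
}

# Toy data: two groups of 100 patients, each with 2 events (2% observed),
# but the model assigns 2% risk to one group and 20% to the other.
pred  <- c(rep(0.02, 100), rep(0.20, 100))
truth <- c(rep(c(1, 0), c(2, 98)), rep(c(1, 0), c(2, 98)))

ct <- calibration_table(pred, truth, breaks = c(0, 0.1, 1))
ct

# Observed-to-expected ratio: values far below 1 mean risk is overestimated.
oe_ratio <- sum(truth) / sum(pred)
oe_ratio  # about 0.18
```

The high-risk bin predicts 20% and observes 2%, which is exactly the pattern that teaches clinicians to ignore alerts.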

Decision Curve Analysis Makes Tradeoffs Explicit

Decision curve analysis asks:

“Is using this model better than doing nothing — or everything?”

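library(rmda)

# Assumes `truth` is coded 0/1. fitted.risk = TRUE tells rmda that `.pred`
# is already a predicted risk, so it is not refit through a logistic model.
dca <- decision_curve(
  truth ~ .pred,
  data = preds,
  fitted.risk = TRUE,
  thresholds = seq(0.01, 0.5, by = 0.01)
)
plot_decision_curve(dca)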

This reframes evaluation around net benefit, not elegance (Vickers and Elkin 2006).
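The quantity behind the curve is simple enough to compute by hand. At a threshold probability pt, net benefit is TP/n minus FP/n weighted by pt / (1 - pt). Here is a base R sketch of that formula; `net_benefit` and the toy data are illustrative and not part of rmda.

```r
# Net benefit at threshold probability `pt`:
#   NB = TP/n - (FP/n) * pt / (1 - pt)
net_benefit <- function(pred, truth, pt) {
  n  <- length(truth)
  tp <- sum(pred >= pt & truth == 1)
  fp <- sum(pred >= pt & truth == 0)
  tp / n - (fp / n) * (pt / (1 - pt))
}

# Toy data: 10 patients, 2 events.
pred  <- c(0.90, 0.80, 0.40, 0.30, 0.25, 0.22, 0.15, 0.10, 0.05, 0.02)
truth <- c(1, 0, 1, 0, 0, 0, 0, 0, 0, 0)
pt    <- 0.20

nb_model <- net_benefit(pred, truth, pt)                  # use the model
nb_all   <- net_benefit(rep(1, length(truth)), truth, pt) # treat everyone
nb_none  <- 0                                             # treat no one

nb_model  # 0.1
nb_all    # 0 (up to floating-point error)
```

In this toy case the model beats both default strategies at pt = 0.20. The ratio pt / (1 - pt) is where the relative cost of a false positive enters, which is why the choice of threshold range is a clinical judgment.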

Class Imbalance Is Not a Modeling Problem — It’s a Reality Problem

Oversampling, weighting, and SMOTE (Chawla et al. 2002):

change training behavior,
do not change the real-world base rate.

Evaluation must always return to:

true prevalence,
real alert volumes,
actual workflow capacity.

Never evaluate on artificial prevalence without stating it clearly.
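The effect of an artificial base rate on PPV follows directly from Bayes' rule. Here is a short base R sketch, assuming a hypothetical model with 80% sensitivity and 90% specificity; `ppv_at_prevalence` is an illustrative helper.

```r
# PPV implied by sensitivity, specificity, and prevalence (Bayes' rule).
ppv_at_prevalence <- function(sensitivity, specificity, prevalence) {
  tp <- sensitivity * prevalence
  fp <- (1 - specificity) * (1 - prevalence)
  tp / (tp + fp)
}

ppv_balanced <- ppv_at_prevalence(0.80, 0.90, prevalence = 0.50)
ppv_real     <- ppv_at_prevalence(0.80, 0.90, prevalence = 0.02)

ppv_balanced  # about 0.89 on a 50/50 resampled set
ppv_real      # about 0.14 at a 2% base rate
```

The model is identical in both lines. Only the prevalence changed, and with it roughly six out of seven alerts became false alarms.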

Hierarchy and Rare Events Interact Badly

In healthcare:

rare events cluster,
sites vary,
volumes differ.

Flat models often:

exaggerate site differences,
produce unstable estimates,
mislead benchmarking.

Hierarchical models stabilize rare-event inference.

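library(brms)

# Random intercept per site gives partial pooling across sites.
# Assumes `outcome` is coded 0/1 and `data` is the analysis data frame.
fit <- brm(
  outcome ~ predictor + (1 | site),
  data = data,
  family = bernoulli()
)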
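The stabilizing effect can be seen without fitting a full model. The sketch below is not the brms model above; it is a simplified stand-in that shrinks each site's raw rate toward the pooled rate using a Beta prior with an assumed strength of 50 pseudo-patients. The site counts are invented.

```r
# Invented site-level data: events and patients per site.
events <- c(0, 1, 12)
n      <- c(10, 25, 600)

pooled <- sum(events) / sum(n)  # overall event rate across sites

# Assumed prior strength (pseudo-patients); a real hierarchical model
# would estimate the amount of pooling from the data.
prior_strength <- 50

raw    <- events / n
shrunk <- (events + prior_strength * pooled) / (n + prior_strength)

round(data.frame(n, raw, shrunk), 3)
```

The small sites move the most: a raw rate of 0% from 10 patients and 4% from 25 patients both end up close to the pooled rate, while the 600-patient site barely changes.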

Sensitivity Analysis Is Not Optional

For rare outcomes, ask:

What if prevalence shifts?
What if documentation changes?
What if thresholds move slightly?

If conclusions flip easily, that’s critical information — not failure.

Reporting Rare-Outcome Performance Honestly

An honest report includes:

prevalence,
alert volume at chosen thresholds,
false negatives per 1000 patients,
calibration plots,
decision curves,
explicit tradeoffs.

Anything less is marketing.
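Most of that list can be produced by one small function. This is a base R sketch under stated assumptions: `rare_event_report` is a hypothetical helper, the threshold of 0.15 is illustrative, and the toy cohort is constructed by hand.

```r
# One-row summary of operational performance at a chosen threshold.
rare_event_report <- function(pred, truth, threshold) {
  n     <- length(truth)
  alert <- pred >= threshold
  data.frame(
    threshold       = threshold,
    prevalence      = mean(truth == 1),
    alerts_per_1000 = 1000 * sum(alert) / n,
    missed_per_1000 = 1000 * sum(!alert & truth == 1) / n,  # false negatives
    sensitivity     = sum(alert & truth == 1) / sum(truth == 1),
    ppv             = sum(alert & truth == 1) / sum(alert)
  )
}

# Toy cohort: 1,000 patients, 20 events; 12 events and 60 non-events score high.
pred  <- c(rep(0.30, 12), rep(0.05, 8), rep(0.30, 60), rep(0.05, 920))
truth <- c(rep(1, 20), rep(0, 980))

rep_015 <- rare_event_report(pred, truth, threshold = 0.15)
rep_015
```

For this cohort the report reads: 2% prevalence, 72 alerts and 8 missed events per 1,000 patients, 60% sensitivity, and a PPV of about 17%. Calibration plots and decision curves still need to accompany it.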

Where This Shows Up in AI/ML

Trauma mortality, massive transfusion requirement, and surgical complication are all rare outcomes, typically 5–15% prevalence in DoDTR data, and standard accuracy metrics catastrophically misrepresent model performance in these settings. A model that predicts “no massive transfusion” for every patient achieves 88% accuracy whenever prevalence is 12%, while being useless for the clinical decision MAVEN was built to support; precision–recall curves and F-measure at operationally relevant thresholds are the appropriate evaluation framework. The choice of threshold is itself a clinical decision: in a forward surgical setting with limited blood product availability, the cost of a missed massive transfusion protocol (MTP) activation differs sharply from the cost of an unnecessary activation, and that asymmetry must be encoded in the evaluation metric before any model comparison is meaningful. Evaluating trauma AI on accuracy is not just a methodological error; it is a failure to take the clinical context seriously.

Closing: Evaluation Is a Moral Act

When outcomes are rare:

metrics are not neutral,
thresholds encode values,
and errors have unequal cost.

Model evaluation is not a formality. It is where responsibility lives.

If your evaluation makes rare outcomes look easy, you’ve probably evaluated the wrong thing.

📚 Go Deeper: Rare Events Toolkit

This post is part of the Rare Events Toolkit — a companion reference with PR-AUC templates, calibration plots by site, and reviewer-safe evaluation language for rare clinical outcomes.



This post is part of a broader Trauma Registry and Other Topics Series:

Why Most Clinical Models Fail in the Real World (and How to Fix Them in R)
Audit-Ready Applied Statistics: How to Make Your R Analysis Defensible
Bayesian Models for Clinicians Who Hate Math (But Love Good Decisions)
Missing Data Is the Real Model: Practical Strategies in R
From Registry to Knowledge: How to Analyze Messy Trauma Data Without Lying to Yourself
Why Statistical Significance Is a Terrible Stopping Rule
Hierarchical Models Are Not Optional in Healthcare (Here’s Why)
Prediction ≠ Causation: How to Use Each Correctly in Applied Statistics
How to Evaluate Models When the Outcome Is Rare (and Lives Are at Stake)
Building Clinical Decision Support That Doesn’t Collapse Under Scrutiny
Rare Event Modeling in Clinical Prediction: Why 1% Outcomes Break Your Model (And What to Do in R)
Calibration Under Drift: How Clinical Models Become Confident and Wrong (And How to Monitor It in R)
Audit-Ready Bayesian Workflows: Why Transparency Is a Process, Not a Model Feature
Missing Data in Hierarchical Clinical Models: Why Structure Changes the Problem
MNAR Sensitivity Analysis for Applied Work: What to Do When Missingness Depends on Reality


References

Chawla, Nitesh V., Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. “SMOTE: Synthetic Minority over-Sampling Technique.” Journal of Artificial Intelligence Research 16: 321–57. https://doi.org/10.1613/jair.953.

Davis, Jesse, and Mark Goadrich. 2006. “The Relationship Between Precision-Recall and ROC Curves.” Proceedings of the 23rd International Conference on Machine Learning, 233–40. https://doi.org/10.1145/1143844.1143874.

King, Gary, and Langche Zeng. 2001. “Logistic Regression in Rare Events Data.” Political Analysis 9 (2): 137–63. https://doi.org/10.1093/oxfordjournals.pan.a004868.

Steyerberg, Ewout W. 2019. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. 2nd ed. Springer.

Van Calster, Ben, Daan Nieboer, Yvonne Vergouwe, Bavo De Cock, Michael J. Pencina, and Ewout W. Steyerberg. 2016. “A Calibration Hierarchy for Risk Models Was Defined: From Utopia to Empirical Data.” Journal of Clinical Epidemiology 74: 167–76. https://doi.org/10.1016/j.jclinepi.2015.12.005.

Vickers, Andrew J., and Elena B. Elkin. 2006. “Decision Curve Analysis: A Novel Method for Evaluating Prediction Models.” Medical Decision Making 26 (6): 565–74. https://doi.org/10.1177/0272989X06295361.