How to Evaluate Models When the Outcome Is Rare (and Lives Are at Stake)
Executive Summary
Rare outcomes are where models are most seductive — and most dangerous.
In healthcare, trauma, and safety-critical systems, outcomes like:
- death,
- hemorrhage,
- catastrophic deterioration,
- treatment failure,
are often rare by design.
Standard evaluation metrics were not built for this setting.
This post explains:
- why commonly reported metrics fail,
- how evaluation choices encode ethical tradeoffs,
- and how to evaluate rare-event models honestly using R.
Rare Outcomes Change Everything
When outcomes are rare:
- accuracy becomes meaningless,
- AUROC becomes misleading,
- averages hide catastrophic failure,
- and false negatives dominate harm.
A model that is “95% accurate” can still miss most patients who matter.
Accuracy Is the Most Misleading Metric You Can Report
Consider a condition with 2% prevalence.
A model that predicts “no event” for everyone is:
- 98% accurate,
- completely useless,
- and ethically indefensible.
Accuracy rewards models for ignoring rare events.
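The arithmetic is easy to check. Here is a minimal base-R sketch, assuming a hypothetical cohort of 10,000 patients at 2% prevalence (the cohort size is an illustration, not data from any registry):

```r
# Hypothetical cohort: 10,000 patients, 2% prevalence (200 events).
truth <- c(rep(1, 200), rep(0, 9800))

# A "model" that predicts no event for everyone.
pred <- rep(0, length(truth))

accuracy    <- mean(pred == truth)                            # share of correct calls
sensitivity <- sum(pred == 1 & truth == 1) / sum(truth == 1)  # share of events caught

accuracy     # 0.98
sensitivity  # 0
```

Every one of the 200 events is missed, and accuracy never notices.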
Why AUROC Fails You When Events Are Rare
AUROC answers this question:
“Can the model rank positives higher than negatives?”
It does not answer:
- how many events you’ll actually catch,
- how many false alarms you’ll create,
- whether thresholds are usable.
In rare outcomes, AUROC can look excellent while recall is terrible. That is one reason precision–recall summaries are often more informative when prevalence is low (Davis and Goadrich 2006; King and Zeng 2001).
library(yardstick)
# yardstick expects truth as a factor and, by default, treats its first level as the event
roc_auc(preds, truth, .pred)
This number alone is not actionable.
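To make the gap concrete, here is a small simulation in base R. Everything in it is an assumption chosen for illustration: 1% prevalence, normally distributed scores, and events scoring two standard deviations higher on average. It is a sketch of the phenomenon, not a claim about any real model.

```r
set.seed(1)
n <- 100000
truth <- rbinom(n, 1, 0.01)                   # assumed 1% prevalence
score <- rnorm(n, mean = 2 * truth, sd = 1)   # events score higher on average

# AUROC via the rank (Mann-Whitney) formula
n1 <- sum(truth == 1)
n0 <- sum(truth == 0)
auroc <- (sum(rank(score)[truth == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

# Pick the threshold that catches roughly 80% of events
cutoff <- quantile(score[truth == 1], 0.20)
alert <- score >= cutoff
sens <- sum(alert & truth == 1) / n1
ppv  <- sum(alert & truth == 1) / sum(alert)

c(auroc = auroc, sensitivity = sens, ppv = ppv)
```

Under these assumptions the AUROC lands above 0.9, yet fewer than one alert in ten corresponds to a real event. The ranking is good; the alert stream is mostly noise.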
Precision–Recall Curves Are Necessary (but Not Sufficient)
Precision–recall (PR) curves focus on:
- positive cases,
- tradeoffs clinicians actually face.
pr_auc(preds, truth, .pred)
PR curves improve visibility, but they still:
- average over thresholds,
- hide operational constraints,
- obscure asymmetric harm.
They’re better — not enough.
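One thing a PR summary does give you is an honest baseline: alerting at random yields precision equal to prevalence. The sketch below computes average precision by hand on hypothetical toy data. Average precision is one common PR summary and is not identical to the trapezoidal area that pr_auc() reports, so treat it as an illustration of the idea rather than a replacement.

```r
average_precision <- function(score, truth) {
  ord <- order(score, decreasing = TRUE)      # rank patients from highest to lowest score
  hits <- truth[ord] == 1
  precision_at_k <- cumsum(hits) / seq_along(hits)
  mean(precision_at_k[hits])                  # average the precision at each true event
}

# Toy data (hypothetical): 2 events among 10 patients
truth <- c(0, 0, 1, 0, 0, 0, 0, 1, 0, 0)
score <- c(0.10, 0.40, 0.90, 0.20, 0.05, 0.30, 0.15, 0.35, 0.25, 0.12)

ap <- average_precision(score, truth)
baseline <- mean(truth)   # precision of alerting at random

ap        # 5/6, about 0.83
baseline  # 0.2
```

Always report the summary next to the baseline. A PR-AUC of 0.30 means something very different at 2% prevalence than at 25%.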
Thresholds Are Ethical Decisions, Not Technical Ones
In rare outcomes:
- false negatives often cost lives,
- false positives cost time, resources, trust.
Choosing a threshold answers:
“Who are we willing to miss?”
That is not a statistical question. It’s an ethical one.
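One way to make the value judgment visible is to write it down as a cost ratio. For calibrated risks, and assuming correct decisions carry no cost, expected cost is minimized by alerting when the risk exceeds C_FP / (C_FP + C_FN). The costs below are hypothetical placeholders; choosing them is exactly the ethical decision.

```r
# Hypothetical relative costs; these are value judgments, not estimates.
cost_false_negative <- 50   # harm of a missed event
cost_false_positive <- 1    # harm of an unnecessary alert

# Alerting is worth it when p * cost_false_negative > (1 - p) * cost_false_positive,
# which rearranges to the threshold below.
threshold <- cost_false_positive / (cost_false_positive + cost_false_negative)
threshold  # 1/51, about 0.0196
```

Saying "a miss is 50 times worse than a false alarm" out loud is uncomfortable. It is also more honest than picking 0.5 by default.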
Evaluate at Decision-Relevant Thresholds
Stop reporting only global metrics.
Evaluate:
- sensitivity at fixed false-positive rates,
- PPV at clinically feasible alert volumes,
- performance under resource constraints.
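library(dplyr)
# Assumes truth is logical or 0/1 here (TRUE/1 = event), not a factor
preds |>
  mutate(alert = .pred > 0.15) |>
  summarise(
    sensitivity = sum(alert & truth) / sum(truth),  # events caught / all events
    ppv = sum(alert & truth) / sum(alert)           # events caught / all alerts
  )
If a threshold can’t be used, it doesn’t matter how good the curve looks.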
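Often the constraint runs the other way: the team can handle a fixed number of alerts, and the threshold follows from that. A base-R sketch under stated assumptions: simulated scores, 2% prevalence, and a hypothetical capacity of 50 alerts per 1,000 patients.

```r
set.seed(2)
n <- 5000
truth <- rbinom(n, 1, 0.02)                  # assumed 2% prevalence
score <- plogis(-4 + 2 * truth + rnorm(n))   # simulated risk scores, illustration only

alerts_per_1000 <- 50                        # hypothetical workflow capacity
k <- floor(n * alerts_per_1000 / 1000)       # alerts the team can actually handle

# Flag the k highest-risk patients; ties are broken by row order
flagged <- rank(-score, ties.method = "first") <= k

sensitivity <- sum(flagged & truth == 1) / sum(truth == 1)
ppv <- sum(flagged & truth == 1) / k

c(alerts = sum(flagged), sensitivity = sensitivity, ppv = ppv)
```

Reporting sensitivity and PPV at the alert volume the unit can absorb answers the question clinicians actually ask.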
Calibration Matters More Than Discrimination
In rare outcomes, miscalibration kills trust.
If a model says “20% risk” and the true risk is 2%, clinicians stop listening.
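library(ggplot2)
# truth must be numeric 0/1 here so the smoother estimates an observed event rate
ggplot(preds, aes(x = .pred, y = truth)) +
  geom_smooth(method = "loess") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed")  # perfect calibration
A well-calibrated modest model beats a poorly calibrated “strong” one. In clinical prediction, calibration is often the deciding feature for whether a model can actually support action (Van Calster et al. 2016; Steyerberg 2019).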
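A smoothed plot is easier to trust when there is a table behind it. The sketch below bins predictions into deciles and compares mean predicted risk with the observed event rate. The data are simulated, and the "model" is deliberately built to overestimate risk about threefold, so the miscalibration is an assumption of the example rather than a finding.

```r
set.seed(3)
n <- 20000
true_risk <- plogis(rnorm(n, mean = -4, sd = 1))   # simulated true risks, mostly low
truth <- rbinom(n, 1, true_risk)
pred <- pmin(true_risk * 3, 1)                     # a model that overestimates risk

# Group by predicted-risk decile, then compare prediction with reality
bin <- cut(pred, quantile(pred, 0:10 / 10), include.lowest = TRUE, labels = FALSE)
calib <- data.frame(
  mean_predicted = as.vector(tapply(pred, bin, mean)),
  observed_rate  = as.vector(tapply(truth, bin, mean)),
  n              = as.vector(table(bin))
)
calib

# Calibration-in-the-large: mean prediction relative to the overall event rate
citl_ratio <- mean(pred) / mean(truth)
citl_ratio
```

Note that this model ranks patients exactly as well as the true risks do, so its AUROC is untouched. Only the calibration table shows that every number it reports is inflated.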
Decision Curve Analysis Makes Tradeoffs Explicit
Decision curve analysis asks:
“Is using this model better than doing nothing — or everything?”
library(rmda)
decision_curve(
  truth ~ .pred,
  data = preds,
  fitted.risk = TRUE,  # treat .pred as a risk; otherwise a logistic model is refit on it
  thresholds = seq(0.01, 0.5, by = 0.01)
)
This reframes evaluation around net benefit, not elegance (Vickers and Elkin 2006).
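Net benefit itself is a short formula: true positives per patient, minus false positives per patient weighted by the odds of the threshold probability. A hand-rolled sketch on hypothetical toy data, handy for sanity-checking package output:

```r
net_benefit <- function(pred, truth, pt) {
  n <- length(truth)
  treat <- pred >= pt                      # patients the model would act on
  tp <- sum(treat & truth == 1)
  fp <- sum(treat & truth == 0)
  tp / n - (fp / n) * pt / (1 - pt)        # false positives weighted by threshold odds
}

# Toy data (hypothetical): 10 patients, 2 events
truth <- c(1, 0, 0, 1, 0, 0, 0, 0, 0, 0)
pred  <- c(0.60, 0.30, 0.05, 0.25, 0.10, 0.02, 0.08, 0.04, 0.03, 0.01)

pt <- 0.20
nb_model <- net_benefit(pred, truth, pt)         # use the model
nb_all   <- net_benefit(rep(1, 10), truth, pt)   # treat everyone
nb_none  <- 0                                    # treat no one

c(model = nb_model, treat_all = nb_all, treat_none = nb_none)
# model 0.175, treat_all 0, treat_none 0
```

Here the threshold equals the prevalence, so treating everyone breaks even with treating no one, and the model is the only strategy with positive net benefit.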
Class Imbalance Is Not a Modeling Problem — It’s a Reality Problem
Oversampling, weighting, and SMOTE (Chawla et al. 2002):
- change training behavior,
- do not change the real-world base rate.
Evaluation must always return to:
- true prevalence,
- real alert volumes,
- actual workflow capacity.
Never evaluate on artificial prevalence without stating it clearly.
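If a model was trained on resampled data, its predicted risks can be mapped back to the real base rate by rescaling the odds. The sketch below assumes the resampling changed only the class mix and left the predictor distributions within each class intact; the function name and prevalences are illustrative.

```r
# Map a risk from a model trained at an artificial prevalence back to the true base rate.
correct_prevalence <- function(p, train_prev, true_prev) {
  odds  <- p / (1 - p)
  shift <- (true_prev / (1 - true_prev)) / (train_prev / (1 - train_prev))
  adj   <- odds * shift                    # odds at the real-world prevalence
  adj / (1 + adj)
}

p_mid  <- correct_prevalence(0.5, train_prev = 0.5, true_prev = 0.02)
p_high <- correct_prevalence(0.9, train_prev = 0.5, true_prev = 0.02)

p_mid   # 0.02
p_high  # about 0.155
```

A "90% risk" from a model trained on balanced classes is closer to 16% at a 2% base rate. That gap is why artificial prevalence has to be stated, and preferably undone, before evaluation.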
Hierarchy and Rare Events Interact Badly
In healthcare:
- rare events cluster,
- sites vary,
- volumes differ.
Flat models often:
- exaggerate site differences,
- produce unstable estimates,
- mislead benchmarking.
Hierarchical models stabilize rare-event inference.
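A rough way to see what "stabilize" means, before fitting anything: pull each site's raw rate toward the pooled rate, with small sites pulled hardest. The site counts and the prior strength m below are hypothetical, and fixing m by hand is a stand-in for illustration only.

```r
# Hypothetical site-level counts: small sites give noisy raw rates.
sites <- data.frame(
  site   = c("A", "B", "C"),
  events = c(0, 3, 40),
  n      = c(20, 25, 2000)
)

pooled <- sum(sites$events) / sum(sites$n)   # overall event rate
m <- 100                                     # assumed prior strength, in pseudo-patients

sites$raw    <- sites$events / sites$n
sites$shrunk <- (sites$events + m * pooled) / (sites$n + m)
sites
```

Site A's raw rate of 0% and site B's 12% both move toward the pooled 2.1%, while the large site C barely moves. A hierarchical model estimates the degree of pooling from the data rather than fixing it in advance.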
library(brms)
fit <- brm(
  outcome ~ predictor + (1 | site),
  data = data,
  family = bernoulli()
)
Sensitivity Analysis Is Not Optional
For rare outcomes, ask:
- What if prevalence shifts?
- What if documentation changes?
- What if thresholds move slightly?
If conclusions flip easily, that’s critical information — not failure.
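The prevalence question can be answered on the back of an envelope. The sketch below holds sensitivity and specificity fixed at hypothetical values (itself an assumption, since case mix often shifts along with prevalence) and shows how PPV, alert volume, and missed events move.

```r
# Assumed operating characteristics at the chosen threshold (hypothetical values)
sens <- 0.80
spec <- 0.90

prevalence <- c(0.01, 0.02, 0.05)
tp <- sens * prevalence                # true-positive fraction of all patients
fp <- (1 - spec) * (1 - prevalence)    # false-positive fraction of all patients

shift <- data.frame(
  prevalence,
  ppv             = tp / (tp + fp),
  alerts_per_1000 = 1000 * (tp + fp),
  missed_per_1000 = 1000 * (1 - sens) * prevalence
)
shift
```

Moving from 1% to 5% prevalence takes PPV from about 7% to about 30% and quintuples the missed events per 1,000, with no change to the model at all.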
Reporting Rare-Outcome Performance Honestly
An honest report includes:
- prevalence,
- alert volume at chosen thresholds,
- false negatives per 1000 patients,
- calibration plots,
- decision curves,
- explicit tradeoffs.
Anything less is marketing.
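The numeric parts of that list fit in one small function. This is a sketch with a hypothetical name and toy data, not a standard API; calibration plots and decision curves still need their own outputs.

```r
rare_event_report <- function(pred, truth, threshold) {
  alert <- pred >= threshold
  n <- length(truth)
  c(
    prevalence               = mean(truth == 1),
    alerts_per_1000          = 1000 * sum(alert) / n,
    false_negatives_per_1000 = 1000 * sum((!alert) & truth == 1) / n,
    sensitivity              = sum(alert & truth == 1) / sum(truth == 1),
    ppv                      = sum(alert & truth == 1) / sum(alert)
  )
}

# Toy data (hypothetical): 10 patients, 2 events
truth <- c(1, 0, 0, 1, 0, 0, 0, 0, 0, 0)
pred  <- c(0.60, 0.30, 0.05, 0.10, 0.10, 0.02, 0.08, 0.04, 0.03, 0.01)

report <- rare_event_report(pred, truth, threshold = 0.15)
report
# prevalence 0.2, alerts_per_1000 200, false_negatives_per_1000 100,
# sensitivity 0.5, ppv 0.5
```

Stating misses per 1,000 patients next to alert volume puts both sides of the tradeoff on the same page.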
Trauma mortality, massive transfusion requirement, and surgical complication are all rare outcomes, typically 5–15% prevalence in DoDTR data, and standard accuracy metrics catastrophically misrepresent model performance in these settings. A model that predicts “no massive transfusion” for every patient achieves 88% accuracy on registry data while being useless for the clinical decision MAVEN was built to support. Precision–recall curves and F-measure at operationally relevant thresholds are the appropriate evaluation framework. The choice of threshold is itself a clinical decision: in a forward surgical setting with limited blood product availability, the cost of a missed massive transfusion protocol (MTP) activation differs sharply from the cost of an unnecessary one, and that asymmetry must be encoded in the evaluation metric before any model comparison is meaningful. Evaluating trauma AI on accuracy is not just a methodological error; it is a failure to take the clinical context seriously.
Closing: Evaluation Is a Moral Act
When outcomes are rare:
- metrics are not neutral,
- thresholds encode values,
- and errors have unequal cost.
Model evaluation is not a formality. It is where responsibility lives.
If your evaluation makes rare outcomes look easy, you’ve probably evaluated the wrong thing.
This post is part of the Rare Events Toolkit — a companion reference with PR-AUC templates, calibration plots by site, and reviewer-safe evaluation language for rare clinical outcomes.
Series Callout
This post is part of a broader Trauma Registry and Other Topics Series:
- Why Most Clinical Models Fail in the Real World (and How to Fix Them in R)
- Audit-Ready Applied Statistics: How to Make Your R Analysis Defensible
- Bayesian Models for Clinicians Who Hate Math (But Love Good Decisions)
- Missing Data Is the Real Model: Practical Strategies in R
- From Registry to Knowledge: How to Analyze Messy Trauma Data Without Lying to Yourself
- Why Statistical Significance Is a Terrible Stopping Rule
- Hierarchical Models Are Not Optional in Healthcare (Here’s Why)
- Prediction ≠ Causation: How to Use Each Correctly in Applied Statistics
- How to Evaluate Models When the Outcome Is Rare (and Lives Are at Stake)
- Building Clinical Decision Support That Doesn’t Collapse Under Scrutiny
- Rare Event Modeling in Clinical Prediction: Why 1% Outcomes Break Your Model (And What to Do in R)
- Calibration Under Drift: How Clinical Models Become Confident and Wrong (And How to Monitor It in R)
- Audit-Ready Bayesian Workflows: Why Transparency Is a Process, Not a Model Feature
- Missing Data in Hierarchical Clinical Models: Why Structure Changes the Problem
- MNAR Sensitivity Analysis for Applied Work: What to Do When Missingness Depends on Reality