Rare Event Modeling in Clinical Prediction: Why 1% Outcomes Break Your Model (And What to Do in R)

Trauma Registry and Other Topics

Why rare clinical outcomes destabilize prediction models, how to evaluate them honestly, and how to build defensible rare-event workflows in R.

Published

June 1, 2025

Modified

June 9, 2026

Executive Summary

Many of the outcomes we care about most in clinical medicine are rare:

Hemorrhagic shock
Unexpected ICU transfer
Cardiac arrest
Mortality

Ironically, these are the exact settings where standard modeling workflows often perform worst, especially when evaluation centers on accuracy instead of calibration, decision utility, and small-event behavior (King and Zeng 2001; Steyerberg 2019; Harrell 2015).

This post explains why rare events destabilize clinical models, why popular fixes like SMOTE are often misapplied, and how to build defensible, clinically honest rare-event models in R.

This is not about “fixing class imbalance.” It is about respecting probability under extreme asymmetry.

Why Rare Events Break Otherwise “Good” Models

Accuracy Optimizes the Wrong World

In a dataset with a 1% event rate, a model that predicts “no” for everyone achieves:

99% accuracy
Zero clinical utility

Yet many training pipelines implicitly reward this behavior.

# Accuracy of an "always predict no event" rule equals the non-event rate
mean(outcome == 0)

Rare events shift the optimization landscape:

Likelihood is dominated by non-events
Gradient signal from the few events is easily swamped by the many non-events
Default thresholds become meaningless
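A small simulation makes the accuracy paradox concrete. This is a sketch on synthetic data; the names `outcome` and `pred_all_negative` are illustrative assumptions and do not refer to any real registry.

```r
# Simulate a 1% outcome and score the "always predict no event" classifier
set.seed(2025)
n <- 10000
outcome <- rbinom(n, size = 1, prob = 0.01)

# A degenerate classifier: predict "no event" for every patient
pred_all_negative <- rep(0, n)

accuracy    <- mean(pred_all_negative == outcome)
sensitivity <- sum(pred_all_negative == 1 & outcome == 1) / sum(outcome == 1)

accuracy     # close to 0.99, driven entirely by the non-events
sensitivity  # exactly 0: every true event is missed
```

Any metric that this classifier can score well on is a poor primary target for a rare-event model.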

Failure Mode #1: AUROC Looks Fine While Risk Is Wrong

Discrimination ≠ Risk Estimation

AUROC answers a narrow question:

Can the model rank cases above controls?

It does not answer:

Are probabilities correct?
Are decisions actionable?
Are false negatives catastrophic?

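library(yardstick)

# preds: data frame with factor `truth` (levels "0", "1") and numeric `.pred_1`.
# yardstick treats the first factor level as the event by default, so name the second.
roc_auc(preds, truth, .pred_1, event_level = "second")
pr_auc(preds, truth, .pred_1, event_level = "second")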

In rare-event settings, precision–recall curves often better reflect clinical relevance because the positive class is scarce and false reassurance can dominate seemingly strong discrimination summaries (King and Zeng 2001; Harrell 2015).
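To see how the two summaries can diverge, the following base-R sketch computes AUROC and average precision (one common step-wise summary of the precision–recall curve) on simulated data. The score distribution and the helper names `auroc()` and `avg_precision()` are assumptions made for illustration; they are not yardstick functions.

```r
set.seed(42)
n <- 20000
truth <- rbinom(n, 1, 0.01)
# Hypothetical risk score: events score higher on average
score <- rnorm(n, mean = 1.5 * truth)

# AUROC via the rank (Mann-Whitney) formulation
auroc <- function(y, s) {
  r  <- rank(s)
  n1 <- as.numeric(sum(y == 1))
  n0 <- as.numeric(sum(y == 0))
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Average precision: mean of the precision at the rank of each true event
avg_precision <- function(y, s) {
  y_sorted <- y[order(s, decreasing = TRUE)]
  precision_at_k <- cumsum(y_sorted) / seq_along(y_sorted)
  mean(precision_at_k[y_sorted == 1])
}

c(
  prevalence    = mean(truth),
  auroc         = auroc(truth, score),
  avg_precision = avg_precision(truth, score)
)
```

With a 1% event rate, a score that ranks well can still yield low precision, because even a small false-positive rate among the many non-events outnumbers the true events.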

Does Rare Event Modeling Use SMOTE?

Sometimes — and Often Incorrectly

SMOTE and related oversampling methods attempt to rebalance the data by synthetically generating cases.

This can help when:

The goal is ranking or screening
The feature space is well-behaved
Calibration is not the primary output

But in clinical prediction, SMOTE often:

Creates implausible “patients”
Distorts absolute risk
Breaks interpretability and auditability

Oversampling fixes class counts. It does not fix probability, and recalibration is still required if absolute risk is the target (Steyerberg 2019; Harrell 2015).
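The distortion of absolute risk can be shown directly. In this sketch, simple replication of events stands in for SMOTE (SMOTE interpolates new points rather than copying, but it shifts the class prior in the same direction). The correction shown is the intercept prior correction described by King and Zeng (2001); it is exact only for a correctly specified logistic model and should be read as an approximation for SMOTE itself. All data and variable names are synthetic assumptions.

```r
set.seed(1)
n <- 50000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-5 + x))   # true event rate is roughly 1%
dat <- data.frame(y = y, x = x)

# Naive oversampling: add 19 extra copies so each event appears 20 times
events   <- dat[dat$y == 1, ]
balanced <- rbind(dat, events[rep(seq_len(nrow(events)), 19), ])

fit_over <- glm(y ~ x, data = balanced, family = binomial())
p_over   <- predict(fit_over, newdata = dat, type = "response")

# Prior correction of the intercept (King and Zeng 2001):
# subtract log[((1 - tau) / tau) * (ybar / (1 - ybar))] from the linear predictor
tau   <- mean(dat$y)        # event rate in the original data
ybar  <- mean(balanced$y)   # event rate after oversampling
shift <- log(((1 - tau) / tau) * (ybar / (1 - ybar)))
p_corrected <- plogis(predict(fit_over, newdata = dat, type = "link") - shift)

c(observed = tau, oversampled = mean(p_over), corrected = mean(p_corrected))
```

The oversampled model's average predicted risk sits far above the observed event rate, while the corrected predictions return to roughly the right scale.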

Preferred Fixes (Before You Ever Reach for SMOTE)

Use Cost-Sensitive Learning

Let the model know that errors are asymmetric (Elkan 2001).

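# Up-weight events 10:1; the ratio is an illustrative choice, not a recommendation
train <- train |>
  dplyr::mutate(
    w = dplyr::if_else(outcome == 1, 10, 1)
  )

# Exclude the weight column from the predictors: `outcome ~ .` would otherwise
# include `w`, which is a deterministic function of the outcome
fit_weighted <- glm(
  outcome ~ . - w,
  data = train,
  family = binomial(),
  weights = w
)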

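This preserves the data while shifting the objective. Because up-weighting events also inflates predicted risk, recalibrate before reporting absolute probabilities.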
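A 10:1 weight can also be read as a statement about costs. Under the standard result in Elkan (2001), if correct decisions cost nothing, acting is optimal when the calibrated risk exceeds the false-positive cost divided by the sum of the false-positive and false-negative costs. The helper `cost_threshold()` and the risk values below are hypothetical, included only to show the arithmetic.

```r
# Decision threshold implied by misclassification costs (Elkan 2001),
# assuming correct decisions carry zero cost
cost_threshold <- function(cost_fp, cost_fn) {
  cost_fp / (cost_fp + cost_fn)
}

# Missing an event is judged 10 times worse than a false alarm
threshold <- cost_threshold(cost_fp = 1, cost_fn = 10)
threshold  # 1/11, about 0.091

# Apply the threshold to calibrated risk estimates
risk <- c(0.02, 0.08, 0.10, 0.35)   # hypothetical predicted risks
act  <- risk >= threshold
act  # FALSE FALSE TRUE TRUE
```

Moving the threshold on a calibrated model and reweighting the loss are two routes to the same asymmetry; the first keeps the probabilities interpretable.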

Penalization and Bias Reduction

Rare events make separation and coefficient blow-up more likely.

Penalization stabilizes the estimates by shrinking coefficients toward zero.

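library(glmnet)

# Numeric design matrix; "- 1" drops the intercept column because glmnet adds its own
x <- model.matrix(outcome ~ . - 1, data = train)
y <- train$outcome

# alpha = 1 is the lasso; alpha = 0 gives ridge shrinkage without variable selection
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)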

The goal is not sparsity. It is controlled optimism.

Bayesian Logistic Models (Often the Cleanest Fix)

Informative priors naturally regularize rare-event models.

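library(brms)

# `predictors` is a placeholder for the model's right-hand side.
# A normal(0, 1) prior on coefficients is easiest to defend for standardized predictors.
fit <- brm(
  outcome ~ predictors,
  data = train,
  family = bernoulli(),
  prior = prior(normal(0, 1), class = "b")
)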

Bayesian models:

Reduce overconfidence
Quantify uncertainty honestly
Fail more gracefully under scarcity

These advantages are strongest when priors are chosen deliberately and performance is still checked on clinically meaningful scales (Gelman et al. 2013, 2020).
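One way to choose a prior deliberately is to translate it onto the odds-ratio scale before fitting anything. The sketch below uses only base R and assumes predictors are standardized, so a coefficient is a log odds ratio per standard deviation; the wider comparison prior is an illustrative assumption.

```r
# What a normal(0, 1) prior on a log-odds coefficient implies for the odds ratio
prior_sd <- 1
or_interval <- exp(qnorm(c(0.025, 0.975), mean = 0, sd = prior_sd))
round(or_interval, 2)  # roughly 0.14 to 7.10

# A much wider prior, normal(0, 10), allows implausibly large effects
wide_interval <- exp(qnorm(c(0.025, 0.975), mean = 0, sd = 10))
wide_interval

# A 1% baseline risk expressed on the logit scale used for the intercept
qlogis(0.01)  # about -4.6
```

A prior that puts most of its mass on odds ratios between about 0.14 and 7 is a mild constraint for most clinical predictors, yet it is enough to prevent the extreme estimates that sparse events otherwise produce.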

If You Use SMOTE, Do It Safely

If you choose SMOTE:

Apply it inside resampling
Never oversample the test set
Always recalibrate afterward

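# With tidymodels, put SMOTE in the recipe so it is re-estimated within each CV fold (needs a factor outcome and numeric predictors)
rec <- recipes::recipe(outcome ~ ., data = train) |> themis::step_smote(outcome)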

Treat SMOTE as a data augmentation tactic, not a modeling philosophy.
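The fold structure matters more than the specific resampler. The base-R sketch below uses random oversampling as a stand-in for SMOTE and shows the essential pattern: resample only the training portion of each fold, and predict on a held-out portion that is never touched. The data are simulated and `oversample_events()` is a hypothetical helper.

```r
set.seed(7)
n <- 5000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-4.5 + 0.8 * dat$x1 + 0.5 * dat$x2))

# Stratified fold assignment so every fold contains events
k <- 5
dat$fold <- NA_integer_
for (cls in 0:1) {
  idx <- which(dat$y == cls)
  dat$fold[idx] <- sample(rep_len(seq_len(k), length(idx)))
}

# Hypothetical helper: resample events with replacement until classes are balanced
oversample_events <- function(d) {
  ev    <- which(d$y == 1)
  extra <- sample(ev, size = sum(d$y == 0) - length(ev), replace = TRUE)
  d[c(seq_len(nrow(d)), extra), ]
}

cv_pred <- numeric(n)
for (f in seq_len(k)) {
  train_fold <- dat[dat$fold != f, ]
  test_fold  <- dat[dat$fold == f, ]   # never resampled
  fit <- glm(y ~ x1 + x2, data = oversample_events(train_fold), family = binomial())
  cv_pred[dat$fold == f] <- predict(fit, newdata = test_fold, type = "response")
}

# Held-out predictions are inflated relative to the observed event rate
c(observed = mean(dat$y), mean_predicted = mean(cv_pred))
```

Even with the folds handled correctly, the held-out probabilities sit far above the observed rate, which is why recalibration remains the final step.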

Evaluation That Matches Clinical Reality

Rare-event models should be evaluated on:

Precision at clinically relevant thresholds
Expected utility
Calibration error

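# Brier score: mean squared error of predicted probabilities (reflects calibration and discrimination together)
brier <- with(preds, mean((as.integer(truth == "1") - .pred_1)^2))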

If your model cannot produce reliable probabilities, it cannot support decisions.
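The items in the list above can be computed with base R alone. This sketch assumes a deliberately overconfident model on simulated data and reports calibration-in-the-large, calibration slope, and net benefit at one threshold (net benefit is the decision-curve form of expected utility). The threshold of 0.05 and the helper `net_benefit()` are illustrative assumptions.

```r
set.seed(11)
n <- 20000
x <- rnorm(n)
truth <- rbinom(n, 1, plogis(-5 + x))
# Hypothetical overconfident model: predictions too extreme on the logit scale
pred <- plogis(1.5 * (-5 + x))
lp   <- qlogis(pred)

# Calibration-in-the-large: intercept with the linear predictor as a fixed offset (ideal 0)
cal_intercept <- coef(glm(truth ~ 1, offset = lp, family = binomial()))[1]
# Calibration slope: coefficient on the linear predictor (ideal 1)
cal_slope <- coef(glm(truth ~ lp, family = binomial()))[2]

# Net benefit at decision threshold pt: true positives minus weighted false positives
net_benefit <- function(y, p, pt) {
  act <- p >= pt
  tp  <- sum(act & y == 1)
  fp  <- sum(act & y == 0)
  tp / length(y) - fp / length(y) * pt / (1 - pt)
}

pt <- 0.05
c(
  cal_intercept = unname(cal_intercept),
  cal_slope     = unname(cal_slope),
  nb_model      = net_benefit(truth, pred, pt),
  nb_treat_all  = net_benefit(truth, rep(1, n), pt)
)
```

A slope well below 1 signals predictions that are too extreme, and a positive intercept signals systematic underestimation of risk. Comparing net benefit against the treat-all strategy shows whether the model adds anything at the threshold clinicians would actually use.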

Where This Shows Up in AI/ML

SMOTE and other synthetic oversampling methods are widely used in rare-event modeling based on the Department of Defense Trauma Registry (DoDTR). The synthetic training examples they create do not correspond to real patients, so they can inflate apparent minority-class performance during cross-validation while producing poorly calibrated probabilities in deployment.

For a MAVEN massive transfusion alert, where the stated probability drives a threshold decision about activating a blood product protocol, miscalibration in the positive class is not a minor technical flaw. It is a patient safety issue.

Calibration-aware rare-event modeling, with isotonic regression or Platt scaling applied post hoc to the raw model scores, is required before any trauma AI tool’s probability outputs can be trusted for clinical decision support. Oversampling may improve discrimination; it does not fix calibration, and in trauma AI those are different things.
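Both post hoc recalibration methods are available in base R. The sketch below simulates raw scores that overstate risk, as a model trained on rebalanced data might, then fits a Platt-style logistic recalibration (on the logit of the raw score) and an isotonic regression on a separate calibration set. The data, the size of the inflation, and the 50/50 split are assumptions for illustration.

```r
set.seed(3)
n <- 20000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-5 + x))
# Hypothetical raw scores from a model trained on rebalanced data: risk is inflated
raw <- plogis(-1 + x)

# Separate calibration and evaluation sets
cal_idx <- sample(n, n / 2)
cal  <- data.frame(y = y[cal_idx],  raw = raw[cal_idx])
test <- data.frame(y = y[-cal_idx], raw = raw[-cal_idx])

# Platt-style scaling: logistic regression of the outcome on the logit of the raw score
platt   <- glm(y ~ qlogis(raw), data = cal, family = binomial())
p_platt <- predict(platt, newdata = test, type = "response")

# Isotonic regression: monotone step function from raw score to observed event rate
iso     <- isoreg(cal$raw, cal$y)
iso_fun <- as.stepfun(iso)
p_iso   <- iso_fun(test$raw)

c(
  observed = mean(test$y),
  raw      = mean(test$raw),
  platt    = mean(p_platt),
  isotonic = mean(p_iso)
)
```

Neither method changes the ranking of patients in any meaningful way (isotonic regression can introduce ties), so discrimination is essentially unchanged while the probabilities move back toward the observed event rate. With very few events, the isotonic fit can be unstable, and the parametric option is often the safer default.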

Closing Thoughts

Rare event modeling is not about tricking algorithms into seeing positives.

It is about:

Acknowledging asymmetry
Respecting uncertainty
Designing for consequences

The rarest outcomes are often the most important. They deserve models built with restraint.

📚 Go Deeper: Rare Events Toolkit

This post is part of the Rare Events Toolkit — a companion reference with PR-AUC templates, weighted loss examples, and Bayesian prior sensitivity checks for rare clinical outcomes.

Series Callout

This post is part of a broader Trauma Registry and Other Topics Series:

Why Most Clinical Models Fail in the Real World (and How to Fix Them in R)
Audit-Ready Applied Statistics: How to Make Your R Analysis Defensible
Bayesian Models for Clinicians Who Hate Math (But Love Good Decisions)
Missing Data Is the Real Model: Practical Strategies in R
From Registry to Knowledge: How to Analyze Messy Trauma Data Without Lying to Yourself
Why Statistical Significance Is a Terrible Stopping Rule
Hierarchical Models Are Not Optional in Healthcare (Here’s Why)
Prediction ≠ Causation: How to Use Each Correctly in Applied Statistics
How to Evaluate Models When the Outcome Is Rare (and Lives Are at Stake)
Building Clinical Decision Support That Doesn’t Collapse Under Scrutiny
Rare Event Modeling in Clinical Prediction: Why 1% Outcomes Break Your Model (And What to Do in R)
Calibration Under Drift: How Clinical Models Become Confident and Wrong (And How to Monitor It in R)
Audit-Ready Bayesian Workflows: Why Transparency Is a Process, Not a Model Feature
Missing Data in Hierarchical Clinical Models: Why Structure Changes the Problem
MNAR Sensitivity Analysis for Applied Work: What to Do When Missingness Depends on Reality

References

Elkan, Charles. 2001. “The Foundations of Cost-Sensitive Learning.” Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence 2: 973–78.

Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2013. Bayesian Data Analysis. 3rd ed. Chapman and Hall/CRC.

Gelman, Andrew, Aki Vehtari, Daniel Simpson, et al. 2020. Bayesian Workflow. https://arxiv.org/abs/2011.01808.

Harrell, Jr., Frank E. 2015. Regression Modeling Strategies. 2nd ed. Springer.

King, Gary, and Langche Zeng. 2001. “Logistic Regression in Rare Events Data.” Political Analysis 9 (2): 137–63. https://doi.org/10.1093/oxfordjournals.pan.a004868.

Steyerberg, Ewout W. 2019. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. 2nd ed. Springer.