Registry Foundations: Why Models Fail & How to Analyze Honestly

Trauma Registry Analytics — Lecture 1 of 5

Jonathan D. Stallings, PhD, MS

Data InDeed | dataindeed.org

2026-01-01

Most clinical models fail not because of bad math but because of bad assumptions about what the data actually represent.

What You’ll Learn Today

Post 01: Why Clinical Models Fail

  • The myth of good performance
  • 5 failure modes with R examples
  • Dataset shift as the default
  • A different definition of success

Post 02: Audit-Ready Analysis

  • The audit triangle
  • Analysis contracts
  • Provenance and traceability
  • Reproducible folder structure

Post 05: Analyzing Registry Data

  • The unit of analysis problem
  • Time as an abused variable
  • Proxies vs. measures
  • Sensitivity as discipline

Part 1

Why Clinical Models Fail

The five failure modes that happen before the model is even wrong

The Myth of “Good Performance”

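# Packages used by the plotting chunks in this deck. theme_di() is the deck's own
# ggplot2 theme (defined in the deck setup, not shown); use theme_minimal() without it.
library(dplyr); library(tidyr); library(tibble); library(ggplot2)

# Model A: higher AUC, but predicted risks are shifted upward
# Model B: lower AUC, no systematic upward shift
n <- 1000
set.seed(42)
true_p <- rbeta(n, 1.5, 8)  # true risk: mostly low, some high
y      <- rbinom(n, 1, true_p)

# Model A: discriminates well, but overestimates risk across the range
# (noise has mean +0.5 on the logit scale)
pred_a <- pmin(plogis(qlogis(true_p) + rnorm(n, 0.5, 0.8)), 0.99)
# Model B: slightly less discriminating (more noise), centered on the true logit
pred_b <- plogis(qlogis(true_p) + rnorm(n, 0, 1.1))

auc_a <- as.numeric(pROC::auc(pROC::roc(y, pred_a, quiet=TRUE)))
auc_b <- as.numeric(pROC::auc(pROC::roc(y, pred_b, quiet=TRUE)))

# Label each model with its computed AUC instead of a hard-coded value
bind_rows(
  tibble(model=sprintf("A (AUC=%.2f, miscalibrated)", auc_a), pred=pred_a, y=y),
  tibble(model=sprintf("B (AUC=%.2f, calibrated)", auc_b),    pred=pred_b, y=y)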
) |>
  mutate(decile=ntile(pred, 10)) |>
  group_by(model, decile) |>
  summarise(pred_mean=mean(pred), obs=mean(y), .groups="drop") |>
  ggplot(aes(pred_mean, obs, color=model)) +
  geom_abline(linetype=2, color="#64748b") +
  geom_line(linewidth=1) + geom_point(size=3) +
  scale_color_manual(values=c("#e63946","#0891b2")) +
  labs(title="Model A has higher AUC but is miscalibrated — Model B is safer for decisions",
       x="Mean predicted risk", y="Observed event rate", color=NULL) +
  theme_di()

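AUC measures ranking: whether higher-risk patients receive higher scores. Calibration measures whether the predicted probabilities match observed event rates. A model that reports 40% risk when the true risk is 15% will trigger threshold-based decisions for the wrong patients, regardless of its AUC.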
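The gap between ranking and calibration can be checked numerically. The sketch below is illustrative only: `auc_rank` and `calibration_summary` are hypothetical helper names, and the simulation (a pure +0.5 logit shift with no added noise, n = 5000) is an assumption chosen so the effect is easy to see. It compares the true risks against an inflated copy using AUC, calibration-in-the-large, calibration slope, and the Brier score.

```r
# Rank-based (Mann-Whitney) AUC; hypothetical helper, base R only
auc_rank <- function(y, pred) {
  r  <- rank(pred)
  n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (as.numeric(n1) * n0)
}

# Calibration summaries for a vector of predicted risks; hypothetical helper
calibration_summary <- function(y, pred) {
  eps  <- 1e-6
  pred <- pmin(pmax(pred, eps), 1 - eps)    # keep qlogis() finite
  lp   <- qlogis(pred)                      # prediction on the logit scale
  # Calibration slope: coefficient on the logit of the prediction (ideal = 1)
  slope <- unname(coef(glm(y ~ lp, family = binomial))[2])
  # Calibration-in-the-large: intercept with the slope fixed at 1 (ideal = 0)
  citl  <- unname(coef(glm(y ~ 1 + offset(lp), family = binomial))[1])
  c(mean_pred = mean(pred), obs_rate = mean(y),
    citl = citl, slope = slope, brier = mean((pred - y)^2))
}

set.seed(42)
n      <- 5000
true_p <- rbeta(n, 1.5, 8)
y      <- rbinom(n, 1, true_p)
pred_shifted <- plogis(qlogis(true_p) + 0.5)  # every risk inflated on the logit scale

auc_true    <- auc_rank(y, true_p)
auc_shifted <- auc_rank(y, pred_shifted)      # same ranking, so same AUC
cal_true    <- calibration_summary(y, true_p)
cal_shifted <- calibration_summary(y, pred_shifted)

print(c(auc_true = auc_true, auc_shifted = auc_shifted))
print(round(rbind(true_risk = cal_true, shifted = cal_shifted), 3))
```

Because the shift is monotone, the two AUCs should agree, while the calibration intercept and Brier score should separate the two sets of predictions. That is the sense in which AUC alone cannot detect this kind of error.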

The Five Failure Modes

FM1: Dataset shift. Training population ≠ deployment population. ISS distributions, mechanism mix, and care standards change across time and theater.

FM2: Missing data treated as noise. Missingness in registries is structured: sicker patients have more missing labs. Ignoring that structure biases estimates.

FM3: Wrong metric optimized. Examples: accuracy on a 5% mortality outcome, where predicting that every patient survives scores 95%; or AUC reported without calibration. The metric shapes what the model learns to do.

FM4: Hierarchy ignored. Patients nest within facilities, facilities within theaters, and theaters vary across conflicts. Flat models treat correlated patients as independent and produce anti-conservative inference (standard errors that are too small).

FM5: Prediction treated as the final product. A risk score delivered without decision context, threshold justification, or operator training is not clinical decision support. It is a number.

All five are design failures — not modeling failures. The model can be mathematically correct while all five are present.
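Two of these failure modes are easy to reproduce in a few lines of base R. The sketch below is a toy illustration under stated assumptions: the 5% mortality rate comes from FM3, but the lab variable, its distribution, and the missingness mechanism are invented for the example and are not registry values.

```r
set.seed(1)
n <- 10000

# FM3: accuracy on a rare outcome rewards a useless rule
died     <- rbinom(n, 1, 0.05)          # roughly 5% mortality
pred_all <- rep(0L, n)                  # "predict survival for everyone"
acc_all_survive  <- mean(pred_all == died)
sens_all_survive <- mean(pred_all[died == 1] == 1)   # share of deaths detected
print(c(accuracy = acc_all_survive, sensitivity = sens_all_survive))

# FM2: structured missingness biases a complete-case summary
severity <- rnorm(n)                                        # latent illness severity
lab      <- exp(0.7 + 0.4 * severity + rnorm(n, 0, 0.3))    # hypothetical lab, higher when sicker
p_miss   <- plogis(-1 + 1.2 * severity)                     # sicker patients: lab missing more often
observed <- rbinom(n, 1, p_miss) == 0

mean_full <- mean(lab)              # what we would see with no missingness
mean_cc   <- mean(lab[observed])    # complete-case mean
print(c(full = mean_full, complete_case = mean_cc, prop_missing = mean(!observed)))
```

The all-survive rule scores high accuracy while detecting no deaths, and the complete-case mean sits below the full-data mean because the sickest patients are the ones dropped. Neither problem is visible from the model-fitting code itself.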

Dataset Shift: The Default Condition

# Train on 2018-2021 DoDTR analog; test on 2022-2024 (protocol change + population shift)
n_train <- 600; n_test <- 300

# ISS is an integer score bounded at 1 to 75, so round and clamp the normal draws
df_train <- tibble(
  iss   = pmin(pmax(round(rnorm(n_train, 26, 11)), 1), 75),
  era   = "Train (2018–21)",
  died  = rbinom(n_train, 1, plogis(-3.5 + 0.08*iss))
)
df_test <- tibble(
  iss   = pmin(pmax(round(rnorm(n_test, 31, 13)), 1), 75),   # higher severity in later era
  era   = "Test (2022–24)",
  died  = rbinom(n_test, 1, plogis(-4.0 + 0.08*iss))  # improved care
)

bind_rows(df_train, df_test) |>
  ggplot(aes(iss, fill=era, color=era)) +
  geom_density(alpha=0.4, linewidth=0.8) +
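  # Name the colors explicitly: ggplot2 orders the levels alphabetically, so
  # unnamed values would draw "Test" in blue and "Train" in red, contradicting the title
  scale_fill_manual(values=c("Train (2018–21)"="#2563eb", "Test (2022–24)"="#e63946")) +
  scale_color_manual(values=c("Train (2018–21)"="#2563eb", "Test (2022–24)"="#e63946")) +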
  labs(title="ISS distribution shift between training and test eras — model trained on blue, deployed to red",
       x="ISS", y="Density", fill=NULL, color=NULL) +
  theme_di()

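A model trained on the blue distribution will be systematically miscalibrated on the red population, not because the fit was wrong but because the world changed. In this simulation the baseline log-odds of death falls from −3.5 to −4.0 (improved care), so the trained model overpredicts mortality in the later era; the upward ISS shift also moves patients into a severity range that was sparser in training.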
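Forward-in-time validation makes this visible. The sketch below reuses the simulation's coefficients but with larger samples (5000 per era, an assumption made so the estimates are stable); `sim_era` is a hypothetical helper. It fits a logistic model on the training era and compares observed with expected deaths in the later era.

```r
set.seed(2024)

# Simulate one era: ISS rounded and clamped to 1..75, mortality from a logistic model
sim_era <- function(n, iss_mean, iss_sd, intercept) {
  iss  <- pmin(pmax(round(rnorm(n, iss_mean, iss_sd)), 1), 75)
  died <- rbinom(n, 1, plogis(intercept + 0.08 * iss))
  data.frame(iss = iss, died = died)
}

train <- sim_era(5000, 26, 11, -3.5)   # earlier era
test  <- sim_era(5000, 31, 13, -4.0)   # later era: sicker patients, better care

fit       <- glm(died ~ iss, family = binomial, data = train)
pred_test <- predict(fit, newdata = test, type = "response")

# Observed / expected ratio: 1 means calibrated in the large
oe_train <- mean(train$died) / mean(fitted(fit))
oe_test  <- mean(test$died)  / mean(pred_test)
print(round(c(oe_train = oe_train, oe_test = oe_test), 3))
```

An observed/expected ratio below 1 in the later era means the model predicts more deaths than occur. A random train/test split within 2018–21 would not reveal this, which is why the validation set should come from later in time.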

Part 2

Audit-Ready Analysis

The audit triangle: claim, data, code

What “Audit-Ready” Actually Means

An audit-ready analysis can answer three questions from any reviewer:

The Audit Triangle

  1. Claim — What exactly does this analysis assert?
  2. Data — Precisely which data version produced this result?
  3. Code — What sequence of operations produced the claim from that data?

All three must be traceable, reproducible, and stable over time.

What breaks auditability:

  • Results in Word docs separated from the code that generated them
  • Dataset version unrecorded (or “the latest pull”)
  • Analysis parameters hard-coded without documentation
  • Figures regenerated manually after model changes
  • Sensitivity analyses run ad hoc, not pre-specified

Military context: DoDTR analyses that inform clinical practice guidelines, resource allocation, or J9 briefs must be audit-ready. If a commander asks “how was this number calculated?”, the answer must be traceable — not reconstructed from memory.

The Analysis Contract

Before writing a single line of code:

# ── ANALYSIS CONTRACT ────────────────────────────────────────────────
# Title:    DoDTR Tourniquet Timing and Limb Salvage — Q3 2026 Report
# Author:   J. Stallings
# Date:     2026-08-01
# Data:     DoDTR_extract_20260730.csv  (SHA-256: a3f7c...)
# Question: Does prehospital tourniquet application within 30 min of
#           injury reduce limb amputation rate vs. > 30 min?
# Estimand: ATT among penetrating extremity injuries, ISS 10–40
# Primary outcome: Limb amputation within 30 days
# Primary analysis: Logistic regression with IPTW, stabilized weights
# Secondary:  Sensitivity to weight trimming (95th vs. 99th percentile)
# Exclusions: ISS outside 10–40, pediatric (age < 18), incomplete limb data
# ─────────────────────────────────────────────────────────────────────

This contract is committed to version control before analysis begins. Any deviation requires an amendment with rationale.
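One way to keep the code aligned with the contract is to store pre-specified parameters in a single object and loop over them, so no threshold is typed ad hoc. The sketch below is an assumption-laden illustration, not the author's pipeline: the weights are toy log-normal draws standing in for IPTW weights, "trimming" is implemented as capping at the percentile (some analysts drop observations instead), and `contract`, `cap_weights`, and `ess` are hypothetical names.

```r
# Parameters copied from the analysis contract, defined once
contract <- list(
  data_file    = "DoDTR_extract_20260730.csv",
  trim_pctiles = c(p95 = 0.95, p99 = 0.99)   # pre-specified sensitivity analysis
)

set.seed(7)
w <- rlnorm(2000, meanlog = 0, sdlog = 1)    # toy weights, NOT real IPTW weights

# Cap weights at a given percentile of their own distribution
cap_weights <- function(w, pctile) pmin(w, quantile(w, pctile))

# Kish effective sample size: how much information the weighted sample retains
ess <- function(w) sum(w)^2 / sum(w^2)

# Run every pre-specified variant in one pass
ess_by_cap <- sapply(contract$trim_pctiles, function(p) ess(cap_weights(w, p)))
print(round(c(untrimmed = ess(w), ess_by_cap), 1))
```

Because the percentiles live in `contract`, a reviewer can compare that object with the committed contract text and confirm that no additional trimming levels were tried and discarded.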

Provenance: Data Version Must Be Explicit

# At the top of every analysis script:
# In a real workflow, hash the actual data file:
# data_hash <- digest::digest("DoDTR_extract_20260730.csv", algo="sha256", file=TRUE)

data_file  <- "DoDTR_extract_20260730.csv"
data_hash  <- "a3f7c1d9e8b245f0c6d3a1e7b924f501e3d8c2..."  # example SHA-256
session_dt <- format(Sys.time(), "%Y-%m-%d %H:%M")

cat("Analysis provenance:\n")
#> Analysis provenance:
cat("  File:    ", data_file,  "\n")
#>   File:     DoDTR_extract_20260730.csv
cat("  SHA-256: ", data_hash,  "\n")
#>   SHA-256:  a3f7c1d9e8b245f0c6d3a1e7b924f501e3d8c2...
cat("  Run at:  ", session_dt, "\n")
#>   Run at:   2026-06-18 06:30
cat("  R:       ", R.version$version.string, "\n")
#>   R:        R version 4.6.0 (2026-04-24)

Why this matters: If a dataset is corrected, re-extracted, or filtered differently after the fact, its hash changes, so comparing the recorded hash with the file on disk exposes the discrepancy immediately. “Which version?” becomes answerable in seconds, not hours of email archaeology.
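Recording the hash is only half of the check; the script also has to compare it. A minimal sketch follows, assuming the digest package is installed. `verify_data_hash` is a hypothetical helper, and a temporary CSV stands in for the registry extract.

```r
# Stop the analysis if the file on disk does not match the recorded SHA-256
verify_data_hash <- function(path, expected_sha256) {
  actual <- digest::digest(path, algo = "sha256", file = TRUE)
  if (!identical(actual, expected_sha256)) {
    stop("Data hash mismatch for ", path,
         "\n  expected: ", expected_sha256,
         "\n  actual:   ", actual, call. = FALSE)
  }
  invisible(actual)
}

# Demonstration: a temporary file stands in for the extract
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, iss = c(9, 25, 41)), tmp, row.names = FALSE)
recorded <- digest::digest(tmp, algo = "sha256", file = TRUE)   # stored at extract time

verify_data_hash(tmp, recorded)   # passes silently

# Simulate a silent correction to one value in the extract
write.csv(data.frame(id = 1:3, iss = c(9, 25, 43)), tmp, row.names = FALSE)
mismatch <- tryCatch({ verify_data_hash(tmp, recorded); FALSE },
                     error = function(e) TRUE)
print(mismatch)
```

Calling the check at the top of every script turns a changed extract into a hard stop rather than a quiet change in the results.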

Part 3

Analyzing Messy Trauma Data

The principles before the models

The Unit of Analysis Problem

Trauma registry data can be analyzed at multiple levels. Conflating them produces wrong answers:

Unit         | Question it answers
Patient      | What predicts individual mortality?
Encounter    | What characterizes each care episode?
Facility     | How does care quality vary by site?
Theater      | How does population injury pattern shift?
Time period  | How do outcomes change as protocols evolve?

The error: analyzing at one level and interpreting the result as if it applied at another.

Example: A patient transferred Role 2 → Role 3 → Role 4 appears in the registry three times if encounter is the unit, but once if patient is the unit. A mortality model trained on encounters counts each transferred patient multiple times. That inflates apparent precision, because correlated rows are treated as independent, and deflates mortality rate estimates, because the denominator grows while each death is still recorded at most once.
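The size of the distortion is easy to simulate. The sketch below uses invented numbers (2000 patients, 10% mortality, one to three encounters each) and assumes that a death is recorded only on the patient's final encounter; the column names are hypothetical.

```r
set.seed(11)
n_pat <- 2000
patients <- data.frame(
  patient_id   = seq_len(n_pat),
  n_encounters = sample(1:3, n_pat, replace = TRUE, prob = c(0.5, 0.3, 0.2)),
  died         = rbinom(n_pat, 1, 0.10)
)

# Expand to one row per encounter (Role 2, Role 3, Role 4, ...)
enc <- patients[rep(seq_len(n_pat), patients$n_encounters), ]
enc$encounter_no <- ave(enc$patient_id, enc$patient_id, FUN = seq_along)
# Assumption: death is attached only to the patient's last encounter
enc$died_enc <- as.integer(enc$died == 1 & enc$encounter_no == enc$n_encounters)

rate_patient   <- mean(patients$died)   # deaths / patients
rate_encounter <- mean(enc$died_enc)    # same deaths / more rows

print(c(n_patients = n_pat, n_encounter_rows = nrow(enc)))
print(round(c(patient_level = rate_patient, encounter_level = rate_encounter), 3))
```

The number of deaths is identical in both tables; only the denominator changes. Collapsing to one row per patient, or modeling the clustering explicitly, restores the patient-level interpretation.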

Time Is the Most Abused Variable

# Simulate: outcome rates changing over time (improving care + worsening case mix)
months <- seq(as.Date("2019-01-01"), as.Date("2024-12-01"), by="month")
n_m <- length(months)

df_time <- tibble(
  month      = months,
  iss_mean   = 24 + 0.12 * seq_len(n_m) + rnorm(n_m, 0, 1.5),   # worsening case mix
  care_effect = -0.008 * seq_len(n_m),                              # improving protocols
  mortality   = plogis(-2.5 + 0.04*iss_mean + care_effect + rnorm(n_m, 0, 0.15))
)

df_time |>
  pivot_longer(c(iss_mean, mortality)) |>
  mutate(name=recode(name, iss_mean="Mean ISS (case mix)",
                     mortality="Observed mortality rate")) |>
  ggplot(aes(month, value, color=name)) +
  geom_line(linewidth=1.1) +
  geom_smooth(method="loess", se=FALSE, linewidth=0.7, linetype=3) +
  facet_wrap(~name, scales="free_y") +
  scale_color_manual(values=c("#e63946","#0891b2")) +
  labs(title="Simultaneous case-mix worsening and care improvement — naive trend is misleading",
       x=NULL, y=NULL) +
  theme_di() + theme(legend.position="none")

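A rising mortality trend may mean care is getting worse, or that cases are getting harder. In this simulation the two forces partly cancel: the worsening case mix pushes mortality up while improving protocols push it down, so the raw trend understates the care improvement. Disentangling the two requires risk-adjusted analysis over time, not raw trend lines.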
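A risk-adjusted trend can be sketched with two logistic regressions. The example below is an assumption-based illustration: it generates patient-level data (200 patients per month, an invented figure) with the same slopes as the simulation above, then compares the time coefficient with and without adjustment for ISS.

```r
set.seed(99)
n_months  <- 72
per_month <- 200                                       # assumed monthly volume
month_idx <- rep(seq_len(n_months), each = per_month)  # 1 = Jan 2019, 72 = Dec 2024

# Case mix worsens over time; ISS rounded and clamped to 1..75
iss  <- pmin(pmax(round(rnorm(length(month_idx), 24 + 0.12 * month_idx, 10)), 1), 75)
# Care improves over time: log-odds of death falls by 0.008 per month at fixed ISS
died <- rbinom(length(month_idx), 1, plogis(-2.5 + 0.04 * iss - 0.008 * month_idx))

crude    <- glm(died ~ month_idx,       family = binomial)  # raw trend
adjusted <- glm(died ~ month_idx + iss, family = binomial)  # trend at fixed severity

trend <- c(crude    = unname(coef(crude)["month_idx"]),
           adjusted = unname(coef(adjusted)["month_idx"]))
print(round(trend, 4))   # monthly change in log-odds of death
```

The adjusted coefficient estimates the change in mortality for patients of equal severity, which is the quantity a protocol evaluation needs. The crude coefficient mixes that effect with the drift in case mix and should be closer to zero here.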

Lecture 1 — Key Takeaways

Why Models Fail

  • AUC without calibration is incomplete and dangerous
  • Dataset shift is the default — validate forward in time
  • Hierarchy, missingness, and metric choice are design problems
  • A model deployed without decision context is not CDS

Audit-Ready Analysis

  • Claim → Data → Code: all three must be traceable
  • Write the analysis contract before the first line of code
  • Hash the data file; record the session
  • Pre-specify sensitivity analyses

Analyzing Registry Data

  • Define the unit of analysis before defining the model
  • Time is a variable, a confounder, and an effect modifier simultaneously
  • Severity is not static — adjust for case mix before comparing trends
  • Proxies are not measures: document what’s actually in each field

The meta-lesson: The discipline required to produce defensible registry analysis is front-loaded. It lives in the design document, the analysis contract, and the explicit assumptions — not in the statistical method.

Coming Up: Lecture 2

Modeling Philosophy: Bayesian Thinking, Beyond p-Values & Hierarchical Models

Posts 03, 06 & 07:

  • Bayesian models for clinicians — priors, posteriors, credible intervals, decision-focused outputs
  • Beyond statistical significance — what p < 0.05 cannot answer, better questions and metrics
  • Hierarchical models — partial pooling, why flat models fail, the minimal multilevel model