Production & Governance: CDS Design, Calibration Drift & Bayesian Audit Trails

Trauma Registry Analytics — Lecture 5 of 5

Jonathan D. Stallings, PhD, MS

Data InDeed | dataindeed.org

2026-01-01

Deployment is where good models go to die. Governance is the discipline that keeps them alive — or retires them before they cause harm.

What You’ll Learn Today

Post 10: Clinical Decision Support

CDS as a system, not a model
Start with the decision
Thresholds encode values
Auditability by design

Post 12: Calibration Under Drift

AUC stable, calibration failing
Two failure modes
SPC-based monitoring
Triggering recalibration

Post 13: Audit-Ready Bayesian Workflows

Prior justification is governance
Posterior provenance
Sensitivity as obligation
Bayesian models in production

Part 1

Clinical Decision Support That Doesn’t Collapse

A system, not a model

Why Most CDS Fails

The model was good. The system failed.

Common failure patterns:

Wrong decision: model predicts 30-day mortality, clinician needs “should I intubate now?”
Wrong time: score delivered at admission, action needed prehospital
Wrong threshold: 50% risk cutoff chosen arbitrarily, not from clinical cost asymmetry
No uncertainty shown: model gives 0.72, clinician treats it as 72% certainty
No feedback loop: model deployed, never retrained, never evaluated post-deployment
No governance: nobody owns it when it drifts

The model is the least important component of a functioning CDS system.

The DoDTR context:

A mortality risk score that fires on admission is useful for:

Triage prioritization under mass casualty
Resource allocation planning
Communication to receiving facility

It is not useful for:

Real-time surgical decision-making
Anesthesia dosing
Prognosis communication to family

Mismatching model output to decision context is the most common CDS failure.

Thresholds Encode Values — Own Them

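library(tidyverse)  # tibble, tidyr, ggplot2; theme_di() is assumed to be the series' custom ggplot2 theme, defined elsewhere
set.seed(42)        # illustrative seed so the simulated curves are reproducible
n <- 1000; prev <- 0.08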
df_thresh <- tibble(
  truth = rbinom(n, 1, prev),
  score = plogis(-2.8 + rnorm(n, 0, 1.2) + 2.2*truth)
)

thresholds <- seq(0.03, 0.5, by=0.01)
thresh_df  <- tibble(t=thresholds) |>
  mutate(
    pred = lapply(t, function(th) as.integer(df_thresh$score > th)),
    sens = sapply(pred, function(p) sum(p==1 & df_thresh$truth==1) / sum(df_thresh$truth==1)),
    spec = sapply(pred, function(p) sum(p==0 & df_thresh$truth==0) / sum(df_thresh$truth==0)),
    ppv  = sapply(pred, function(p) {
      tp <- sum(p==1 & df_thresh$truth==1)
      fp <- sum(p==1 & df_thresh$truth==0)
      ifelse(tp+fp>0, tp/(tp+fp), NA)
    })
  )

thresh_df |>
  pivot_longer(c(sens, spec, ppv)) |>
  mutate(name=recode(name, sens="Sensitivity (miss cost)",
                     spec="Specificity (false alert cost)",
                     ppv="PPV (alert precision)")) |>
  ggplot(aes(t, value, color=name)) +
  geom_line(linewidth=1.1) +
  geom_vline(xintercept=0.10, linetype=2, color="#f59e0b") +
  scale_color_manual(values=c("#e63946","#0891b2","#f59e0b")) +
  scale_x_continuous(labels=scales::percent_format()) +
  scale_y_continuous(labels=scales::percent_format()) +
  annotate("text", x=0.12, y=0.95, label="Clinical\nthreshold", color="#f59e0b", size=3) +
  labs(title="Threshold choice: moving right improves PPV but misses more true deaths",
       x="Decision threshold", y=NULL, color=NULL) +
  theme_di()

Choosing a threshold is an ethical decision about the relative cost of missing a death vs. generating a false alert. This must be made by clinicians and commanders — not set at 0.50 by default.
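One way to turn a stated cost asymmetry into a threshold, rather than inheriting a default, is the standard expected-cost rule: alert when the predicted risk p exceeds C_FP / (C_FP + C_FN). The sketch below is illustrative only. The function name `cost_threshold` and the 9:1 cost ratio are assumptions made for the example, not values taken from the DoDTR or from this lecture.

```r
# Expected-cost rule: alerting is worthwhile when p * cost_fn > (1 - p) * cost_fp,
# which rearranges to p > cost_fp / (cost_fp + cost_fn).
cost_threshold <- function(cost_fp, cost_fn) {
  stopifnot(cost_fp > 0, cost_fn > 0)
  cost_fp / (cost_fp + cost_fn)
}

# Hypothetical value judgment: a missed death is 9 times as costly as a false alert.
th_9to1 <- cost_threshold(cost_fp = 1, cost_fn = 9)    # 0.10

# Equal costs reproduce the 0.50 default, which is rarely what clinicians intend.
th_equal <- cost_threshold(cost_fp = 1, cost_fn = 1)   # 0.50
```

Under the assumed 9:1 ratio the rule lands on the 10% line drawn in the plot; a different ratio, agreed by clinicians and commanders, moves it.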

CDS Must Be Auditable by Design

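# Every CDS prediction must log one row with the fields below.
# Requires the tibble package. The row is returned, so the caller must persist it.
cds_log <- function(patient_id, score, threshold, decision, model_version, data_hash) {
  tibble::tibble(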
    timestamp      = Sys.time(),
    patient_id     = patient_id,
    risk_score     = round(score, 4),
    threshold_used = threshold,
    alert_fired    = score > threshold,
    decision_made  = decision,       # actual clinician action
    model_version  = model_version,  # e.g., "TRISS-DoDTR-v2.3"
    input_hash     = data_hash       # hash of input features
  )
}

Why logging matters:

Enables post-deployment calibration monitoring
Creates the audit trail required for clinical governance
Allows comparison of alert→action→outcome chains
Identifies alert-rate drift (the alert rate changes although the model and threshold have not)

An unlogged CDS system cannot be governed — and should not be trusted.
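As a rough illustration of what the log makes possible, the sketch below computes the monthly alert rate per model version from a simulated log extract. The data frame `log_df`, its month labels, and the Beta-distributed scores are invented for the example; only the field names and the version string follow the logger above.

```r
# Hypothetical log extract: one row per prediction (simulated, not registry data).
set.seed(1)
log_df <- data.frame(
  month         = rep(c("2026-01", "2026-02", "2026-03"), each = 200),
  model_version = "TRISS-DoDTR-v2.3",
  risk_score    = c(rbeta(200, 1, 12), rbeta(200, 1, 12), rbeta(200, 1, 6)),
  threshold     = 0.10
)
log_df$alert_fired <- log_df$risk_score > log_df$threshold

# Alert rate per month and model version. A jump with no version change points
# at the input population or data pipeline rather than at a redeployment.
alert_rate <- aggregate(alert_fired ~ month + model_version, data = log_df, FUN = mean)
print(alert_rate)
```

In this simulation the third month draws scores from a higher-risk distribution, so its alert rate rises even though the model version and threshold are unchanged.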

Part 2

Calibration Under Drift

How models become confident and wrong

AUC Stable, Calibration Failing

# Simulate: model trained on pre-2022 data, deployed through 2025
# ISS distribution shifts upward; care quality improves
n_months <- 36; change_pt <- 18

monthly <- tibble(month=1:n_months) |> mutate(
  iss_mean = 25 + 0.2*month,
  care_eff = -0.01*month,
  # Improving care offsets part of the rising-ISS effect: true mortality still climbs,
  # but more slowly than ISS alone implies (+0.016 vs -0.010 logit per month)
  true_mort = plogis(-3 + 0.08*iss_mean + care_eff),
  # Model was fit before care improved, so it has no care_eff term
  model_pred = plogis(-3 + 0.08*iss_mean),
  calibration_error = model_pred - true_mort
)

p1 <- ggplot(monthly, aes(month, true_mort)) +
  geom_line(aes(color="True mortality"), linewidth=1.1) +
  geom_line(aes(y=model_pred, color="Model prediction"), linewidth=1.1, linetype=2) +
  scale_color_manual(values=c("#0891b2","#e63946")) +
  labs(title="Model overpredicts mortality as care improves", y="Mortality rate", color=NULL) +
  theme_di()

p2 <- ggplot(monthly, aes(month, calibration_error)) +
  geom_col(fill="#e63946", alpha=0.7) +
  geom_hline(yintercept=0, color="#94a3b8") +
  labs(title="Calibration error grows monotonically", y="Predicted − True", x=NULL) +
  theme_di()

cowplot::plot_grid(p1, p2, ncol=2)

The danger: AUROC may remain high (the model still ranks patients correctly by relative risk) while calibration degrades (the absolute risk estimates become systematically wrong). A model predicting 25% mortality when true risk is 12% will trigger too many aggressive interventions.
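The claim that ranking survives while absolute risk drifts can be checked with a small patient-level simulation. This is a sketch under stated assumptions: the cohort is simulated, the -0.36 logit shift mirrors the care effect at month 36 in the simulation above, and `auc` is a hand-written Mann-Whitney estimate rather than a call to any particular package.

```r
set.seed(2026)
n   <- 5000
iss <- rnorm(n, mean = 30, sd = 8)      # simulated injury severity scores

lp_model <- -3 + 0.08 * iss             # linear predictor the model learned
lp_true  <- lp_model - 0.36             # same ranking, lower risk after care improves
y        <- rbinom(n, 1, plogis(lp_true))

# AUC via the Mann-Whitney rank formula (base R, no packages)
auc <- function(score, outcome) {
  r  <- rank(score)
  n1 <- sum(outcome == 1)
  n0 <- sum(outcome == 0)
  (sum(r[outcome == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

auc_model <- auc(plogis(lp_model), y)        # drifted model
auc_true  <- auc(plogis(lp_true),  y)        # what a recalibrated model would score
oe        <- sum(y) / sum(plogis(lp_model))  # observed / expected deaths

c(auc_model = auc_model, auc_true = auc_true, oe = oe)
```

Because the drift is a constant shift on the logit scale, the two scores order patients identically and the AUCs match, while O/E falls below 1: discrimination metrics alone would not reveal the overprediction.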

SPC-Based Calibration Monitoring

# Monthly O/E ratio monitoring with control limits
set.seed(77)
n_mo <- 30
oe_ratios <- c(
  rnorm(18, 1.0, 0.12),   # in control
  rnorm(12, 1.28, 0.10)   # drift begins at month 19
)

center <- mean(oe_ratios[1:18])
sd_ic  <- sd(oe_ratios[1:18])
ucl    <- center + 3*sd_ic
lcl    <- max(0, center - 3*sd_ic)

tibble(month=1:n_mo, oe=oe_ratios) |>
  mutate(flag = oe > ucl | oe < lcl) |>
  ggplot(aes(month, oe)) +
  geom_hline(yintercept=c(lcl, center, ucl),
             color=c("#253554","#94a3b8","#253554"), linewidth=c(0.8,1,0.8),
             linetype=c(2,1,2)) +
  geom_line(color="#0891b2", linewidth=0.8) +
  geom_point(aes(color=flag), size=3) +
  annotate("text", x=28, y=ucl+0.02, label="UCL", color="#64748b", size=3) +
  annotate("text", x=28, y=center+0.02, label="Center", color="#94a3b8", size=3) +
  scale_color_manual(values=c("#0891b2","#e63946")) +
  geom_vline(xintercept=18.5, linetype=3, color="#f59e0b") +
  annotate("text", x=19.5, y=1.45, label="Drift\nbegins", color="#f59e0b", size=3) +
  labs(title="O/E ratio SPC chart: control limits from in-control period, signal = 3σ breach",
       x="Month", y="Observed/Expected mortality ratio") +
  theme_di() + theme(legend.position="none")

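An Observed/Expected (O/E) ratio of 1.0 means the model is calibrated on average: observed deaths equal the sum of predicted risks. It does not guarantee calibration across the whole risk range. O/E > 1 means the model underpredicts; O/E < 1 means it overpredicts. SPC control limits trigger review when a monthly ratio falls outside center ± 3σ, with σ estimated from the in-control period.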
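The monitoring rule can be reduced to two small functions. The following is a minimal sketch, assuming monthly O/E values are already available; the names `oe_ratio` and `spc_flags` and the thirteen O/E values are hypothetical, chosen so the arithmetic is easy to follow.

```r
# O/E for one period: observed deaths over the sum of predicted risks
oe_ratio <- function(died, predicted_risk) sum(died) / sum(predicted_risk)

# Shewhart-style limits estimated from an in-control baseline,
# then applied to the whole series
spc_flags <- function(oe, baseline_idx) {
  center <- mean(oe[baseline_idx])
  sigma  <- sd(oe[baseline_idx])
  ucl    <- center + 3 * sigma
  lcl    <- max(0, center - 3 * sigma)
  list(center = center, lcl = lcl, ucl = ucl, flag = oe > ucl | oe < lcl)
}

# Hypothetical monthly O/E values: 10 in-control months, then 3 new months
oe_series <- c(0.95, 1.05, 1.00, 0.90, 1.10, 1.02, 0.98, 1.04, 0.96, 1.00,
               1.10, 1.25, 0.80)
res <- spc_flags(oe_series, baseline_idx = 1:10)
which(res$flag)   # months 12 and 13 breach a control limit
```

Month 11 (O/E = 1.10) sits inside the limits even though it is above center, which is why a sustained moderate shift can go unflagged by a 3σ rule alone.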

Part 3

Audit-Ready Bayesian Workflows

Transparency as governance

The Three Audit-Ready Requirements

1. Prior Justification

Every prior must be:

Documented before data is seen
Justified from literature or domain knowledge
Subjected to sensitivity analysis
Version-controlled

“We set Beta(3,17) based on DoDTR historical mortality of ~15% in this theater and mechanism stratum.”
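A quick way to document what a prior such as Beta(3,17) actually asserts is to report its mean, its effective sample size, and a central interval. The sketch below uses only base R; the variable names are illustrative.

```r
a <- 3; b <- 17

prior_mean <- a / (a + b)                     # 0.15, matching the ~15% historical rate
prior_ess  <- a + b                           # prior carries the weight of about 20 patients
prior_ci   <- qbeta(c(0.025, 0.975), a, b)    # central 95% prior interval

round(c(mean = prior_mean, ess = prior_ess, lower = prior_ci[1], upper = prior_ci[2]), 3)
```

Recording these three numbers alongside the citation lets a reviewer judge how strongly the prior can pull on a small sample.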

2. Posterior Provenance

Every posterior must record:

Input data version (hash)
Prior specification (version)
MCMC parameters (seed, chains, iterations)
Convergence diagnostics (R-hat, ESS)
Date and environment

Reproducible from a single config file.

3. Sensitivity Analysis

Report at minimum:

Weakly informative prior
Prior from literature
Conservative (skeptical) prior

“Conclusions are consistent across all three prior specifications.”

A Minimal Audit-Ready Bayesian Pipeline

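# ── BAYESIAN ANALYSIS AUDIT RECORD ──────────────────────────────────
# Conjugate update: Beta(a, b) prior + k deaths in n cases → Beta(a + k, b + n - k).
# Sensitivity rows are computed rather than typed in, so they cannot drift from the data.
post_summary <- function(a, b, k = 6, n = 84) {
  a1 <- a + k; b1 <- b + n - k
  sprintf("Beta(%g,%g) → posterior mean %.3f [%.3f, %.3f]",
          a, b, a1 / (a1 + b1), qbeta(0.025, a1, b1), qbeta(0.975, a1, b1))
}

audit <- list(
  analyst       = "J. Stallings",
  date          = "2026-08-15",
  data_file     = "DoDTR_Q2_2026.csv",
  data_hash     = digest::digest("DoDTR_Q2_2026.csv", algo="sha256", file=TRUE),
  model         = "Beta-Binomial: penetrating extremity mortality",
  prior         = "Beta(3, 17)  # ~15% historical DoDTR rate, low certainty",
  prior_source  = "Eastridge et al. 2012; Theater Medical Data Store 2019-2023",
  likelihood    = "Binomial(n=84, k=6)",
  posterior     = "Beta(9, 95)  # 6 deaths, 78 survivors",
  seed          = 20260815L,
  # MCMC diagnostics: recorded when the posterior is sampled rather than closed-form
  r_hat_max     = 1.001,
  ess_bulk_min  = 4200,
  sensitivity   = list(
    flat_prior    = post_summary(1, 1),    # posterior Beta(7, 79),   mean 0.081
    literature    = post_summary(3, 17),   # posterior Beta(9, 95),   mean 0.087
    conservative  = post_summary(8, 42)    # posterior Beta(14, 120), mean 0.104
  ),
  conclusion    = "Robust across all three prior specifications"
)
jsonlite::write_json(audit, "audit_record_20260815.json", pretty=TRUE, auto_unbox=TRUE)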

This JSON file is the audit trail — committed to version control alongside the analysis script and output.
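An audit record is only useful if its entries can be checked against each other. As a minimal sketch, the hypothetical helper `verify_posterior` below confirms that a recorded Beta posterior follows from the recorded prior and binomial data by the conjugate update Beta(a + k, b + n - k).

```r
# Check that a recorded posterior follows from the recorded prior and data.
# prior and claimed are c(a, b) pairs; k = deaths, n = cases.
verify_posterior <- function(prior, k, n, claimed) {
  expected <- c(prior[1] + k, prior[2] + n - k)
  isTRUE(all.equal(expected, claimed))
}

# Values from the audit record: Beta(3, 17) prior, 6 deaths in 84 cases
ok  <- verify_posterior(prior = c(3, 17), k = 6, n = 84, claimed = c(9, 95))

# A mistyped posterior is caught
bad <- verify_posterior(prior = c(3, 17), k = 6, n = 84, claimed = c(9, 101))

c(ok = ok, bad = bad)
```

A check like this could run in continuous integration whenever the audit file changes, so inconsistencies surface before a reviewer has to find them.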

Governance: Bayesian Models in Production

The governance questions that must be answered before deployment:

Who owns this model?
Who approves changes to the threshold?
What triggers a recalibration review?
What criteria trigger model retirement?
Who receives the monthly O/E report?

If any of these are unanswered, the model is not ready to deploy — regardless of its AUC.
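The deployment rule can be enforced mechanically. The sketch below treats the five questions as fields in a governance record and blocks deployment while any is blank; the field names and role titles are hypothetical placeholders, not an actual DoDTR governance structure.

```r
# Hypothetical governance record; each field answers one of the questions above.
governance <- list(
  owner                 = "Trauma analytics lead",
  threshold_approver    = "Clinical governance board",
  recalibration_trigger = "Monthly O/E outside 3-sigma control limits",
  retirement_criteria   = "",   # not yet agreed
  oe_report_recipient   = "Theater trauma director"
)

# Any blank answer blocks deployment, regardless of model performance.
unanswered      <- names(governance)[!nzchar(trimws(unlist(governance)))]
ready_to_deploy <- length(unanswered) == 0

unanswered
```

Here the record fails the gate because the retirement criteria have not been agreed.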

Trauma Registry Analytics — Series Complete

Lecture 1: Registry Foundations

Clinical models fail by design, not by math
Audit triangle: claim → data → code
Unit of analysis and time variable discipline

Lecture 2: Modeling Philosophy

Bayesian: priors are transparency
p-values: answer a question nobody asked
Hierarchical: ICC > 0.05 means flat models are wrong

Lecture 3: Missing Data

Missingness pattern before imputation method
Flat imputation in hierarchical data is wrong
MNAR sensitivity: tipping point is the result

Lecture 4: Prediction & Rare Outcomes

Prediction ≠ causation — different estimands
Accuracy at 4% prevalence is meaningless
Decision curve analysis matches clinical reality
SMOTE: check calibration first, avoid if possible

Lecture 5: Production & Governance

CDS is a system — the model is the least important part
Threshold choice is an ethical decision
SPC-based O/E monitoring for calibration drift
Audit-ready Bayesian: prior justification + posterior provenance + sensitivity

The series thesis: Defensible registry analytics requires discipline at every stage — design, modeling, evaluation, deployment, and governance. Statistical sophistication without operational discipline fails.