---
title: "Trauma Registry Analytics — Master Speaker Notes"
subtitle: "Instructor Teaching Guide · 5-Lecture Series"
author: "Jonathan D. Stallings, PhD, MS"
date: "Summer 2026"
format:
  html:
    toc: true
    toc-depth: 3
    toc-title: "Lecture Navigator"
    number-sections: true
    theme: cosmo
---

> **How to use this guide.** Instructor-facing notes for the 5-lecture Trauma Registry Analytics series. Designed for analysts, fellows, and informatics staff who work with DoDTR or similar registry data. Assumes familiarity with basic regression and hypothesis testing (Applied Statistics series prerequisite).

---

# Lecture 1 — Registry Foundations: Why Models Fail & How to Analyze Honestly

**Posts covered:** 01 (Why clinical models fail), 02 (Audit-ready analysis), 05 (Analyzing registry data)

## Teaching strategy

Start with a provocation: "Tell me about a clinical AI or registry-based tool you've used. Now tell me how it was validated." The silence — or vague answers — sets the stage. Most deployed clinical tools have incomplete validation documentation. This lecture gives the framework for asking the right questions.

The failure mode taxonomy (FM1–FM5) is the spine of this lecture. Return to it repeatedly. When you show an analysis failure, ask: which failure mode does this reflect? Make the audience categorize.

## Key talking points

**Slide: The Myth of Good Performance (AUC vs. Calibration)**
This is arguably the most important chart in the series. Model A discriminates better but systematically overestimates risk at the high end: it says 40% when the true risk is 20%. Model B discriminates slightly worse but is well calibrated. In trauma triage, where a high predicted risk triggers resource mobilization, Model A mobilizes resources for patients whose true risk is half what it reports, which pulls those resources away from patients who need them, even though Model A looks superior on AUC alone.
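
If the audience wants to see why AUC cannot detect this, a minimal sketch follows. It uses a hypothetical ten-patient cohort (all numbers are invented for illustration, not taken from the slide) and doubles every predicted risk. The ranking, and therefore the AUC, is unchanged, while the mean predicted risk moves from 20% to 40% against an observed rate of 20%.

```python
from itertools import product

def auc(scores, labels):
    """Probability that a random event outranks a random non-event (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

# Hypothetical toy cohort: 1 = died, 0 = survived.
labels = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
# Risks whose mean matches the observed rate (calibration-in-the-large only).
calibrated_risk = [0.05, 0.05, 0.10, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.45]
# Same ranking, every risk doubled: an overconfident version of the same model.
inflated_risk = [min(1.0, 2 * p) for p in calibrated_risk]

observed_rate = sum(labels) / len(labels)
mean_calibrated = sum(calibrated_risk) / len(calibrated_risk)
mean_inflated = sum(inflated_risk) / len(inflated_risk)

print("AUC:", auc(calibrated_risk, labels), auc(inflated_risk, labels))
print("observed rate:", observed_rate)
print("mean predicted:", round(mean_calibrated, 3), round(mean_inflated, 3))
```

Any strictly increasing transformation of the risks would give the same result, which is why calibration needs its own check.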

**Slide: Dataset Shift as the Default**
The ISS distribution shift simulation lands well when you frame it as: "Your model was trained on 2018–2021 DoDTR data. It's now 2026. What do you expect to have changed?" Trauma mechanism mix, care protocols, echelon structure, population demographics. The model didn't change — the world did.

**Slide: The Audit Triangle**
The claim-data-code triangle is a mental model the audience should internalize. Every time they present an analysis, they should be able to answer: "What exactly am I claiming? Precisely which data version produced this? What code sequence generated this result?" If any leg of the triangle is weak, the analysis is not defensible.

**Slide: Time as the Most Abused Variable**
The dual-panel figure (rising ISS + falling mortality) is a teaching moment: each trend, shown alone, would support a different conclusion. Together, they reveal case-mix confounding: the crude mortality trend cannot be interpreted until severity is accounted for. Never show a mortality trend without a case-mix trend for the same period.

## Timing
- Model failure modes (1–5): 20 min
- Audit-ready analysis: 20 min
- Analyzing registry data: 15 min

## Discussion prompt
"You receive a request to show whether trauma mortality rates have improved over the past 5 years at your facility. Before running any analysis, what are the first three questions you need to answer?"

---

# Lecture 2 — Modeling Philosophy: Bayesian Thinking, Beyond p-Values & Hierarchical Models

**Posts covered:** 03 (Bayesian models), 06 (Statistical significance), 07 (Hierarchical models)

## Teaching strategy

This lecture challenges the default statistical toolkit for registry analysis. The default is: run a logistic regression, check p-values, report significance. The lecture argues: (1) Bayesian models give you more honest uncertainty quantification; (2) p-values answer the wrong question for clinical decision-making; (3) ignoring hierarchy in registry data produces anti-conservative inference.

The lme4 random intercept model at the end is the concrete deliverable. Even if the audience doesn't leave able to run it, they should be able to recognize when it's needed and ask for it.

## Key talking points

**Slide: The Bayesian Update — Beta-Binomial**
Walk through the Beta-Binomial update manually with numbers from the slide: "We started with a prior that says ~15% mortality (Beta(3,17)). We observed 14 deaths in 60 patients, so we add 14 deaths and 46 survivors to the prior counts. The posterior updates to Beta(17,63), which has a mean of about 21%. The full posterior distribution tells us not just the best estimate but the full range of plausible values, each weighted by its posterior probability."
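
For a live demo, the update can be reproduced in a few lines. This is a sketch using only the Python standard library; the prior and data are the slide's numbers, while the Monte Carlo credible interval (seed, number of draws) is an illustrative choice rather than part of the slide.

```python
import random

prior_a, prior_b = 3, 17          # prior from the slide: mean 3/20 = 15%
deaths, n = 14, 60                # observed data from the slide

post_a = prior_a + deaths         # prior deaths + observed deaths
post_b = prior_b + (n - deaths)   # prior survivors + observed survivors
post_mean = post_a / (post_a + post_b)

# Approximate 95% credible interval by sampling the posterior.
# The seed and draw count are arbitrary choices for reproducibility.
rng = random.Random(2026)
draws = sorted(rng.betavariate(post_a, post_b) for _ in range(20000))
ci_low = draws[int(0.025 * len(draws))]
ci_high = draws[int(0.975 * len(draws)) - 1]

print(f"posterior: Beta({post_a},{post_b}), mean = {post_mean:.4f}")
print(f"approx. 95% credible interval: {ci_low:.3f} to {ci_high:.3f}")
```

The interval is the part worth dwelling on: it is the "range of plausible values" from the talking point, expressed as numbers.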

**Slide: p-Value Sample Size Trap**
The simulation showing a true effect of d = 0.3 becoming "significant" only at large n is the clearest demonstration that significance depends on sample size as well as on effect magnitude. With the entire DoDTR (n > 50,000), virtually any nonzero effect, however small, will be significant. That doesn't make it meaningful.
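
The same point can be shown without a simulation. The sketch below assumes a two-sample z-test with unit variance and an observed standardized difference exactly equal to d; these simplifications are assumptions for illustration, not the slide's simulation design.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p(d, n_per_group):
    """Two-sided p-value for a two-sample z-test when the observed
    standardized mean difference equals d (unit variance assumed)."""
    z = d * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Same effect size, growing sample: the p-value falls as n rises.
for n in (20, 50, 100, 200):
    print(f"d = 0.30, n per group = {n:>3}: p = {two_sided_p(0.3, n):.4f}")

# A negligible effect at registry scale (25,000 per group = 50,000 total).
print(f"d = 0.02, n per group = 25000: p = {two_sided_p(0.02, 25000):.4f}")
```

Under these assumptions d = 0.3 crosses p < 0.05 somewhere between 50 and 100 patients per group, and an effect of d = 0.02 is "significant" at registry scale.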

**Slide: ICC and Facility Mortality Rates**
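Show the facility bar chart before discussing what ICC means. Ask the audience: "Does the variation in mortality rates across facilities reflect real differences between facilities, or chance?" The ICC quantifies the answer: it is the share of total outcome variance that lies between facilities rather than between patients within a facility. If ICC = 0.10, 10% of the total variance in outcomes is attributable to facility membership. That's non-trivial and should not be ignored. A high ICC shows that facilities differ; it does not by itself say whether the cause is care quality or unmeasured case mix.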

**Slide: Partial Pooling — The Shrinkage Diagram**
The key insight: facilities with small n get their estimate shrunk toward the grand mean. This is appropriate: a facility with 8 patients and 2 deaths has a raw mortality rate of 25%, but we have very little confidence in that estimate. Partial pooling automatically weights by information content.
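
A rough way to show the mechanics is the sketch below. It uses beta-binomial shrinkage as a stand-in for the random-intercept model (a simplification, not what lme4 computes), and the grand mean and prior strength are hypothetical values chosen for illustration.

```python
def shrunk_rate(deaths, n, grand_mean, prior_strength):
    """Beta-binomial shrinkage: the prior behaves like `prior_strength`
    pseudo-patients who die at the grand-mean rate."""
    return (deaths + grand_mean * prior_strength) / (n + prior_strength)

GRAND_MEAN = 0.10      # hypothetical registry-wide mortality
PRIOR_STRENGTH = 40    # hypothetical; a fitted model estimates this from the data

# Two facilities with the same raw rate (25%) but very different volumes.
small = shrunk_rate(deaths=2, n=8, grand_mean=GRAND_MEAN, prior_strength=PRIOR_STRENGTH)
large = shrunk_rate(deaths=200, n=800, grand_mean=GRAND_MEAN, prior_strength=PRIOR_STRENGTH)

print(f"small facility: raw 0.250 -> shrunk {small:.3f}")
print(f"large facility: raw 0.250 -> shrunk {large:.3f}")
```

With these assumed values the 8-patient facility is pulled to 12.5%, close to the grand mean, while the 800-patient facility barely moves.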

## Timing
- Bayesian thinking: 20 min
- Beyond p-values: 20 min
- Hierarchical models: 20 min
- lme4 demo: 10 min (can be skipped for non-technical audiences)

## Common questions
- *"Do I need special software for Bayesian analysis?"* Stan/brms for full Bayesian; lme4 for hierarchical models (frequentist, but captures the key structure). In most registry settings, lme4 is the right starting point.
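- *"What ICC threshold requires a hierarchical model?"* ICC > 0.05 is a commonly cited rule of thumb, not a hard cutoff. Even a small ICC can matter when cluster size is large, because the variance inflation grows with the number of patients per facility.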
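
The cluster-size point can be made concrete with the standard design-effect approximation, 1 + (m − 1) × ICC, which assumes equal cluster sizes of m. The ICC values and cluster sizes below are hypothetical.

```python
def design_effect(icc, cluster_size):
    """Variance inflation from clustering, assuming equal cluster sizes."""
    return 1 + (cluster_size - 1) * icc

def effective_n(total_n, icc, cluster_size):
    """Number of independent observations the clustered sample is worth."""
    return total_n / design_effect(icc, cluster_size)

# Hypothetical scenarios: a "small" ICC with large facilities vs.
# a larger ICC with small facilities.
for icc, m in ((0.02, 500), (0.10, 20)):
    deff = design_effect(icc, m)
    print(f"ICC = {icc:.2f}, cluster size = {m}: "
          f"design effect = {deff:.2f}, "
          f"effective n of 10,000 = {effective_n(10_000, icc, m):.0f}")
```

An ICC of 0.02 with 500 patients per facility inflates variance more than an ICC of 0.10 with 20 patients per facility, which is why a fixed ICC cutoff can mislead.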

---

# Lecture 3 — Missing Data Deep Dive

**Posts covered:** 04 (Missing data strategies), 14 (Missing data in hierarchical models), 15 (MNAR sensitivity analysis)

## Teaching strategy

This is the most technically dense lecture in the series. Calibrate depth to audience. For clinical audiences: focus on mechanisms and the implications of choices. For analyst audiences: go deep on the hierarchical missingness section and the MNAR delta curve.

Open by asking the audience to estimate the missing rate for key variables in their registry data. Most will underestimate. Then show what typical missingness looks like in a real-world trauma registry (30%+ for some lab values, 15-20% for physiologic variables in high-tempo settings). Missingness is not a small problem to clean up — it's a structural feature to characterize.

## Key talking points

**Slide: Missingness by ISS Group**
The horizontal bar chart makes the pattern immediate: missingness is higher in more severely injured patients. Ask the audience: "If we do complete-case analysis (exclude anyone with missing data), who are we most likely to exclude?" The sickest patients. The model will be trained on a healthier-than-average subset and will systematically underestimate risk for severe patients.
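
The arithmetic behind that answer can be sketched with made-up strata. All counts, mortality rates, and missingness fractions below are hypothetical, and the sketch assumes missingness is unrelated to death within each ISS group.

```python
# Hypothetical strata: (label, patients, mortality, fraction with missing data)
strata = [
    ("ISS 1-15", 600, 0.02, 0.10),
    ("ISS 16-24", 300, 0.08, 0.25),
    ("ISS 25+", 100, 0.30, 0.50),
]

full_n = sum(n for _, n, _, _ in strata)
full_deaths = sum(n * mort for _, n, mort, _ in strata)

# Complete-case analysis keeps only patients with no missing values.
cc_n = sum(n * (1 - miss) for _, n, _, miss in strata)
cc_deaths = sum(n * (1 - miss) * mort for _, n, mort, miss in strata)

full_rate = full_deaths / full_n
cc_rate = cc_deaths / cc_n
severe_share_full = strata[2][1] / full_n
severe_share_cc = strata[2][1] * (1 - strata[2][3]) / cc_n

print(f"mortality, full cohort:     {full_rate:.3f}")
print(f"mortality, complete cases:  {cc_rate:.3f}")
print(f"ISS 25+ share: {severe_share_full:.3f} -> {severe_share_cc:.3f}")
```

Even with missingness independent of outcome inside each stratum, the complete-case cohort has lower crude mortality and fewer severe patients than the registry it came from.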

**Slide: Pattern Heatmap**
The heatmap shows which variables are missing together. Clustered missingness (variables that are always missing together) suggests a common mechanism — often a care echelon or documentation system that doesn't capture certain variable types. This is important for the imputation model: you can use the missingness pattern itself as a predictor.

**Slide: Mean Imputation's Tail Destruction**
The density plots are the killer visualization. Mean imputation creates an artificial spike at the mean, destroys the right tail, and understates the variance, so the imputed variable no longer resembles its true distribution and its relationship with the outcome is distorted. This drives home why multiple imputation (MI) is generally preferable.
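
A numeric version of the same picture, as a sketch: the lactate values below are hypothetical, and the missing values are deliberately placed in the high tail to mirror sicker patients being under-measured.

```python
from statistics import mean, pstdev

# Hypothetical lactate values (mmol/L). None marks a missing measurement.
true_values = [1.0, 1.2, 1.5, 1.8, 2.0, 2.4, 3.0, 4.5, 6.0, 8.0]
observed = [1.0, 1.2, 1.5, 1.8, 2.0, 2.4, 3.0, None, None, None]

# Mean imputation: every gap gets the mean of the observed values.
present = [v for v in observed if v is not None]
fill = mean(present)
imputed = [fill if v is None else v for v in observed]

print(f"fill value:        {fill:.2f}")
print(f"SD, true values:   {pstdev(true_values):.2f}")
print(f"SD, mean-imputed:  {pstdev(imputed):.2f}")
print(f"max, true vs imputed: {max(true_values)} vs {max(imputed)}")
```

Three identical fill values form the spike, the maximum drops from 8.0 to 3.0, and the standard deviation shrinks: the three failures named on the slide.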

**Slide: MNAR Delta Curve**
The delta curve should be used as a *communication* tool, not just an analysis tool. Walk the audience through: "If every missing lab value is truly 0.2 units higher than its imputed value (delta = 0.2), the OR changes from 1.4 to 1.6. The conclusion remains qualitatively the same. At delta = 1.0, the OR exceeds the clinical significance threshold. Is delta = 1.0 plausible in this setting?" This anchors the sensitivity analysis in clinical judgment.
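
The mechanics of a delta adjustment can be sketched as follows. This is a deliberately reduced illustration: the estimate here is a simple mean rather than an odds ratio, the values are hypothetical, and a real analysis would refit the outcome model on each delta-shifted multiply imputed dataset.

```python
from statistics import mean

observed = [1.0, 1.2, 1.5, 1.8, 2.0, 2.4, 3.0]   # hypothetical measured values
mar_imputed = [1.9, 2.1, 2.3]                     # hypothetical imputations under MAR

def estimate_under_delta(delta):
    """Shift every imputed value by delta, then recompute the estimate.
    The estimate is the mean here; it stands in for the model-based OR."""
    shifted = [v + delta for v in mar_imputed]
    return mean(observed + shifted)

# One point per delta gives the delta curve.
delta_curve = {d: round(estimate_under_delta(d), 3) for d in (0.0, 0.2, 0.5, 1.0)}
for delta, estimate in delta_curve.items():
    print(f"delta = {delta:.1f}: estimate = {estimate}")
```

Only the imputed values move; observed values are never altered. The clinical question is which deltas on the x-axis are plausible.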

## Timing
- Missingness mechanisms and detection: 20 min
- Imputation methods: 15 min
- Hierarchical missingness: 15 min
- MNAR sensitivity analysis: 10 min

---

# Lecture 4 — Prediction & Rare Outcomes

**Posts covered:** 08 (Evaluating rare outcomes), 09 (Evaluating models for rare outcomes), 11 (Prediction ≠ causation)

## Teaching strategy

Trauma mortality is often a rare outcome in registry data (3–10% depending on the cohort). Rare outcome modeling has specific failure modes that standard model evaluation hides. This lecture addresses the accuracy trap, the PR curve, and decision curve analysis — three tools that give a more complete picture.

The prediction vs. causation distinction closes the lecture because it's the most consequential misconception in clinical AI: people use risk scores as if they implied causation, and deploy interventions on high-risk patients based on risk scores that were never validated as causal mechanisms.

## Key talking points

**Slide: The Accuracy Trap — 4%**
If 4% of patients die, a model that predicts "no death" for everyone has 96% accuracy. Show this with numbers: in a registry of 10,000 patients with 400 deaths, a null model gets 9,600 right. Judged purely on accuracy, a model that never fires looks excellent, and a useful model that flags some survivors can look worse. Report sensitivity, specificity, PPV, and NPV at a minimum.
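
The slide's numbers can be run through a small confusion-matrix helper. This is a sketch; the function name and dictionary layout are illustrative choices.

```python
def metrics(tp, fp, fn, tn):
    """Basic classification metrics; undefined ratios are returned as NaN."""
    nan = float("nan")
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn) if tp + fn else nan,
        "specificity": tn / (tn + fp) if tn + fp else nan,
        "ppv": tp / (tp + fp) if tp + fp else nan,
        "npv": tn / (tn + fn) if tn + fn else nan,
    }

# Null model from the slide: predicts "no death" for all 10,000 patients,
# so all 400 deaths are false negatives and nothing is ever flagged.
null_model = metrics(tp=0, fp=0, fn=400, tn=9600)
for name, value in null_model.items():
    print(f"{name:<12} {value:.3f}")
```

Accuracy is 0.96, sensitivity is 0, and PPV is undefined because the model never flags anyone.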

**Slide: PR Curve vs. ROC**
The PR curve focuses on the minority class (events). In rare outcome settings, the ROC can look good even when the model has very poor precision for the rare event. The PR curve shows: of all the patients flagged by my model, what fraction actually had the event? For a clinical alert system, this is what matters — alert fatigue is driven by PPV, not AUC.
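
To show how prevalence drives precision, a short sketch using Bayes' rule follows. The sensitivity and specificity values are hypothetical; the 4% prevalence echoes the accuracy-trap slide.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value (precision) from Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Same hypothetical test characteristics at different event rates.
for prevalence in (0.50, 0.10, 0.04):
    print(f"prevalence = {prevalence:.2f}: "
          f"PPV = {ppv(0.80, 0.90, prevalence):.3f}")
```

At 4% prevalence, a model with 80% sensitivity and 90% specificity is right about one flagged patient in four; the other three alerts are false alarms.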

**Slide: Decision Curve Analysis**
Frame DCA as answering: "Should I use this model at this threshold?" At a threshold of 10%, every patient with a predicted risk of 10% or higher is flagged, and one false positive is weighed as one-ninth of a true positive (0.10 / 0.90). Net benefit is true positives minus weighted false positives, per patient, and it is compared against two baselines: treat all and treat none. Show where the model curve lies above both baselines; that is the region where the model adds value.
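
The quantity plotted on a decision curve is net benefit, which can be sketched with the standard formula: TP/n − (FP/n) × pt/(1 − pt) at threshold pt. The model's true-positive and false-positive counts below are hypothetical; the cohort size and death count reuse the 10,000 / 400 example.

```python
def net_benefit(tp, fp, n, threshold):
    """Net benefit of acting on the model's flags at a given risk threshold."""
    weight = threshold / (1 - threshold)   # harm of a false positive relative to a true positive
    return tp / n - (fp / n) * weight

def net_benefit_treat_all(prevalence, threshold):
    """Net benefit of flagging every patient regardless of the model."""
    weight = threshold / (1 - threshold)
    return prevalence - (1 - prevalence) * weight

N, DEATHS, THRESHOLD = 10_000, 400, 0.10

# Hypothetical model performance at the 10% threshold.
model_nb = net_benefit(tp=240, fp=900, n=N, threshold=THRESHOLD)
all_nb = net_benefit_treat_all(DEATHS / N, THRESHOLD)
none_nb = 0.0   # flagging no one has zero net benefit by definition

print(f"model:      {model_nb:.4f}")
print(f"treat all:  {all_nb:.4f}")
print(f"treat none: {none_nb:.4f}")
```

Repeating the calculation over a range of thresholds produces the curve. The model adds value only where its net benefit exceeds both baselines.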

**Slide: Prediction ≠ Causation — The Confounded U**
The DAG visualization is the critical teaching moment. Walk through: "ISS predicts mortality. But ISS also predicts the level of care received. And the level of care also predicts mortality. If you use ISS to predict mortality without accounting for this structure, the ISS coefficient captures the total association, which includes the part mediated through care level." A prediction model that includes care variables may actually mask the treatment effect, so its coefficients should not be read as causal effects.

## Timing
- Rare outcomes: accuracy trap, AUC limitations: 20 min
- PR curve, decision curve: 15 min
- Prediction vs. causation, DAG: 20 min

---

# Lecture 5 — Production, Governance & Monitoring

**Posts covered:** 10 (Clinical decision support), 12 (SPC for registry data), 13 (Governance lifecycle)

## Teaching strategy

The series ends where clinical AI must go: from model to deployment to monitoring. Most clinical AI discourse ends with "we built a model with AUC 0.87." This lecture says: that's where the work begins. Deployment requires calibration monitoring, alert threshold governance, SPC for process metrics, and a lifecycle that includes sunsetting.

The O/E ratio SPC chart is the most operational visualization in the series. It answers the question clinical leadership actually needs answered: "Is our current performance consistent with expected performance, and when does a signal trigger action?"

## Key talking points

**Slide: Threshold Sensitivity**
The threshold sensitivity curve shows: as you lower the alert threshold, sensitivity rises and specificity falls. There is no objectively correct threshold — it depends on the clinical context, the cost of false positives (alert fatigue, unnecessary intervention), and the cost of false negatives (missed high-risk patients). Pre-specify the threshold and justify it in the deployment documentation.

**Slide: AUC-Stable, Calibration-Failing**
This dual visualization makes the case for monitoring calibration separately from AUC. The model's ranking of patients holds even as its absolute risk estimates drift. A monitoring system that watches only AUC will miss this failure. Both metrics must be monitored, each with its own alert threshold.

**Slide: SPC — O/E Ratio Chart**
The O/E ratio is the standardized performance metric: observed deaths divided by model-expected deaths. A stable registry should show O/E fluctuating around 1.0. Control limits (set 2–3 standard deviations either side of 1.0) define what "normal" looks like. A signal is an O/E sustained above the upper control limit for 3+ consecutive months; that is when an investigation is triggered.
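
One way to compute the chart's ingredients is sketched below. It assumes the expected count is the sum of the model's predicted risks and that, if the model is correct and outcomes are independent, the death count has variance equal to the sum of p(1 − p). The function names, the monthly cohort, and the example series are hypothetical.

```python
from math import sqrt

def oe_with_limits(observed_deaths, predicted_risks, k=3.0):
    """Return (O/E, lower limit, upper limit) for one reporting period."""
    expected = sum(predicted_risks)
    # SD of the death count under the model, assuming independent outcomes.
    sd = sqrt(sum(p * (1 - p) for p in predicted_risks))
    return observed_deaths / expected, 1 - k * sd / expected, 1 + k * sd / expected

def sustained_signal(oe_series, upper_limits, run_length=3):
    """True if O/E exceeds its upper limit for `run_length` consecutive periods."""
    run = 0
    for oe_value, limit in zip(oe_series, upper_limits):
        run = run + 1 if oe_value > limit else 0
        if run >= run_length:
            return True
    return False

# Hypothetical month: 400 patients, each with 5% predicted risk, 26 deaths.
risks = [0.05] * 400
oe, lcl, ucl = oe_with_limits(26, risks)
print(f"O/E = {oe:.2f}, control limits = ({lcl:.2f}, {ucl:.2f})")

# Hypothetical series against a fixed upper limit of 1.3.
print(sustained_signal([1.1, 1.4, 1.5, 1.6], [1.3] * 4))
```

Limits computed this way widen when monthly volume is low, so low-volume facilities need either longer reporting periods or wider tolerance before a signal is declared.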

**Slide: Governance Lifecycle**
Walk through all five phases: proposal → validation → deployment → monitoring → retirement. The retirement phase is often omitted from governance frameworks. Models without an explicit retirement policy don't go away; they persist in clinical workflows where no one remembers to update them. Sunset criteria and transition protocols are governance requirements, not optional extras.

## Timing
- Threshold sensitivity and calibration monitoring: 20 min
- SPC and O/E ratio chart: 20 min
- Governance lifecycle: 15 min

---

# Series-Level Discussion Questions

1. A registry analyst presents a logistic regression model for 30-day trauma mortality with AUC 0.83 and Hosmer-Lemeshow p = 0.22. What additional information do you need before endorsing its use in clinical decision support?

2. Your facility's O/E mortality ratio rises to 1.35 in Q3. The upper control limit is 1.28. What do you do? What are the three most likely explanations?

3. A trauma surgeon wants to use the registry risk score to help decide who gets aggressive resuscitation in mass casualty events. What are the ethical and statistical reasons this use case requires additional validation beyond the original derivation study?

4. Missing lab data in your registry is 18% for lactate, 8% for INR, 22% for base deficit. Before running your main analysis, what must you document about this missingness?

5. The hierarchy in your registry shows ICC = 0.14 for facility effects on 30-day mortality. You run a flat logistic model anyway because the multilevel model is harder to explain. What is the consequence for the inference from the flat model?