---
title: "Ethics of Clinical AI — Master Speaker Notes"
subtitle: "Instructor Teaching Guide · 4-Lecture Series"
author: "Jonathan D. Stallings, PhD, MS"
date: "Summer 2026"
format:
  html:
    toc: true
    toc-depth: 3
    toc-title: "Lecture Navigator"
    number-sections: true
    theme: cosmo
---

> **How to use this guide.** Instructor-facing notes for the 4-lecture Ethics of Clinical AI series. Audience: clinical faculty, data scientists, AI governance teams, and military medical leadership. This series is deliberately provocative — it challenges comfortable assumptions about interpretability, human oversight, and data quality. The instructor's job is to hold that discomfort productively and redirect it toward institutional action.

---

# Lecture 1 — Opacity, Accountability & Ethical Failure Modes

**Posts covered:** 01 (When black boxes are defensible), 02 (Accountability without interpretability), 03 (Ethical failure modes in registry data)

## Teaching strategy

Open with the question: "How many of you would refuse to use a clinical tool unless you could explain its internal mechanism?" Most hands go up. Then: "How many of you understand the biochemistry behind the troponin assay?" Fewer hands. "The pharmacokinetics of your most-used drug?" Even fewer.

The point isn't a gotcha. It's to surface the inconsistency in the demand for interpretability. We routinely use clinical tools we don't mechanistically understand. The ethical standard is validation, calibration, and governance, not glass-box transparency.

The ethical failure modes section (Part 3) is the most operationally concrete part of the lecture. Each failure mode has a specific detection strategy and a specific governance intervention. Don't let it feel like a theoretical list — make the audience identify which failure modes they've seen in their own practice.

## Key talking points

**Slide: The Standard Critique of Black Box Models**
Walk through each of the four embedded assumptions:

1. Interpretability produces understanding → No: SHAP values are local approximations of a complex function. They are not mechanism explanations.
2. Understanding is required for use → No: We use lab tests without understanding assay chemistry.
3. Clinicians routinely interrogate tools at mechanism level → No: Most clinicians take the number and use it.
4. Post-hoc explanation = trustworthiness → No: Explanation and justification are different (covered later).

**Slide: Interpretability-Performance Tradeoff**
The scatter plot is the strongest argument in this section. If the interpretable model has meaningfully lower AUC in a mortality prediction task, the demand for interpretability has clinical costs. The ethics of that tradeoff deserves explicit discussion — not a default assumption that interpretability wins.

**Slide: Explanation vs. Justification**
This is the most conceptually important slide in the lecture. SHAP values explain: they tell you which features drove this prediction. Validation evidence justifies: it tells you the model's predictions are reliable enough to support this clinical decision in this population. These are not the same thing. A model can have beautiful SHAP explanations and terrible calibration in the deployed population.

**Slide: The Accountability Stack**
Ask the audience: "When a patient is harmed by a model-assisted decision, who is accountable?" Walk through each layer of the stack. The answer: the institution that deployed the model and the governance body that authorized it bear the primary accountability — not the model, not the developer, and only partially the individual clinician. This is uncomfortable for institutions that would prefer to locate accountability at the model or the clinician level.

**Slide: Temporal Bias — The Key Visualization**
The scatter plot of admission Injury Severity Score (ISS) vs. discharge ISS is subtle but important. Walk through the argument: "We train a model to predict 30-day mortality at the time of admission. If we include discharge-revised ISS as a predictor, the model uses information that wasn't available at admission. At deployment, admission-time ISS is all we have. The model looks great in training and fails silently at deployment."
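
The leakage check in this argument can be sketched in a few lines. This is a minimal illustration, not a registry implementation: the feature names, the `AVAILABLE_AT` catalog, and the three stage labels are hypothetical, chosen only to mirror the admission-versus-discharge ISS example.

```python
# Hypothetical ordering of care stages, earliest first.
STAGE_ORDER = {"admission": 0, "inpatient": 1, "discharge": 2}

# Hypothetical catalog: the stage at which each candidate predictor is first known.
AVAILABLE_AT = {
    "age": "admission",
    "admission_iss": "admission",
    "first_sbp": "admission",
    "discharge_iss": "discharge",  # revised after full workup; not known at admission
    "icu_days": "discharge",
}


def leaky_features(features, prediction_stage):
    """Return features that only become known after the prediction stage."""
    cutoff = STAGE_ORDER[prediction_stage]
    return [f for f in features if STAGE_ORDER[AVAILABLE_AT[f]] > cutoff]


candidate = ["age", "admission_iss", "discharge_iss", "icu_days"]
leaks = leaky_features(candidate, "admission")
print("Not available at admission:", leaks)
```

A check like this only works if someone maintains the availability catalog, which is itself a governance task rather than a modeling task.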

## Timing
- Opacity and interpretability: 20 min
- Accountability: 20 min
- Ethical failure modes: 20 min

## Discussion prompts
"A vendor presents a clinical AI model with beautiful SHAP visualizations showing exactly which features drove each prediction. They have no prospective validation data from your patient population. Which matters more — the SHAP explanations or the validation data? Why?"

"A Black patient receives an alert from a sepsis model. The alert fires at the same threshold for all patients. The model was trained on a dataset where Black patients had 22% more missing labs than white patients. Name the failure mode, name who is accountable for it, and name what governance intervention would catch it."

---

# Lecture 2 — Prediction, Human Oversight & The Ethics of Data Exclusion

**Posts covered:** 04 (Prediction vs. responsibility), 05 (Human-in-the-loop), 06 (Ethics of excluding messy patients)

## Teaching strategy

This lecture is about three places where ethics shows up in the workflow that clinicians don't typically recognize as ethical decision points: when they act on a risk score, when they nominally "review" an AI recommendation, and when analysts decide who to include in a model.

The tone here should be empathetic but direct. None of the failures described are malicious — they're structural. The clinician rubber-stamping the algorithm is doing so under cognitive load in a system that wasn't designed to support meaningful review. The analyst excluding incomplete patients is following methodological convention without recognizing the population-level implications. The goal is to make these structural failures visible so they can be addressed at the governance level.

## Key talking points

**Slide: How Risk Scores Become Moral Shortcuts**
The cognitive load simulation is the key visualization. Spend time on this. Ask: "In your practice, when are you most cognitively loaded?" Mass casualty events, overnight shift, post-call, middle of a code. Those are exactly the conditions when algorithm reliance rises and override rates fall. The ethical risk is highest when the stakes are highest.

**Slide: Historical Data Are Not Moral Ground Truth**
This is the most philosophically challenging slide in the lecture. Walk through the laundering argument carefully:

- Historical trauma data reflects historical care decisions
- Historical care decisions reflected system-level resource allocation
- Resource allocation was not always equitable
- A model trained on this data learns to reproduce historical patterns — including historical inequities
- Using "the model learned from data" as justification for current decisions endorses everything embedded in that history

This argument should be met with some discomfort. Hold the discomfort — don't rush to resolve it. The appropriate response is: scrutinize the training data before training the model.

**Slide: Automation Bias — Override Rate by Interface Design**
The key finding: adding visual explanation (bar charts, SHAP) does not reduce automation bias. Showing uncertainty and calibration information helps modestly. Ask the audience: "What does this mean for how we should design clinical AI interfaces?" The interface is a governance decision — it's not just UX.

**Slide: HITL as Liability Transfer**
This slide requires careful framing. It's not accusing any specific institution — it's describing a structural pattern. "The institution designed a system in which meaningful review was unlikely, then used the formal override mechanism to claim oversight was present." This is a governance failure masquerading as human oversight.

**Slide: Complete-Case ISS Distribution**
The density shift plot is the central visualization for the data exclusion section. Ask the audience: "Which group of patients is most affected by this exclusion? What does this mean for who the model actually serves? And when the model is deployed in a mass casualty event with high documentation incompleteness, what will happen to its performance?"
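
If the audience wants to see the mechanism behind the density shift, a small simulation can help. The sketch below assumes, purely for illustration, that ISS is uniform on 1 to 75 and that the probability of an incomplete record rises linearly with ISS. Neither assumption comes from registry data, and the function name `simulate_patient` is hypothetical.

```python
import random
import statistics

random.seed(42)  # fixed seed so the illustration is reproducible


def simulate_patient():
    """Draw a hypothetical ISS and decide whether the record is complete."""
    iss = random.randint(1, 75)
    # Assumption: more severe injuries come with less complete documentation.
    p_missing = 0.05 + 0.6 * (iss / 75)
    complete = random.random() > p_missing
    return iss, complete


patients = [simulate_patient() for _ in range(5000)]
all_iss = [iss for iss, _ in patients]
complete_iss = [iss for iss, complete in patients if complete]

print(f"All patients:   n={len(all_iss)}, mean ISS={statistics.mean(all_iss):.1f}")
print(f"Complete cases: n={len(complete_iss)}, mean ISS={statistics.mean(complete_iss):.1f}")
```

Under these assumptions the complete-case cohort is both smaller and less severely injured than the population the model will meet at deployment, which is the point of the slide.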

## Timing
- Prediction vs. responsibility: 20 min
- Human-in-the-loop: 20 min
- Data exclusion ethics: 15 min

---

# Lecture 3 — Fairness, Performance Monitoring & The Ethics of Automation

**Posts covered:** 07 (Missingness as fairness), 08 (AI performance monitoring), 09 (CPG compliance automation)

## Teaching strategy

Lecture 3 is the most infrastructure-oriented of the four. The key message: fairness, monitoring, and responsible automation are all governance architecture problems — not algorithm problems. You cannot add fairness constraints to a model and solve a structural data collection inequity. You cannot label a model "monitored" without specifying who monitors, how often, what thresholds trigger action, and what happens when an alert fires.

The ethics-of-speed framing for CPG compliance automation is often counterintuitive: most audiences expect "automation" to raise ethical concerns about accuracy or job displacement. The actual ethical concern is the opposite — selective slowness, where monitoring is fast for outcomes that protect institutions and slow for care processes that could implicate them.

## Key talking points

**Slide: Differential Missingness Bar Chart**
The side-by-side bars showing missingness by group are the visual anchor for the fairness section. Ask: "If Group B has 34% missing labs vs. Group A's 9%, and we train a model on the complete cases from both groups, which group's model performance is more uncertain?" Group B — by a large margin. Any fairness constraint applied to a model trained on this data is operating on an unstable foundation.
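
A per-group missingness audit is simple enough to show live. The sketch below is illustrative only: the `lactate` field, the record layout, and the function name are hypothetical, and the toy data are constructed to reproduce the slide's 9% and 34% figures rather than drawn from any dataset.

```python
from collections import defaultdict


def missingness_by_group(records, field):
    """Fraction of records in each group where `field` is None or absent."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for rec in records:
        total[rec["group"]] += 1
        if rec.get(field) is None:
            missing[rec["group"]] += 1
    return {g: missing[g] / total[g] for g in total}


# Toy data built to mirror the slide: 9% missing in Group A, 34% in Group B.
records = (
    [{"group": "A", "lactate": 2.1}] * 91
    + [{"group": "A", "lactate": None}] * 9
    + [{"group": "B", "lactate": 2.1}] * 66
    + [{"group": "B", "lactate": None}] * 34
)

rates = missingness_by_group(records, "lactate")
for group, rate in sorted(rates.items()):
    print(f"Group {group}: {rate:.0%} missing, {1 - rate:.0%} retained in complete-case training")
```

Running this kind of audit before any fairness metric is computed makes the unequal evidence base visible first.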

**Slide: "Fair" AUC, Unfair Calibration**
The calibration plot showing two groups with equal AUC but different calibration is the key technical result. Walk through the clinical consequence: the model fires alerts at the same probability threshold for both groups, but because calibration differs, the same predicted probability corresponds to a different observed risk in each group. Group B's alerts therefore carry a different clinical meaning than Group A's. The same alert means different things in different patients, and the model doesn't know it.
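
For a technically minded audience, it may help to show why equal AUC does not imply equal calibration: AUC depends only on the ranking of scores, so any monotone distortion leaves it unchanged. The sketch below uses invented scores and outcomes, and squaring Group B's scores is an arbitrary distortion chosen for illustration, not a claim about how miscalibration arises in practice.

```python
def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((1.0 if p > n else (0.5 if p == n else 0.0)) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_predicted_vs_observed(scores, labels):
    """Calibration-in-the-large: average predicted risk next to the observed event rate."""
    return sum(scores) / len(scores), sum(labels) / len(labels)


# Hypothetical outcomes shared by both groups, so the observed event rate is identical.
labels = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1]
scores_a = [0.02, 0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.65, 0.78, 0.90]
# Group B keeps the same ranking, but every score is squared, so risk is understated.
scores_b = [s ** 2 for s in scores_a]

for name, scores in [("A", scores_a), ("B", scores_b)]:
    pred, obs = mean_predicted_vs_observed(scores, labels)
    print(f"Group {name}: AUC={auc(scores, labels):.3f}, mean predicted={pred:.2f}, observed={obs:.2f}")
```

Both groups get the same AUC, yet Group B's average predicted risk sits well below its observed event rate, so a fixed alert threshold would fire less often for Group B at the same true risk.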

**Slide: Model Drift — The Monitoring Timeline**
Show the dual-panel monitoring figure. Ask the audience: "What would you see without monitoring?" Just two vertical lines (the protocol changes) and no indication that anything changed in model performance. The harm would accumulate silently. How long before someone noticed clinically that outcomes were different from expected? Weeks? Months?
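
One simple way to surface the silent drift described here is an observed-to-expected (O/E) ratio tracked per period against a pre-specified band. The sketch below is an assumption-laden illustration: the monthly counts, the 0.8 to 1.25 band, and the function names are hypothetical, and a real program would set its thresholds through the governance body.

```python
def oe_ratio(observed_events, predicted_risks):
    """Observed events divided by the events the model expected (sum of predicted risks)."""
    return observed_events / sum(predicted_risks)


def flag_drift(monthly, lower=0.8, upper=1.25):
    """Return the months whose O/E ratio falls outside the pre-specified band."""
    flagged = []
    for month, (observed, risks) in monthly.items():
        ratio = oe_ratio(observed, risks)
        if not lower <= ratio <= upper:
            flagged.append((month, round(ratio, 2)))
    return flagged


# Hypothetical monthly data: (observed deaths, predicted risks for that month's patients).
monthly = {
    "2026-01": (10, [0.10] * 100),  # about 10 expected, 10 observed
    "2026-02": (11, [0.10] * 100),  # about 10 expected, 11 observed
    "2026-03": (16, [0.10] * 100),  # about 10 expected, 16 observed, e.g. after a protocol change
}

print(flag_drift(monthly))
```

Without a rule like this, the third month looks no different from the first two on any dashboard that reports only alert counts.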

**Slide: Delphi Mandate — Monitoring Is Required**
Read the Delphi consensus statement directly: "Continuous performance monitoring is a required component of responsible clinical AI deployment." Not "recommended." Required. Then ask: "Which models deployed in this institution have prospective outcome monitoring in place? Who reviews the performance dashboard? What triggers a review?"

**Slide: Five-Level Automation Framework**
Walk through each level with the governance implications:

- L0 (manual): slow, not scalable, subject to human error
- L1 (rule-based): fast for simple rules, brittle when definitions change
- L2 (NLP-assisted): requires validation of extraction accuracy
- L3 (ML-scored): requires ongoing model performance monitoring
- L4 (autonomous): requires audit trail, test patients, exception escalation protocol

The ethics are not in the level — they're in whether the governance scales with the level.
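
The idea that governance should scale with the automation level can be expressed as a simple gap check. In the sketch below, the control names are paraphrased from the five-level list; treating them as cumulative (each level inherits the controls of the levels beneath it) is an assumption made for illustration, as are the function names.

```python
# Controls named in the five-level list; L0 and L1 name no specific control there.
CONTROLS_BY_LEVEL = {
    0: set(),
    1: set(),
    2: {"extraction_accuracy_validation"},
    3: {"model_performance_monitoring"},
    4: {"audit_trail", "test_patients", "exception_escalation_protocol"},
}


def required_controls(level):
    """Union of the controls for this level and every level below it (assumed cumulative)."""
    required = set()
    for lvl in range(level + 1):
        required |= CONTROLS_BY_LEVEL[lvl]
    return required


def governance_gaps(level, controls_in_place):
    """Controls the deployment still lacks for its automation level."""
    return sorted(required_controls(level) - set(controls_in_place))


# Hypothetical L3 deployment that validated its NLP extraction but has no monitoring program.
print(governance_gaps(3, {"extraction_accuracy_validation"}))
```

An empty result is the minimum bar for approval under this framing, not evidence that the controls are working.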

## Timing
- Missingness as fairness: 20 min
- AI performance monitoring: 20 min
- Ethics of CPG automation: 15 min

## Discussion prompts
"Your institution's sepsis AI is classified as L3 automation. It has been running for 18 months. Who reviews its performance? What metrics? At what frequency? What would trigger a review? If you can't answer all four, what does that mean?"

"A CPG compliance system reports that 91% of eligible patients received early resuscitation. The data is 6 weeks old. A care deviation began 5 weeks ago. How many patients experienced the deviation before it would be detected? What infrastructure would be needed to reduce that window to 48 hours?"

---

# Lecture 4 — Semantic Infrastructure, Responsible AI & DoDTR Modernization

**Posts covered:** 10 (Ontology as ethics), 11 (Responsible AI beyond the checklist), 12 (DoDTR modernization as ethical imperative)

## Teaching strategy

Lecture 4 is the capstone — it integrates the infrastructure themes from the OMOP series, the governance themes from the trauma registry series, and the ethical frameworks from the preceding three lectures into a single argument: the DoDTR modernization is not a technology upgrade. It is an ethical obligation, and it should be governed accordingly.

The most challenging moment in this lecture for clinical audiences is the "body count" framing. Be direct about it without being sensationalist: "Delayed feedback on care processes means patients experience care deviations for weeks or months before a corrective protocol can be issued. That is a quantifiable harm with a quantifiable magnitude. Call it what it is."
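
To show that the harm really is quantifiable, a back-of-the-envelope calculation is enough. The sketch below reuses the 6-week and 48-hour reporting lags from the Lecture 3 discussion prompt. The weekly patient volume and the share of patients affected are hypothetical, and the calculation assumes the deviation is detected as soon as the first affected week appears in a report, which makes it a lower bound.

```python
def patients_exposed(eligible_per_week, deviation_fraction, reporting_lag_weeks):
    """Patients affected by a care deviation before lagged data can first reveal it."""
    return eligible_per_week * deviation_fraction * reporting_lag_weeks


# Hypothetical inputs: 40 eligible patients per week, 25% of them affected by the deviation.
weekly_volume = 40
affected_share = 0.25

six_week_lag = patients_exposed(weekly_volume, affected_share, 6)
two_day_lag = patients_exposed(weekly_volume, affected_share, 2 / 7)

print(f"6-week reporting lag:  about {six_week_lag:.0f} patients exposed before detection")
print(f"48-hour reporting lag: about {two_day_lag:.0f} patients exposed before detection")
```

Substituting local volumes turns the abstract claim about delay into a number the audience can argue about.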

## Key talking points

**Slide: Local Codes Are an Ethical Problem**
Walk through the concrete examples: "GSW_ABD" vs. "ABDO_PEN" vs. "PENET_TORSO." Each sounds like penetrating abdominal trauma. Each may include or exclude blast injury differently. A meta-analysis that pools these under the same label is not analyzing the same clinical entity. If that meta-analysis informs a CPG, the CPG is based on a definitional fiction.
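
A definitional check before pooling makes the problem tangible. In the sketch below, the three code names come from the slide, but the definitions attached to them (body region and included mechanisms) are invented for illustration. Real local definitions would have to be collected from each site.

```python
# Hypothetical local definitions; the inclusion rules are invented to illustrate the problem.
LOCAL_CODES = {
    "GSW_ABD": {"region": "abdomen", "mechanisms": {"gunshot"}},
    "ABDO_PEN": {"region": "abdomen", "mechanisms": {"gunshot", "stab", "blast_fragment"}},
    "PENET_TORSO": {"region": "torso", "mechanisms": {"gunshot", "stab"}},
}


def poolable(codes, definitions=LOCAL_CODES):
    """Codes may be pooled only if their definitions match exactly."""
    first = definitions[codes[0]]
    return all(definitions[c] == first for c in codes[1:])


def definition_conflicts(codes, definitions=LOCAL_CODES):
    """List each pair of codes whose definitions differ, with the differing fields."""
    conflicts = []
    for i, a in enumerate(codes):
        for b in codes[i + 1:]:
            fields = [k for k in definitions[a] if definitions[a][k] != definitions[b][k]]
            if fields:
                conflicts.append((a, b, fields))
    return conflicts


codes = ["GSW_ABD", "ABDO_PEN", "PENET_TORSO"]
print("Safe to pool:", poolable(codes))
for a, b, fields in definition_conflicts(codes):
    print(f"{a} vs {b}: differ on {fields}")
```

The check can only run if the definitions are written down somewhere, which is usually where the exercise stalls.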

**Slide: Semantic Infrastructure Layers**
The bars showing ethical risk vs. interoperability gain make the tradeoff concrete. Moving up the layers (from local codes to ontology networks) requires more investment but reduces ethical risk and improves interoperability. The first bar (local codes) should prompt the question: "How much of your current data infrastructure lives here?"

**Slide: Ontology Governance as a Political Act**
Read the governance questions aloud. "Who decides what 'penetrating trauma' means in the DoDTR? Who decides when that definition changes?" Then ask the audience: "In your registry, who is that person? Is there a formal process?" Most registries have no answer. The silence is the finding.

**Slide: The Six Gaps**
Walk through each gap in the Delphi consensus analysis:

- Gap 1 ("Contemporary methods" is an escape hatch): If the consensus says "use contemporary methods," who decides what's contemporary in 3 years?
- Gap 2 (AI literacy as precondition): The consensus calls for literacy but doesn't specify it. A clinician who clicks "accept" on an alert has nominal literacy. That's not enough.
- Gap 3 (Local validation required): Global validation doesn't transfer. A model validated on civilian Level 1 trauma data is not validated for forward-deployed military trauma.
- Gap 4 (Audit trail must include the prompt): For LLM-assisted workflows, the input to the model is part of the audit trail. Most governance frameworks don't specify this.
- Gap 5 (Patient transparency): Patients have a right to know when AI contributed to a decision affecting them. Most systems have no mechanism for this.
- Gap 6 (PHI in LLM workflows): This is the most urgent gap. Frame it as a current, active risk — not a hypothetical.
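
Gap 4 can be made concrete with a minimal audit record that binds a model output to the exact prompt that produced it. The sketch below is one possible shape, not a prescribed format: the field names, the model version string, and the user identifier are hypothetical. It stores SHA-256 digests rather than raw text, on the assumption that prompts may contain PHI (Gap 6) and that the raw prompt would be retained separately under access controls.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_hex(text):
    """SHA-256 digest of a string, hex-encoded."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def audit_record(prompt, output, model_version, user_id):
    """Build an audit entry that ties the output to the exact prompt that produced it."""
    record = {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "model_version": model_version,
        "prompt_sha256": sha256_hex(prompt),
        "output_sha256": sha256_hex(output),
    }
    # Hash the record itself so later edits to any field are detectable.
    record["record_sha256"] = sha256_hex(json.dumps(record, sort_keys=True))
    return record


entry = audit_record(
    prompt="Summarize resuscitation timeline for encounter <ID>",  # placeholder, no real PHI
    output="(model output text)",
    model_version="hypothetical-llm-2026-06",
    user_id="analyst-017",
)
print(json.dumps(entry, indent=2))
```

A digest proves which prompt was used only if the original prompt can still be produced for comparison, so retention policy remains part of the audit design.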

**Slide: The Modernization Gap Chart**
The side-by-side bars comparing current state to required state across six dimensions should feel sobering. Ask the audience: "Which bar is closest to being filled? Which faces the most institutional resistance?" Usually: technical infrastructure is easier than governance infrastructure. The governance items (named governance body, amendment authority, pre-specified retirement criteria) are where modernization stalls.

## Timing
- Semantic infrastructure: 20 min
- Responsible AI beyond the checklist: 20 min
- DoDTR modernization: 15 min

## Series-Level Discussion Questions

1. A hospital administrator says: "We have human review of every AI-assisted decision. We're covered." Walk through three ways this statement can be simultaneously true and ethically insufficient.

2. A risk score for trauma mortality was built on DoDTR data from 2018–2022 and deployed in 2024. No monitoring program was established. It's now 2026. What specific processes should have been in place, and what would you do now?

3. A vendor proposes a model for CPG compliance monitoring at L4 (fully autonomous). What five governance requirements would you demand in the contract before approval?

4. You are asked to evaluate whether a trauma AI model is "fair." What specific analyses would you run, and what criteria would you use to define "fair enough for deployment"?

5. Your institution is asked to adopt the DoDTR OMOP implementation and share data with a federated research network. What five items must be in your data sharing agreement, beyond what OMOP's CDM specification requires?

---

# Cross-Series Teaching Connections

Use these connections when teaching the Ethics series alongside other series:

| Ethics concept | Related lecture | Further lecture or technique |
|---|---|---|
| Accountability traceability | Audit-ready analysis (Trauma L1) | Analysis contracts, SHA-256 hashing |
| Calibration as ethical property | Calibration plots (Applied L9) | AUC vs. calibration (Trauma L1) |
| Missingness as fairness | Missing data mechanisms (Advanced L1) | MNAR sensitivity (Trauma L3) |
| Automation bias | Human-in-the-loop (Ethics L2) | Governance lifecycle (Trauma L5) |
| Ontology governance | OMOP value-level metadata (OMOP L1) | Data provenance (Trauma L1) |
| Monitoring as ethical requirement | SPC O/E charts (Trauma L5) | Model drift (Ethics L3) |