Ethics of Clinical AI — Lecture 2 of 4
Data InDeed | dataindeed.org
2026-01-01
A risk score predicts. It does not decide. It does not justify. It does not bear responsibility.
Post 04 Prediction vs. Responsibility
Post 05 Human-in-the-Loop
Post 06 Ethics of Excluding Data
Risk scores predict — they do not legitimize
What a risk score does:
Given features X, a risk score outputs P(outcome | X, model). It summarizes the statistical relationship between observed features and a historical outcome in a specific population.
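A minimal sketch of that definition, assuming a logistic model (the lecture does not name a model class). The coefficients and patient features are hypothetical. Note what the function returns: a number between 0 and 1, with no decision, no justification, and no owner attached.

```python
import math


def risk_score(features, weights, intercept):
    """Return P(outcome | X, model) under an assumed logistic model."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients and patient features, for illustration only.
weights = [0.8, -0.5]
intercept = -1.0
patient = [2.0, 1.0]

p = risk_score(patient, weights, intercept)
print(round(p, 3))
```

Anything that happens after the number is produced (a threshold, an allocation, a withheld treatment) is a separate step that someone has to choose and answer for.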
What a risk score does not do:
The ethical shift:
When a risk score of 0.85 is used to justify a decision to withhold care, the question changes from “What does this patient need?” to “What does the model say about this patient’s probability of benefit?”
The trauma triage scenario:
A mass casualty event produces 40 patients. A triage algorithm assigns priority scores. Those with scores below 0.2 are designated expectant.
The score does not know that this patient has a contraindication to the standard protocol, that the model was trained before this mechanism of injury was common, or that the model’s calibration has drifted since the last validation.
The score predicted. The human decided. But did the human review the score or ratify it?
The moral risk is highest exactly when the environment is most demanding. In trauma, mass casualty, and high-tempo operations — when cognitive shortcuts are most attractive — the conditions for automation bias are maximally present.
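Calibration drift, one of the things the score does not know about itself, can at least be checked from outside. The sketch below bins recent predictions and compares the mean predicted risk with the observed event rate in each bin. It assumes the triage score can be read as a predicted survival probability; the function name, the bin count, and the eight cases are hypothetical.

```python
def calibration_table(probs, outcomes, n_bins=5):
    """Compare mean predicted risk with the observed event rate per bin.

    Uses equal-width bins on [0, 1]; empty bins are skipped.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))

    rows = []
    for i, items in enumerate(bins):
        if not items:
            continue
        mean_pred = sum(p for p, _ in items) / len(items)
        observed = sum(y for _, y in items) / len(items)
        rows.append({
            "bin": i,
            "n": len(items),
            "mean_pred": mean_pred,
            "observed": observed,
            "gap": observed - mean_pred,
        })
    return rows


# Hypothetical recent cases: predicted survival probability and
# observed survival (1 = survived, 0 = died).
probs = [0.10, 0.15, 0.10, 0.15, 0.90, 0.85, 0.90, 0.95]
outcomes = [1, 0, 1, 0, 1, 1, 1, 1]

table = calibration_table(probs, outcomes)
for row in table:
    print(row)
```

In this toy data the lowest bin, which sits below the 0.2 expectant cutoff, has a large positive gap: patients survive far more often than the score predicts. A gap like that in the expectant band is a signal to revalidate before the score is trusted for triage.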
The laundering argument:
“We’re not making this decision — the model is. The model learned from thousands of historical cases.”
The problem:
If historical decisions reflected system-level inequities — differential access, varying standards of care, documentation gaps — then the model has learned to reproduce those inequities.
A model that was trained on data where patients from under-resourced settings had worse outcomes will predict worse outcomes for similar patients at deployment — and if that prediction is used to allocate resources, the inequity becomes self-perpetuating.
The historical data are not a moral baseline. They are a record of what happened — including what happened that was unjust. Using “the model learned from data” as a justification implicitly endorses everything embedded in that history.
Standard clinical settings:
Trauma / mass casualty settings:
The compounding effect:
In mass casualty events, a risk score that is probably right for most patients will occasionally be wrong. In a standard setting, the wrong decision is one tragedy. In a mass casualty event (MASCAL) with 40 patients processed in 20 minutes, the same error rate applies across the whole cohort, so wrong decisions accumulate, and the feedback loop for correction is hours or days away.
The stakes of prediction errors in trauma are not normally distributed. They are clustered at the worst moments.
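The compounding effect can be made concrete with a little arithmetic. The sketch below uses the 40-patient cohort from the scenario and a hypothetical 5% per-patient error rate. It also assumes errors are independent across patients, which is a simplification: the point of this section is that real errors tend to cluster.

```python
def expected_errors(n_patients, error_rate):
    """Expected number of wrong decisions in a cohort."""
    return n_patients * error_rate


def prob_at_least_one_error(n_patients, error_rate):
    """Probability that at least one decision in the cohort is wrong.

    Assumes errors are independent across patients (a simplification).
    """
    return 1.0 - (1.0 - error_rate) ** n_patients


n = 40      # cohort size from the scenario
e = 0.05    # hypothetical per-patient error rate, not from the lecture

exp_wrong = expected_errors(n, e)
p_any = prob_at_least_one_error(n, e)
print(exp_wrong, round(p_any, 3))
```

Under these assumptions a score that is right 95% of the time still produces about two wrong decisions in the cohort, and the chance that at least one patient is mis-triaged is about 0.87.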
Why nominal oversight often fails
The intuition:
“We have a human in the loop. The model recommends; a clinician decides. Therefore accountability is preserved.”
What this gets right:
What this misses:
The critical question HITL rarely answers:
Does the human have enough information, time, and cognitive bandwidth to exercise meaningful judgment — or are they rubber-stamping the algorithm?
If the answer is rubber-stamping, HITL is not a safeguard. It is liability transfer.
The institution can say “a clinician reviewed every decision” while providing conditions under which real review was impossible.
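Whether review was real or nominal is partly measurable. The sketch below summarizes a decision log by override rate and review time. The log schema (`model_recommendation`, `clinician_decision`, `review_seconds`), the 30-second threshold, and the five records are all hypothetical; a real audit would need locally agreed definitions.

```python
from statistics import median


def review_audit(log, min_seconds=30):
    """Summarize how often clinicians overrode the model and how long reviews took.

    `min_seconds` is a hypothetical floor below which a review counts as rushed.
    """
    if not log:
        raise ValueError("decision log is empty")
    n = len(log)
    overrides = sum(
        1 for r in log if r["clinician_decision"] != r["model_recommendation"]
    )
    times = [r["review_seconds"] for r in log]
    rushed = sum(1 for t in times if t < min_seconds)
    return {
        "n": n,
        "override_rate": overrides / n,
        "median_review_seconds": median(times),
        "rushed_fraction": rushed / n,
    }


# Hypothetical decision log, for illustration only.
log = [
    {"model_recommendation": "expectant", "clinician_decision": "expectant", "review_seconds": 8},
    {"model_recommendation": "immediate", "clinician_decision": "immediate", "review_seconds": 12},
    {"model_recommendation": "delayed", "clinician_decision": "delayed", "review_seconds": 6},
    {"model_recommendation": "expectant", "clinician_decision": "immediate", "review_seconds": 95},
    {"model_recommendation": "delayed", "clinician_decision": "delayed", "review_seconds": 10},
]

summary = review_audit(log)
print(summary)
```

A low override rate alone does not prove rubber-stamping, since the model may simply be right. A low override rate combined with review times of a few seconds is harder to describe as meaningful judgment.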
Adding visual explanations (bar charts, SHAP plots) does not reduce automation bias. Showing calibration and uncertainty intervals does, modestly. The interface is a governance decision.
The liability transfer mechanism:
This is a governance failure, not a clinical failure. The institution designed a system in which meaningful review was unlikely — then used the form of HITL to shield itself from accountability.
What meaningful HITL requires:
All five are infrastructure investments — not just UI decisions.
When data cleaning becomes a moral decision
Common exclusion criteria in clinical modeling:
These sound methodologically reasonable. But:
Who is actually excluded:
The “messy” patients are disproportionately the most severe, most marginalized, and most operationally important.
A model validated on complete cases will underestimate mortality risk for severe patients — exactly the patients for whom accurate risk estimation matters most.
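A toy cohort shows the mechanism. In the sketch below, incomplete records are concentrated among severe patients who died, for example because they died before labs or follow-up could be recorded. All ten records are fabricated for illustration, and the example shows only the shift in observed mortality, not a full model validation.

```python
def make(severity, complete, died, n):
    """Build n identical hypothetical patient records."""
    return [
        {"severity": severity, "complete": complete, "died": died}
        for _ in range(n)
    ]


def mortality(records):
    """Observed mortality rate in a list of records."""
    return sum(r["died"] for r in records) / len(records)


# Hypothetical cohort: missingness is concentrated in severe patients who died.
cohort = (
    make("mild", True, 0, 5)
    + make("severe", True, 1, 1)
    + make("severe", True, 0, 1)
    + make("severe", False, 1, 3)
)

complete_cases = [r for r in cohort if r["complete"]]
severe_all = [r for r in cohort if r["severity"] == "severe"]
severe_complete = [r for r in complete_cases if r["severity"] == "severe"]

print("all patients:     ", round(mortality(cohort), 3))
print("complete cases:   ", round(mortality(complete_cases), 3))
print("severe, all:      ", round(mortality(severe_all), 3))
print("severe, complete: ", round(mortality(severe_complete), 3))
```

Dropping the three incomplete records lowers observed mortality both overall and within the severe group. A model fitted or validated against the complete cases is being checked against an outcome rate that the real severe population does not have.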
The uncomfortable observation:
Patients who generate “clean” data — complete records, standard presentations, routine follow-up — are systematically different from patients who generate “messy” data.
Clean data patients tend to have: better access to care before injury, more resources to engage with follow-up, injuries that present within standard clinical pathways, documentation infrastructure at their care site.
“Data quality” is not random. It is correlated with social, geographic, and operational determinants of health. Excluding on “quality” therefore systematically excludes on these determinants.
The model learns to work best for patients who already have the best access to care. This is the opposite of where clinical AI capacity is most needed.
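One way to see whether "data quality" exclusions are falling unevenly is to report the exclusion rate per subgroup before any modeling starts. The sketch below is an assumption-laden illustration: the `site` labels, the `missing_fields` field, and the exclusion rule are hypothetical.

```python
from collections import Counter


def exclusion_rates(records, group_key, is_excluded):
    """Return the fraction of records excluded within each group."""
    total = Counter()
    excluded = Counter()
    for r in records:
        group = r[group_key]
        total[group] += 1
        if is_excluded(r):
            excluded[group] += 1
    return {group: excluded[group] / total[group] for group in total}


# Hypothetical records: care site and number of missing required fields.
records = (
    [{"site": "level_1_center", "missing_fields": 0} for _ in range(7)]
    + [{"site": "level_1_center", "missing_fields": 2}]
    + [{"site": "forward_facility", "missing_fields": 0}]
    + [{"site": "forward_facility", "missing_fields": 3} for _ in range(3)]
)


def has_missing(record):
    """Hypothetical complete-case rule: drop any record with a missing field."""
    return record["missing_fields"] > 0


rates = exclusion_rates(records, "site", has_missing)
print(rates)
```

In this toy data a rule that sounds neutral removes one in eight patients from one site and three in four from the other. Publishing a table like this alongside the model is one way of being explicit about which population the model actually serves.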
Exclusion is defensible when:
Exclusion is not defensible when:
The ethical alternative to exclusion:
The discipline is not in excluding patients until the data look clean; it is in being explicit about what population your model actually serves.
Prediction vs. Responsibility
Human-in-the-Loop
Ethics of Data Exclusion
The meta-lesson: Prediction, oversight, and exclusion are three distinct ethical decision points — each with its own failure mode. None of them can be resolved by better statistics alone. All of them require institutional governance.
Posts 07, 08 & 09: Fairness, Performance Monitoring & The Ethics of Automation