You Can’t Trust What You Don’t Track: AI Performance Monitoring in Clinical Systems

Ethics in Trauma Registry Analysis
Why validated-at-deployment is not the same as safe-to-use, and why continuous performance monitoring is an ethical requirement for clinical AI.
Published: April 1, 2025

Modified: June 9, 2026

Executive Summary

Most clinical AI systems are evaluated once.

They undergo validation on a held-out dataset, pass a performance threshold, and are deployed.

That is not ongoing safety.

That is a point-in-time snapshot pretending to be a guarantee.

Clinical populations change. Data systems change. Clinical practice changes. Models do not change themselves.

The result is model drift: the degradation of AI performance after deployment, which stays silent whenever no mechanism exists to detect or respond to it.

This post argues that continuous performance monitoring is not an optional enhancement.

It is an ethical requirement for any clinical AI system operating at scale.


What Model Drift Actually Is

Model drift is not a technical curiosity.

It is a predictable consequence of deploying a fixed model into a changing world.

It takes several forms:

  • Data drift: the input distribution shifts — patients look different from the training set
  • Concept drift: the relationship between predictors and outcomes changes — care practices evolve
  • Population shift: new patient subgroups enter the system who were underrepresented at training
  • Label drift: the outcome itself is redefined, recoded, or measured differently

In clinical settings, all four can happen simultaneously.

A hemorrhage-risk model trained on 2020 resuscitation data may quietly fail after 2024 protocol changes, and no one will know unless someone is watching.


The Cost of Invisible Degradation

When a model degrades silently, the consequences are not random.

They fall unevenly.

Because model drift disproportionately affects patients whose data patterns diverge from the training set, the first populations to be harmed are those already underrepresented at development (Obermeyer et al. 2019; Rajkomar et al. 2018).

This is not a worst-case scenario.

It is the expected outcome of deploying any model into a heterogeneous population without a monitoring infrastructure.

The ethical failure is not the drift itself.

It is operating a system where drift is invisible.


Validated at Deployment Is Not a Safety Guarantee

“We validated this model” is one of the most common and least informative things said about clinical AI.

Validation answers one question:

Did this model perform adequately on this dataset at this time?

It does not answer:

  • Does it still perform adequately today?
  • Does it perform across all demographic subgroups?
  • Does it generalize to institutions it was not developed at?
  • Has clinical practice shifted enough to change the outcome relationships?

Deployment validation is a threshold cleared, not a sustained condition.

Treating it as ongoing assurance is rarely malicious.

It is negligence by omission.


The Delphi Mandate: Continuous Monitoring Is Not Aspirational

The Delphi consensus on responsible AI in clinical settings makes the monitoring obligation explicit.

Individual institutions should continuously monitor any integrated AI tools intended for clinical decision-making support for accuracy, bias, model drift, and unintended consequences.

The inclusion of model drift as a named monitoring target is significant.

It acknowledges the technical reality that performance degrades over time as real-world clinical data changes (Montgomery 2020; Davis et al. 2017).

This is not a recommendation to monitor when convenient.

It is a condition of responsible deployment.


What Monitoring Actually Means

Performance monitoring in clinical AI is not the same as running accuracy metrics once a quarter.

It requires four overlapping capabilities:

1. Calibration tracking

Is the model’s confidence still meaningful?

A model that predicts 80% risk when the true risk is 40% is not just imprecise — it drives the wrong decisions (Van Calster et al. 2019; Steyerberg et al. 2004).

Calibration should be measured continuously, not assumed.
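One way to make this concrete is to recompute a few calibration statistics for every monitoring window. The sketch below is illustrative only: the function name `calibration_summary` and its return keys are hypothetical, and it assumes each window supplies predicted risks alongside observed binary outcomes. It reports the observed-to-expected ratio and fits a calibration intercept and slope by logistic regression of the outcome on the logit of the predicted risk, using plain Newton steps so that it needs only the standard library.

```python
import math


def _logit(p, eps=1e-6):
    # Clip so that predictions of exactly 0 or 1 do not produce infinities.
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def calibration_summary(probs, outcomes, max_iter=50, tol=1e-10):
    """Summarize calibration for one monitoring window (illustrative sketch).

    probs: predicted risks in [0, 1]; outcomes: observed 0/1 labels.
    Returns the observed/expected ratio plus the calibration intercept and
    slope, or None for both when the slope cannot be identified.
    """
    x = [_logit(p) for p in probs]
    a, b = 0.0, 1.0  # start from perfect calibration
    fitted = False
    for _ in range(max_iter):
        # Gradient (g) and information matrix (h) of the logistic likelihood.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, outcomes):
            mu = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = mu * (1.0 - mu)
            g0 += yi - mu
            g1 += (yi - mu) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det <= 1e-9 * h00 * h11:
            break  # predictions are (nearly) constant: slope not identifiable
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        fitted = True
        if abs(da) + abs(db) < tol:
            break
    return {
        "oe_ratio": sum(outcomes) / sum(probs),
        "intercept": a if fitted else None,
        "slope": b if fitted else None,
    }
```

An observed-to-expected ratio of 0.5 corresponds to the 80%-versus-40% case above. A slope well below 1 suggests the model's predictions are more extreme than the outcomes justify. What counts as an acceptable range is a governance decision, not something this sketch can supply.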

2. Subgroup performance surveillance

Does performance hold across demographic groups, injury mechanisms, care settings, and time windows?

A model with good overall performance and poor subgroup performance is not performing ethically.
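A minimal sketch of subgroup surveillance follows. It assumes each scored encounter is a dictionary with a predicted risk, an observed outcome, and a grouping field. The field names, the choice of Brier score as the metric, and the helpers `subgroup_brier` and `brier_gap` are all assumptions made for illustration. The same pattern applies to any metric and any grouping variable (demographic group, injury mechanism, care setting, time window).

```python
from collections import defaultdict


def subgroup_brier(records, group_key):
    """Brier score and sample size per subgroup.

    records: iterable of dicts with 'prob', 'outcome' (0/1) and group_key.
    """
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum sq. error, n]
    for r in records:
        t = totals[r[group_key]]
        t[0] += (r["prob"] - r["outcome"]) ** 2
        t[1] += 1
    return {g: {"brier": sse / n, "n": n} for g, (sse, n) in totals.items()}


def brier_gap(scores, min_n=1):
    """Largest difference in Brier score between subgroups with n >= min_n.

    Returns None when fewer than two subgroups have enough data to compare.
    """
    eligible = [s["brier"] for s in scores.values() if s["n"] >= min_n]
    return max(eligible) - min(eligible) if len(eligible) >= 2 else None


# Hypothetical example: performance by injury mechanism.
records = [
    {"prob": 0.9, "outcome": 1, "mechanism": "blunt"},
    {"prob": 0.1, "outcome": 0, "mechanism": "blunt"},
    {"prob": 0.9, "outcome": 0, "mechanism": "blast"},
    {"prob": 0.1, "outcome": 1, "mechanism": "blast"},
]
scores = subgroup_brier(records, "mechanism")
gap = brier_gap(scores)
```

Reporting the sample size next to each score matters. A small subgroup can look fine or terrible by chance, and a gap that goes unreported because the subgroup was too small to evaluate is itself a finding.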

3. Drift detection

Statistical process control methods can detect when input distributions shift beyond training boundaries (Montgomery 2020).

Flagging distributional shifts before they corrupt predictions is basic operational safety.
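As one hedged illustration of the statistical process control idea, the sketch below applies a Shewhart-style x-bar check to a single numeric input. It assumes a baseline sample of that feature from the training period and equally sized batches from production. The function names, the 3-sigma default, and the example values are illustrative assumptions rather than a recommended configuration. Real deployments would likely track many features and might prefer EWMA or CUSUM charts for small, gradual shifts.

```python
import math
import statistics


def control_limits(baseline, batch_size, k=3.0):
    """Control limits for the mean of a batch of `batch_size` observations.

    baseline: feature values from the training (reference) period.
    k: width of the limits in standard errors (3.0 is the common default).
    """
    mu = statistics.fmean(baseline)
    sigma = statistics.stdev(baseline)
    half_width = k * sigma / math.sqrt(batch_size)
    return mu - half_width, mu + half_width


def out_of_control(batches, limits):
    """Indices of batches whose mean falls outside the control limits."""
    lo, hi = limits
    return [i for i, b in enumerate(batches)
            if not lo <= statistics.fmean(b) <= hi]


# Hypothetical example: admission systolic blood pressure (mmHg).
baseline = [110, 120, 130, 115, 125, 118, 122, 128, 112, 120]
limits = control_limits(baseline, batch_size=4)
batches = [
    [118, 122, 121, 119],  # resembles the training period
    [95, 100, 98, 102],    # shifted low
    [131, 133, 129, 135],  # shifted high
]
flagged = out_of_control(batches, limits)
```

A flagged batch does not prove the model is wrong. It says the inputs no longer look like the data the model was trained on, which is the cue to check calibration and outcomes for that window.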

4. Outcome feedback

Was the model’s prediction followed by the outcome it was predicting?

Linking predictions to outcomes in near-real-time creates the ground truth needed to evaluate ongoing performance.
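The linkage itself can be sketched as a small in-memory structure. The class `OutcomeFeedback`, the use of an encounter identifier as the join key, and the rolling Brier score are assumptions for illustration. A production system would persist this in the registry or a database rather than in process memory.

```python
from collections import deque


class OutcomeFeedback:
    """Links predictions to later-arriving outcomes (illustrative sketch)."""

    def __init__(self, window=100):
        self.pending = {}                    # encounter_id -> predicted risk
        self.errors = deque(maxlen=window)   # most recent squared errors

    def record_prediction(self, encounter_id, prob):
        self.pending[encounter_id] = prob

    def record_outcome(self, encounter_id, outcome):
        """Attach an observed 0/1 outcome; returns False if nothing to link."""
        prob = self.pending.pop(encounter_id, None)
        if prob is None:
            return False
        self.errors.append((prob - outcome) ** 2)
        return True

    def rolling_brier(self):
        """Mean squared error over the most recent linked predictions."""
        if not self.errors:
            return None
        return sum(self.errors) / len(self.errors)


fb = OutcomeFeedback(window=2)
fb.record_prediction("enc-1", 0.9)
fb.record_prediction("enc-2", 0.2)
fb.record_prediction("enc-3", 0.5)
fb.record_outcome("enc-1", 1)
fb.record_outcome("enc-2", 0)
first_window = fb.rolling_brier()
fb.record_outcome("enc-3", 1)
second_window = fb.rolling_brier()
```

The `pending` dictionary is worth watching in its own right. Predictions that never receive an outcome are predictions that can never be evaluated, and a growing backlog is a monitoring gap.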


Alert Mechanisms as Ethical Infrastructure

Monitoring without alerting is observation without accountability.

When performance falls below a threshold, there must be:

  • an automated alert to clinical governance leads,
  • a defined escalation path,
  • a protocol for temporary suspension or annotation,
  • and a reporting mechanism to affected healthcare professionals, oversight bodies, and — when patient safety is implicated — patients.

That last point is consistently underdeveloped.

If an AI system has been contributing to clinical decisions and its performance degrades, patients whose care was affected have a right to know.

This is not theoretical liability management.

It is basic respect for persons.
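A hedged sketch of the threshold-to-action mapping is shown below. The `Threshold` record, the metric names, the numeric limits, and the action labels ("notify", "suspend") are hypothetical placeholders. The real values belong to the governance body that owns the system. The sketch also treats a metric that was never computed as an alert, on the assumption that missing monitoring data should not pass silently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str      # name of the monitored metric
    limit: float     # predefined limit (set by governance, not by this code)
    direction: str   # "below" or "above": which side of the limit is a breach
    action: str      # escalation step, e.g. "notify" or "suspend"


def evaluate(metrics, thresholds):
    """Return (metric, value, action) for every breached or missing metric."""
    alerts = []
    for t in thresholds:
        value = metrics.get(t.metric)
        if value is None:
            # Assumption: a metric that was not computed is itself alert-worthy.
            alerts.append((t.metric, None, "notify"))
            continue
        if t.direction == "below":
            breached = value < t.limit
        else:
            breached = value > t.limit
        if breached:
            alerts.append((t.metric, value, t.action))
    return alerts


# Hypothetical limits, for illustration only.
thresholds = [
    Threshold("calibration_slope", 0.8, "below", "suspend"),
    Threshold("subgroup_brier_gap", 0.05, "above", "notify"),
]
alerts = evaluate({"calibration_slope": 0.62, "subgroup_brier_gap": 0.02},
                  thresholds)
```

Each returned action still needs a human owner and a documented escalation path behind it. The code only decides that someone must be told.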


Expert Feedback Loops as Performance Data

One of the most valuable monitoring signals in a well-designed clinical AI system is expert annotation.

When a clinician overrides a model recommendation, that override contains information:

  • about what the model got wrong,
  • about which patient characteristics were unusual,
  • about which clinical presentations fall outside the training distribution.

Systems that log overrides, capture rationales, and feed them back into model improvement treat expert clinical judgment as a data source rather than an inconvenience.

This is what meaningful human-in-the-loop looks like:

Not a checkbox before deployment, but an ongoing relationship between clinical expertise and model performance (Sendak et al. 2020).
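A minimal sketch of what logging overrides might look like follows. The `Override` fields, the rationale codes, and the `override_summary` helper are hypothetical. An actual system would define its own controlled vocabulary and tie each record to the model version that produced the recommendation.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Override:
    encounter_id: str
    model_recommendation: str
    clinician_action: str
    rationale_code: str   # hypothetical controlled vocabulary
    free_text: str = ""   # optional narrative rationale


def override_summary(overrides, total_recommendations):
    """Override rate and rationale counts for one review period."""
    if total_recommendations:
        rate = len(overrides) / total_recommendations
    else:
        rate = None
    by_reason = Counter(o.rationale_code for o in overrides)
    return rate, by_reason.most_common()


# Hypothetical log entries for illustration.
log = [
    Override("enc-1", "activate MTP", "observe", "atypical_presentation"),
    Override("enc-2", "observe", "activate MTP", "model_missed_finding"),
    Override("enc-3", "activate MTP", "observe", "atypical_presentation",
             "Anticoagulated patient; vitals misleading."),
]
rate, reasons = override_summary(log, total_recommendations=20)
```

A rising override rate, or a cluster of overrides sharing one rationale, is a drift signal that can arrive before outcome data does.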


The Military Context: Monitoring Under Operational Constraints

Continuous monitoring is harder in deployed military settings than in academic medical centers.

Theater operations may involve:

  • limited connectivity,
  • higher data latency,
  • faster patient turnover,
  • different injury patterns than training data,
  • and reduced analyst capacity to review monitoring dashboards.

These constraints do not eliminate the monitoring obligation.

They make it more important.

A model deployed in a combat support hospital with no feedback loop and no drift detection is not a validated clinical tool.

It is an unmonitored inference engine operating at the point of care.

The DOD Trauma Registry modernization effort must account for monitoring architecture, not just model development.


When to Retire a Model

Monitoring creates the evidence base for model retirement.

The criteria for retiring a clinical AI model should be defined before deployment, not improvised when something fails.

Retirement criteria might include:

  • calibration slope falling below a defined threshold,
  • subgroup performance gap exceeding an acceptable margin,
  • training data becoming demographically or clinically unrepresentative,
  • new clinical practice guidelines invalidating the outcome definition,
  • or a major shift in the patient population that the model was never designed to serve.

These criteria must be organizational — maintained by governance committees and oversight bodies — not delegated to individual clinicians who may lack the technical access to evaluate model provenance (Mitchell et al. 2019).


Governance Structure for Performance Monitoring

Monitoring only works if someone owns it.

That requires:

  • a named responsible party for each deployed clinical AI system,
  • a defined monitoring protocol with explicit metrics and thresholds,
  • a review cadence that is continuous or triggered, not merely annual,
  • a clear escalation path from analyst to clinical leadership,
  • and a public-facing reporting mechanism when safety events occur.

Without governance structure, monitoring data sits in dashboards no one reads.

The infrastructure matters.

The ownership matters more.


A Practical Checklist Before Continuing to Run a Deployed Clinical AI

Before claiming a clinical AI system is safe to continue operating, ask:

  • When was it last evaluated against current clinical outcomes?
  • Has the input data distribution changed since training?
  • Has calibration been measured since deployment?
  • Does performance hold across all relevant subgroups?
  • Are there alert mechanisms if performance degrades?
  • Who is responsible for monitoring, and do they have access to the data?
  • Are there predefined criteria for suspension or retirement?

If the answer to most of these is “we’re not sure,” the system is not being operated ethically.


Note: Where This Shows Up in AI/ML

FDA’s evolving framework for AI/ML Software as a Medical Device (SaMD) distinguishes locked models from adaptive ones and requires post-market performance monitoring for both. Models built on the DOD Trauma Registry (DoDTR) face a drift challenge that commercial SaMD frameworks were not designed for: injury patterns, casualty demographics, and treatment protocols shift with operational tempo, conflict type, and medical capability at point of injury. A model validated on Operation Iraqi Freedom and Operation Enduring Freedom (OIF/OEF) data may perform poorly in a peer-adversary conflict with different injury signatures (blast profiles, mass casualty ratios, evacuation timelines) without any monitoring system flagging the degradation. When drift goes undetected, clinicians continue receiving recommendations from a model that is increasingly miscalibrated for the population it is being asked to predict. Post-deployment surveillance is not a regulatory formality; in military health AI, it is the mechanism that prevents a valid model from becoming a dangerous one as the operational environment changes.

Closing: Deployment Is the Beginning, Not the End

Validating a model before deployment is the minimum.

Monitoring it continuously is the obligation.

The ethical standard for clinical AI does not end when the model goes live.

It begins there.

A clinical AI system without continuous performance monitoring is not a validated tool. It is an untested assumption that has been given a clinical interface.

That distinction — between a system that was once validated and a system that is currently monitored — is one of the most important in responsible AI deployment.

The DOD Trauma Registry, and any clinical system that relies on model outputs to guide care, must build monitoring infrastructure before claiming AI is operating safely.


This post is part of the Trauma Registry Analytics Toolkit — a companion reference with model monitoring templates, calibration drift detection scaffolds, performance surveillance code, and governance checklists for deployed clinical AI.



Series Callout

Note

This post is part of a broader Ethics in Trauma Registry Analysis Series:

  • Opacity Is Sometimes Ethical: When Black Boxes Save Lives
  • Accountability Without Interpretability: Who Owns a Model’s Decision?
  • Bias Isn’t Always Where You Think It Is: Ethical Failure Modes in Registry Data
  • Prediction vs Responsibility: Why Risk Scores Can Be Ethically Dangerous
  • Human-in-the-Loop Is Not a Panacea (and Sometimes a Lie)
  • The Ethical Implications of Excluding “Messy” Patients
  • Missingness as a Fairness Issue in Machine Learning
  • You Can’t Trust What You Don’t Track: AI Performance Monitoring in Clinical Systems
  • From Weeks to Minutes: The Ethics of Automating CPG Compliance
  • Ontology Is Not Optional: Semantic Infrastructure as Ethical Foundation
  • What Responsible AI in Clinical Guidance Actually Requires
  • Modernizing the DOD Trauma Registry: An Ethical and Technical Imperative

References

Davis, Sharon E., Thomas A. Lasko, Guanhua Chen, Edward D. Siew, and Michael E. Matheny. 2017. “Calibration Drift in Regression and Machine Learning Models for Acute Kidney Injury.” Journal of the American Medical Informatics Association 24 (6): 1052–61. https://doi.org/10.1093/jamia/ocx030.
Mitchell, Margaret, Simone Wu, Andrew Zaldivar, et al. 2019. “Model Cards for Model Reporting.” Proceedings of the Conference on Fairness, Accountability, and Transparency, 220–29. https://doi.org/10.1145/3287560.3287596.
Montgomery, Douglas C. 2020. Introduction to Statistical Quality Control. 8th ed. Wiley.
Obermeyer, Ziad, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. 2019. “Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations.” Science 366 (6464): 447–53. https://doi.org/10.1126/science.aax2342.
Rajkomar, Alvin, Michaela Hardt, Michael D. Howell, Greg Corrado, and Marshall H. Chin. 2018. “Ensuring Fairness in Machine Learning to Advance Health Equity.” Annals of Internal Medicine 169 (12): 866–72. https://doi.org/10.7326/M18-1990.
Sendak, Mark P., Jennifer D’Arcy, Sandeep Kashyap, et al. 2020. “A Path for Translation of Machine Learning Products into Healthcare Delivery.” EMJ Innovations 4 (1): 41–53.
Steyerberg, Ewout W., Gerard J. J. M. Borsboom, Hans C. van Houwelingen, Marinus J. C. Eijkemans, and J. Dik F. Habbema. 2004. “Validation and Updating of Predictive Logistic Regression Models: A Study on Sample Size and Shrinkage.” Statistics in Medicine 23 (16): 2567–86. https://doi.org/10.1002/sim.1844.
Van Calster, Ben, David J. McLernon, Maarten van Smeden, Laure Wynants, and Ewout W. Steyerberg. 2019. “Calibration: The Achilles Heel of Predictive Analytics.” BMC Medicine 17 (1): 230. https://doi.org/10.1186/s12916-019-1466-7.