---
title: "Why Most Clinical Models Fail in the Real World (And How to Fix Them in R)"
date: "2023-11-01"
categories: ["Trauma Registry and Other Topics", "Clinical Modeling", "Prediction Models"]
description: "Why clinical prediction models fail after deployment, and how calibration, validation, reporting, and context-aware modeling make them more defensible."
bibliography: ../../../refs/stats-series.bib
---

## Executive Summary

Most clinical prediction models do not fail because they are poorly fit.
They fail because they are **misaligned with reality**.

Despite impressive performance metrics during development, many models degrade—or collapse entirely—once deployed into real clinical environments. That pattern is well recognized in the clinical prediction literature, where weak reporting, optimism, and poor attention to bias often undermine apparent performance [@steyerberg2019_clinical_prediction_models; @moons2015_tripod; @wolff2019_probast]. This post explains *why* that happens and presents a set of **practical, defensible fixes** using R.

This is not a critique of machine learning.
It is a critique of **modeling without context**.

---

## 1. The Myth of “Good Performance”

### 1.1 The Comfortable Illusion of Validation

Clinical models are often declared “successful” based on:

* High AUROC
* Narrow confidence intervals
* Statistically significant predictors

Yet these metrics are frequently evaluated:

* On curated datasets
* Under stable assumptions
* Without operational constraints

That is exactly why discrimination alone is not enough: transport to practice depends on calibration, case-mix, outcome definition, and workflow fit, not just apparent development performance [@steyerberg2019_clinical_prediction_models; @collins2024_tripodai].

```{r}
# Illustrative example: a model that looks great on paper
library(tidymodels)

load("~/Google Drive/Data_InDeed/projects/Subclavian/input/TQIP/TRAUMA.RData")

# ============================================================
# NTDS 2022 demo: Risk-adjusted mortality model + changes over time
#   - Logistic regression (TRISS-like)
#   - Evaluate AUC + calibration slope by time bins (e.g., 4-year)
#   - Keeps NAs in the dataset; modeling uses na.exclude
# ============================================================

# ---- Packages (explicit notation) ----
required_pkgs <- c("dplyr", "stringr", "lubridate", "pROC", "ggplot2", "tibble", "purrr")
missing_pkgs <- required_pkgs[!vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(missing_pkgs) > 0) stop("Missing packages: ", paste(missing_pkgs, collapse = ", "))

# ---- Helper: safe numeric ----
to_num <- function(x) suppressWarnings(as.numeric(as.character(x)))

# ---- 1) Map your NTDS 2022 column names here ----
# Edit these to match *your* data. The rest of the code uses the standard names.
var_map <- list(
  mort      = "HOSP_MORT",     # 0/1 or Y/N
  age       = "AGE",           # years
  iss       = "ISS",           # 1-75
  gcs       = "GCS_TOTAL",     # 3-15
  sbp       = "SBP_ED",        # ED SBP (or first SBP)
  sex       = "SEX",           # M/F/U
  race      = "RACE",
  mech      = "INJ_MECH",      # blunt/penetrating/other or ICD E-code category
  admit_dt  = "ADMIT_YEAR" # date/time (preferred) or date
  #tc_level  = "TC_LEVEL"       # trauma center level, if present
)

TRAUMA$HOSP_MORT <- ifelse(TRAUMA$HOSPDISCHARGEDISPOSITION %in% c("Expired",
                                                                  "Deceased/Expired"), 
                           1, 0 )
TRAUMA$AGE <- as.numeric(TRAUMA$AGEYEARS)
TRAUMA$ISS <- as.numeric(TRAUMA$ISS)
TRAUMA$GCS_TOTAL <- as.numeric(TRAUMA$TOTALGCS)
TRAUMA$SBP_ED <- as.numeric(TRAUMA$SBP)
TRAUMA$SEX <- factor(TRAUMA$SEX, levels = c("Male", "Female"))

derive_race <- function(df) {
  df |>
    dplyr::mutate(
      RACE = dplyr::case_when(
        WHITE == "Yes" ~ "White",
        BLACK == "Yes" ~ "Black",
        ASIAN == "Yes" ~ "Asian",
        AMERICANINDIAN == "Yes" ~ "AI/AN",
        PACIFICISLANDER == "Yes" ~ "NH/PI",
        RACEOTHER == "Yes" ~ "Other",
        TRUE ~ "Unknown"
      ),
      RACE = factor(
        RACE,
        levels = c("White", "Black", "Asian", "AI/AN", "NH/PI", "Other", "Unknown")
      )
    )
}

TRAUMA <- derive_race(TRAUMA)

derive_sex <- function(df, sex_var = "SEX") {
  df |>
    dplyr::mutate(
      sex_raw = as.character(.data[[sex_var]]),
      SEX = dplyr::case_when(
        sex_raw == "Male"   ~ "Male",
        sex_raw == "Female" ~ "Female",
        TRUE                ~ "Unknown"
      ),
      SEX = factor(sex, levels = c("Male", "Female", "Unknown"))
    )
}

TRAUMA <- derive_sex(TRAUMA)

derive_mech <- function(df, mech_var = "PRIMARY_TRAUMATYPE") {
  df |>
    dplyr::mutate(
      mech_raw = as.character(.data[[mech_var]]),
      INJ_MECH = dplyr::case_when(
        is.na(mech_raw) ~ "Unknown",
        mech_raw == "Activity Code - Not Valid as a Primary E-Code" ~ "Unknown",
        mech_raw %in% c("Blunt", "Penetrating", "Burn", "Other/unspecified") ~ mech_raw,
        TRUE ~ "Unknown"
      ),
      INJ_MECH = factor(
        INJ_MECH,
        levels = c("Blunt", "Penetrating", "Burn", "Other/unspecified", "Unknown")
      )
    )
}

TRAUMA <- derive_mech(TRAUMA)
TRAUMA$ADMIT_YEAR <- as.numeric(TRAUMA$ADMIT_YEAR)

# ---- 2) Standardize into analysis-ready fields ----
prep_ntds_demo <- function(df, var_map) {

  # check presence
  needed <- unlist(var_map, use.names = FALSE)
  missing_cols <- setdiff(needed, names(df))
  if (length(missing_cols) > 0) {
    stop("These mapped columns are missing in df: ", paste(missing_cols, collapse = ", "))
  }

  df2 <- df |>
    dplyr::mutate(
      # outcome
      mort_raw = .data[[var_map$mort]],
      mort = dplyr::case_when(
        mort_raw %in% c(1, "1", "Y", "YES", "Yes", "yes", TRUE) ~ 1L,
        mort_raw %in% c(0, "0", "N", "NO", "No", "no", FALSE)  ~ 0L,
        TRUE ~ NA_integer_
      ),

      # core predictors
      age = to_num(.data[[var_map$age]]),
      iss = to_num(.data[[var_map$iss]]),
      gcs = to_num(.data[[var_map$gcs]]),
      sbp = to_num(.data[[var_map$sbp]]),

      # modest cleaning / bucketing
      sex = dplyr::case_when(
        stringr::str_to_upper(as.character(.data[[var_map$sex]])) %in% c("M", "MALE") ~ "M",
        stringr::str_to_upper(as.character(.data[[var_map$sex]])) %in% c("F", "FEMALE") ~ "F",
        TRUE ~ "U"
      ),
      mech_raw = stringr::str_to_upper(as.character(.data[[var_map$mech]])),
      mech = dplyr::case_when(
        stringr::str_detect(mech_raw, "PENET") ~ "PENETRATING",
        stringr::str_detect(mech_raw, "BLUNT") ~ "BLUNT",
        TRUE ~ "OTHER"
      ),
      race = factor(
        .data[[var_map$race]],
        levels = c("White", "Black", "Asian", "AI/AN", "NH/PI", "Other", "Unknown")
      ),
      #tc_level = as.character(.data[[var_map$tc_level]]),

      # time fields
      # admit_dt_raw = .data[[var_map$admit_dt]],
      # admit_dt = lubridate::as_datetime(admit_dt_raw, tz = "UTC"),
      # admit_date = as.Date(admit_dt),
      #year = lubridate::year(admit_dt),
      
          # year field (since ADMIT_YEAR is numeric)
      admit_year = to_num(.data[[var_map$admit_dt]]),

      # year bins (change as desired)
      year_bin = cut(
        admit_year,
        breaks = c(2006, 2010, 2014, 2018, 2022),
        labels = c("2007-2010", "2011-2014", "2015-2018", "2019-2022"),
        right = TRUE
      )
    ) |>
    # basic plausibility trimming (does NOT drop NAs; only caps impossible values)
    dplyr::mutate(
      iss = dplyr::if_else(iss < 0 | iss > 75, NA_real_, iss),
      gcs = dplyr::if_else(gcs < 3 | gcs > 15, NA_real_, gcs),
      sbp = dplyr::if_else(sbp < 0 | sbp > 300, NA_real_, sbp),
      age = dplyr::if_else(age < 0 | age > 110, NA_real_, age)
    )

  df2
}

df_demo <- prep_ntds_demo(df = TRAUMA, var_map = var_map) 

df_demo <- df_demo |> 
  dplyr::select(c(names(var_map)[-9], year_bin)) 


# ---- 3) Fit a TRISS-like logistic model (with simple nonlinearity via splines if you want) ----
# Base model: mort ~ iss + age + gcs + sbp + mech + sex + tc_level + year_bin
# Note: na.action=na.exclude keeps row alignment for predictions.

fit_model <- function(dat) {
  stats::glm(
    mort ~ iss + age + gcs + sbp + mech + sex + race + year_bin,
    data = dat,
    family = stats::binomial(),
    na.action = stats::na.exclude
  )
}

model_df <- df_demo |>
  dplyr::select(mort, age, iss, gcs, sbp, sex, race, mech, year_bin)

# how many levels are actually present (including NA shown separately)
lvl_check <- list(
  sex      = dplyr::count(model_df, sex),
  mech     = dplyr::count(model_df, mech),
  race     = dplyr::count(model_df, race),
  year_bin = dplyr::count(model_df, year_bin)
)

lvl_check$sex
lvl_check$mech
lvl_check$year_bin
lvl_check$race

m <- fit_model(df_demo)

# ---- 4) Evaluate AUC + calibration slope by year_bin ----
calc_auc <- function(y, p) {
  ok <- is.finite(y) & is.finite(p)
  if (sum(ok) < 50 || length(unique(y[ok])) < 2) return(NA_real_)
  as.numeric(pROC::auc(pROC::roc(response = y[ok], predictor = p[ok], quiet = TRUE)))
}

calc_cal_slope <- function(y, p) {
  ok <- is.finite(y) & is.finite(p) & p > 0 & p < 1
  if (sum(ok) < 50 || length(unique(y[ok])) < 2) return(NA_real_)
  lp <- stats::qlogis(p[ok])
  # calibration slope: logistic regression of y on lp
  s <- stats::glm(y[ok] ~ lp, family = stats::binomial())
  unname(stats::coef(s)[2])
}

df_demo <- df_demo |>
  dplyr::mutate(
    p_hat = stats::predict(m, newdata = df_demo, type = "response")
  )

perf_by_bin <- df_demo |>
  dplyr::group_by(year_bin) |>
  dplyr::summarise(
    n = dplyr::n(),
    n_obs_outcome = sum(!is.na(mort)),
    auc = calc_auc(mort, p_hat),
    cal_slope = calc_cal_slope(mort, p_hat),
    .groups = "drop"
  )

print(perf_by_bin)

# ---- 5) Simple plots: AUC and calibration slope over time bins ----
p_auc <- ggplot2::ggplot(perf_by_bin, ggplot2::aes(x = year_bin, y = auc, group = 1)) +
  ggplot2::geom_line() +
  ggplot2::geom_point() +
  ggplot2::labs(x = "Year bin", y = "AUC", title = "NTDS: Discrimination over time") +
  ggplot2::theme_minimal()

p_cal <- ggplot2::ggplot(perf_by_bin, ggplot2::aes(x = year_bin, y = cal_slope, group = 1)) +
  ggplot2::geom_hline(yintercept = 1, linetype = 2) +
  ggplot2::geom_line() +
  ggplot2::geom_point() +
  ggplot2::labs(x = "Year bin", y = "Calibration slope", title = "NTDS: Calibration slope over time") +
  ggplot2::theme_minimal()

print(p_auc)
print(p_cal)

```

What this does *not* tell us:

* Whether the data will look the same tomorrow
* Whether clinicians can act on the prediction
* Whether errors are symmetric or catastrophic

---

## 2. Failure Mode #1: Dataset Shift Is the Default, Not the Exception

### 2.1 Temporal Drift

Clinical practice evolves:

* New protocols
* New documentation standards
* New patient populations

```r
# Compare distributions across time
library(ggplot2)

ggplot(data, aes(x = systolic_bp, fill = era)) +
  geom_density(alpha = 0.4)
```

A model trained on “historical truth” often becomes **historically accurate and clinically wrong**.

### 2.2 Fix: Stress-Test Before Deployment

* Train on early data
* Validate on later data
* Repeat continuously

```r
# Rolling-origin evaluation
rs <- rolling_origin(
  data,
  initial = 365,
  assess = 90,
  cumulative = FALSE
)
```

---

## 3. Failure Mode #2: Missing Data Is Not a Nuisance—It’s a Signal

### 3.1 Why Complete-Case Analysis Fails Clinically

In real clinical data:

* Missingness is informative
* Documentation reflects workflow, not biology
* “Not recorded” often means “not measurable”

```r
library(naniar)

vis_miss(data)
```

Removing missing data frequently removes **the sickest patients**.

### 3.2 Fix: Model the Missingness

Options include:

* Multiple imputation
* Joint modeling
* Sensitivity analyses

```r
library(mice)

imp <- mice(data, m = 5, method = "pmm")
completed <- complete(imp)
```

The ethical question is not *how to eliminate missingness*, but **how honest your uncertainty is**.

---

## 4. Failure Mode #3: Metrics That Optimize the Wrong Decision

### 4.1 AUROC Is Not a Clinical Objective

In many clinical settings:

* Outcomes are rare
* False negatives dominate harm
* Thresholds matter more than ranking

```r
library(yardstick)

roc_auc(preds, truth, .pred_1)
pr_auc(preds, truth, .pred_1)
```

A model with excellent AUROC can still:

* Miss critical cases
* Overwhelm clinicians with false alarms

### 4.2 Fix: Evaluate What You’re Willing to Act On

* Precision–recall curves
* Decision curve analysis
* Threshold sensitivity

```r
library(rmda)

decision_curve(
  outcome ~ risk_score,
  data = data,
  thresholds = seq(0.01, 0.5, by = 0.01)
)
```

---

## 5. Failure Mode #4: Ignoring Hierarchy and Context

### 5.1 Flat Models in Hierarchical Systems

Clinical care is hierarchical:

* Patients nested in units
* Units nested in hospitals
* Hospitals nested in systems

Ignoring this structure leads to:

* Overconfident estimates
* Spurious site effects
* Poor generalization

### 5.2 Fix: Partial Pooling

```r
library(brms)

fit_hier <- brm(
  outcome ~ predictor + (1 | site),
  data = data,
  family = bernoulli()
)
```

Hierarchical models:

* Improve stability
* Reflect real-world variation
* Are often *more ethical* because they reduce overfitting

---

## 6. Failure Mode #5: Treating Prediction as the Final Product

### 6.1 Models Don’t Make Decisions—People Do

Clinical models fail when:

* Outputs lack context
* Uncertainty is hidden
* Responsibility is unclear

A prediction without:

* Provenance
* Assumptions
* Limitations
  is not decision support—it’s noise.

---

## 7. Fixing the Pipeline, Not Just the Model

### 7.1 What “Production-Ready” Actually Means

A defensible clinical model includes:

* Data versioning
* Assumption documentation
* Performance monitoring
* Explicit failure conditions

```r
sessionInfo()
```

### 7.2 Audit-Ready by Design

* Reproducible reports
* Logged predictions
* Model cards generated automatically

```r
rmarkdown::render("model_report.Rmd")
```

---

## 8. A Different Definition of Success

A successful clinical model is not one that:

* Maximizes a metric
* Impresses reviewers
* Looks elegant in isolation

It is one that:

* Fails gracefully
* Communicates uncertainty
* Improves decisions under pressure

---

## Closing Thoughts

Most clinical models fail because they are **optimized for publication, not practice**.

Fixing this does not require exotic algorithms.
It requires:

* Statistical humility
* Operational awareness
* Ethical clarity

R gives us the tools.
The responsibility is ours.

---

## Appendix

* Reproducibility checklist
* Simulation code
* Sensitivity analysis templates

---

## Series Callout

{{< include ../series-callout.qmd >}}