Calibration Toolkit (Slope, EWMA, Governance)

Toolkit
Calibration
A practical calibration toolkit for deployed clinical prediction models, including intercept and slope monitoring, EWMA-based drift detection, and governance checklists.
Published: March 8, 2026

Modified: June 9, 2026

Executive Summary

This appendix is a reusable calibration toolkit designed for deployed clinical prediction models.

It includes:

  • Calibration slope and intercept code (calibration-in-the-large and confidence drift)
  • EWMA threshold selection (how to set monitoring sensitivity without guesswork theater)
  • Governance checklist (what to document, log, and trigger when drift is detected)

Calibration is not a one-time evaluation. It is an operational obligation, especially for clinical prediction models that may drift over time or across settings (Steyerberg 2019; Van Calster et al. 2016).


1. Setup

library(dplyr)
library(tibble)
library(ggplot2)
library(qcc)

set.seed(20231101)

Assume you have production scoring logs (or evaluation data) with:

  • y: binary outcome (0/1)
  • p_hat: predicted probability (0–1)
  • score_date: date/time of prediction
  • optional grouping fields (site, service, unit, etc.)

# df <- readRDS("data_processed/prediction_log.rds")
# Required columns: y, p_hat, score_date
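
For readers without a production log at hand, the sketch below simulates one with the three required columns. Everything in it is an assumption made for illustration: the helper name `simulate_prediction_log`, the sample size, the date range, and the choice to draw outcomes directly from `p_hat` so that the simulated model is well calibrated by construction.

```r
# Hypothetical helper (not part of the toolkit): simulate a prediction log
# with the required columns y, p_hat, score_date.
simulate_prediction_log <- function(n = 6000,
                                    start = as.Date("2025-01-01"),
                                    days = 365) {
  lp <- rnorm(n, mean = -2, sd = 1)          # linear predictor on the logit scale
  p_hat <- plogis(lp)                        # predicted probability
  data.frame(
    y = rbinom(n, size = 1, prob = p_hat),   # outcomes drawn from p_hat itself
    p_hat = p_hat,
    score_date = start + sample.int(days, n, replace = TRUE) - 1L
  )
}

set.seed(20231101)
df <- simulate_prediction_log()
str(df)
```

Replacing `prob = p_hat` with a shifted or flattened probability is a simple way to create a miscalibrated log for testing the functions that follow.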

2. Calibration Slope and Intercept Code

2.1 Why slope and intercept matter

Discrimination alone is insufficient for deployment monitoring (Harrell 2015; Steyerberg 2019). A model can keep a stable AUROC while its calibration deteriorates in either of two ways:

  • Intercept drifts (systematically too high/low)
  • Slope drifts (overconfident or underconfident)

You should monitor both.


2.2 Helper: safe logit transform

safe_logit <- function(p, eps = 1e-6) {
  p2 <- pmin(pmax(p, eps), 1 - eps)
  qlogis(p2)
}

2.3 Calibration-in-the-large (intercept only)

This estimates the additive shift needed to correct systematic bias while assuming the existing linear predictor is otherwise correct.

Model: \[ \text{logit}(P(Y=1)) = \alpha + \text{offset}(\text{logit}(\hat p)) \]

calibration_intercept <- function(y01, p_hat) {
  lp <- safe_logit(p_hat)
  fit <- glm(y01 ~ offset(lp), family = binomial())
  unname(coef(fit)[1])
}

Interpretation:

  • 0 means calibrated-in-the-large
  • positive means true risk > predicted
  • negative means predicted risk too high
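
As a quick check of the sign convention, the sketch below simulates a model that underpredicts risk by a known amount and recovers that amount with the same offset model. The data, the shift of 0.5 on the logit scale, and the variable names are assumptions for illustration only.

```r
# Simulated check: true risk exceeds predicted risk by 0.5 on the logit scale.
set.seed(1)
n <- 20000
lp_pred <- rnorm(n, mean = -2, sd = 1)           # logit of the model's predictions
y <- rbinom(n, 1, plogis(lp_pred + 0.5))         # outcomes from the shifted (true) risk

# Intercept-only recalibration: the slope is fixed at 1 through the offset
fit <- glm(y ~ offset(lp_pred), family = binomial())
alpha_hat <- unname(coef(fit)[1])
alpha_hat   # should land near +0.5, i.e. true risk > predicted
```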

2.4 Calibration slope + intercept (logistic recalibration)

Model: \[ \text{logit}(P(Y=1)) = \alpha + \beta \cdot \text{logit}(\hat p) \]

calibration_slope_intercept <- function(y01, p_hat) {
  lp <- safe_logit(p_hat)
  fit <- glm(y01 ~ lp, family = binomial())
  tibble(
    intercept = unname(coef(fit)[1]),
    slope = unname(coef(fit)[2])
  )
}

Interpretation of slope:

  • ~1 is ideal
  • <1 means predictions are too extreme (overconfident)
  • >1 means predictions are too timid (underconfident)
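
The slope interpretation can be checked the same way. In this sketch the predictions are deliberately too extreme: the true logit is only 0.6 times the predicted logit, so the fitted slope should fall below 1. The simulated data and the factor 0.6 are illustrative assumptions.

```r
# Simulated check: overconfident predictions yield a calibration slope below 1.
set.seed(2)
n <- 20000
lp_pred <- rnorm(n, mean = -1, sd = 1.5)         # logit of the model's predictions
y <- rbinom(n, 1, plogis(0.6 * lp_pred))         # true logit is less extreme than predicted

fit <- glm(y ~ lp_pred, family = binomial())
slope_hat <- unname(coef(fit)[2])
slope_hat   # should land near 0.6
```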

2.5 Windowed calibration metrics (monthly / weekly monitoring)

This converts calibration into a time series that statistical process control (SPC) charts can monitor.

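calibration_by_period <- function(df, period = "month", min_n = 200) {
  df %>%
    # drop rows with a missing outcome or prediction so n matches what glm() fits
    filter(!is.na(y), !is.na(p_hat)) %>%
    # .env$period is the function argument, not a column of df
    mutate(period = as.Date(cut(score_date, .env$period))) %>%
    group_by(period) %>%
    # drop small windows before fitting, so unstable fits are never attempted
    filter(n() >= min_n) %>%
    summarise(
      n = n(),
      event_rate = mean(y),
      brier = mean((y - p_hat)^2),
      calib_int = calibration_intercept(y, p_hat),
      slope_int = list(calibration_slope_intercept(y, p_hat)),
      .groups = "drop"
    ) %>%
    # expands slope_int into the columns `intercept` and `slope`
    tidyr::unnest_wider(slope_int)
}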

# usage
# cal_ts <- calibration_by_period(df, period = "month", min_n = 250)
# cal_ts

3. EWMA Threshold Selection (Practical, Audit-Friendly)

3.1 Why EWMA

EWMA is suited for:

  • gradual drift
  • small persistent shifts
  • early detection without “false alarm storms”

It is especially useful when calibration drift is gradual rather than abrupt, which is common in production healthcare settings (Van Calster et al. 2016).

In calibration monitoring, EWMA is often preferable to Shewhart charts, which react only to the most recent point and are therefore slow to detect small sustained shifts.


3.2 What are we charting?

You can EWMA-chart any calibration metric; common choices:

  • calib_int (intercept drift)
  • slope (confidence drift)
  • brier (overall probabilistic degradation)

Start simple:

  • intercept EWMA is the cleanest operational signal

3.3 EWMA knobs you must justify

EWMA requires:

  • lambda (smoothing): typically 0.05–0.3
  • L (control limit width): typically 2.7–3.0
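
For reference, these two knobs enter the standard EWMA chart as follows. The symbols \(x_t\) (the calibration metric in period \(t\)), \(z_t\) (the smoothed statistic), \(\mu_0\) and \(\sigma_0\) (the baseline mean and standard deviation) are notation introduced here, not taken from the text above.

\[ z_t = \lambda x_t + (1 - \lambda) z_{t-1}, \qquad z_0 = \mu_0 \]

\[ \text{limits}_t = \mu_0 \pm L \, \sigma_0 \sqrt{\frac{\lambda}{2 - \lambda}\left[1 - (1 - \lambda)^{2t}\right]} \]

A smaller lambda gives older periods more weight and narrows the limits; a larger L widens them. The limits apply to \(z_t\), not to the raw \(x_t\).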

Audit-friendly framing:

  • lambda controls memory
  • L controls false alarms vs missed drift
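
The usage example in section 3.5 calls `ewma_fit_from_baseline()`, which is not defined anywhere in this toolkit as published. The sketch below is an assumed reconstruction inferred only from that call: it estimates the in-control center and spread from a documented baseline window and carries `lambda` and `L` along so the chosen settings live in one object. The body, the input checks, and the example numbers are hypothetical.

```r
# Hypothetical reconstruction of ewma_fit_from_baseline(), inferred from how
# it is called in the usage example; not the author's original definition.
ewma_fit_from_baseline <- function(x, baseline_idx, lambda = 0.2, L = 3) {
  stopifnot(lambda > 0, lambda <= 1, L > 0)
  x0 <- x[baseline_idx]
  x0 <- x0[is.finite(x0)]            # ignore missing or non-finite baseline values
  stopifnot(length(x0) >= 2)         # need at least two points to estimate a spread
  list(
    mu0 = mean(x0),                  # in-control center
    sd0 = sd(x0),                    # in-control period-to-period spread
    lambda = lambda,
    L = L
  )
}

# Illustrative series: six quiet baseline periods, then a jump
x_demo <- c(0.02, -0.01, 0.03, 0.00, -0.02, 0.01, 0.15)
pars <- ewma_fit_from_baseline(x_demo, baseline_idx = 1:6)
pars
```

With only a handful of baseline periods the estimate of `sd0` is noisy, so the baseline rule (how many periods, and why those) belongs in the governance record.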

3.5 Run EWMA and flag out-of-control points

ewma_flag <- function(x, mu0, sd0, lambda = 0.2, L = 3) {
  # Fix center and std.dev at their baseline values; if they are omitted,
  # qcc estimates them from x itself, drifted periods included
  chart <- qcc::ewma(
    x,
    center = mu0,
    std.dev = sd0,
    lambda = lambda,
    nsigmas = L,
    plot = FALSE
  )

  # The control limits apply to the smoothed EWMA statistic (chart$y),
  # not to the raw series x
  z <- as.numeric(chart$y)
  lcl <- chart$limits[, 1]
  ucl <- chart$limits[, 2]
  ooc <- which(z > ucl | z < lcl)

  list(chart = chart, ewma = z, ooc = ooc, ucl = ucl, lcl = lcl)
}

Usage:

# cal_ts <- calibration_by_period(df, "month", min_n = 250)
# x <- cal_ts$calib_int

# baseline is first K periods (document your rule)
# K <- 6
# base <- 1:K
# pars <- ewma_fit_from_baseline(x, baseline_idx = base, lambda = 0.2, L = 3)
# res <- ewma_flag(x, mu0 = pars$mu0, sd0 = pars$sd0, lambda = pars$lambda, L = pars$L)

# res$ooc

3.6 Plot with explicit triggers

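plot_ewma <- function(cal_ts, metric = "calib_int", ewma_res) {
  x <- cal_ts[[metric]]

  dfp <- tibble(
    period = cal_ts$period,
    x = x,
    ewma = as.numeric(ewma_res$chart$y),  # smoothed statistic the limits apply to
    ucl = ewma_res$ucl,
    lcl = ewma_res$lcl,
    flag = seq_along(x) %in% ewma_res$ooc
  )

  ggplot(dfp, aes(x = period)) +
    geom_point(aes(y = x), alpha = 0.4) +            # raw per-period metric
    geom_line(aes(y = ewma)) +                       # EWMA statistic
    geom_point(aes(y = ewma, shape = flag)) +        # flagged periods
    geom_line(aes(y = ucl), linetype = "dashed") +   # upper control limit
    geom_line(aes(y = lcl), linetype = "dashed") +   # lower control limit
    labs(
      title = paste0("EWMA Monitoring: ", metric),
      x = "Period",
      y = metric
    )
}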

# usage
# plot_ewma(cal_ts, "calib_int", res)

3.7 Selecting lambda and L (a defensible rule)

A simple, documentable approach:

  • Use lambda = 0.2 for monthly monitoring (moderate memory)
  • Use L = 3 to balance false alarms vs missed drift
  • Validate sensitivity via simulation: inject a known intercept shift and verify detection delay

This is not perfect. It is transparent and testable.
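
One way to run the simulation check is sketched below in base R, so that it does not depend on any charting package. The helper `ewma_first_signal`, the assumed in-control standard deviation of 0.05, the injected shift of 0.10, and the 60-period horizon are all illustrative assumptions; substitute values estimated from your own baseline.

```r
# Hypothetical helper: first period at which the EWMA statistic leaves its
# time-varying control limits, or NA if it never does.
ewma_first_signal <- function(x, mu0, sd0, lambda = 0.2, L = 3) {
  z <- mu0
  for (t in seq_along(x)) {
    z <- lambda * x[t] + (1 - lambda) * z
    half_width <- L * sd0 *
      sqrt(lambda / (2 - lambda) * (1 - (1 - lambda)^(2 * t)))
    if (abs(z - mu0) > half_width) return(t)
  }
  NA_integer_
}

set.seed(20231101)
sd0 <- 0.05      # assumed in-control SD of the per-period intercept
shift <- 0.10    # injected intercept shift, present from the first monitored period
n_sim <- 2000    # simulated monitoring runs
horizon <- 60    # periods per run

# Runs with the shift: how long until the chart signals?
delay <- replicate(
  n_sim,
  ewma_first_signal(rnorm(horizon, mean = shift, sd = sd0), mu0 = 0, sd0 = sd0)
)

# Runs with no shift: how often does the chart signal anyway?
false_alarm <- replicate(
  n_sim,
  ewma_first_signal(rnorm(horizon, mean = 0, sd = sd0), mu0 = 0, sd0 = sd0)
)

median(delay, na.rm = TRUE)     # typical number of periods until detection
mean(!is.na(false_alarm))       # share of in-control runs with any alarm in the horizon
```

Repeating the run over a small grid of `lambda` and `L` values gives a table of detection delay against false-alarm rate that can be attached to the monitoring plan.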


4. Governance Checklist (Calibration in Production)

4.1 What to log (minimum viable audit trail)

Per prediction batch / period:

  • model identifier (version / hash)
  • training data fingerprint
  • scoring data fingerprint
  • prediction timestamp range
  • number scored (N)
  • observed outcome N (when available)
  • calibration intercept and slope
  • Brier score (or other pre-agreed metric)
  • alert status (in control / warning / action)

4.2 Define action thresholds and escalation paths

Document:

  • what constitutes a warning vs action
  • who receives alerts
  • acceptable time-to-review
  • acceptable time-to-remediation

Example policy language:

  • Warning: EWMA out-of-control 1 period OR slope outside [0.8, 1.2]
  • Action: 2 consecutive OOC periods OR intercept drift beyond clinically relevant margin
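
Writing the policy as code removes ambiguity about how the OR conditions combine. The sketch below is one possible encoding of the example language above; the function name `alert_status`, the intercept margin of 0.2 on the logit scale, and the treatment of the first period are placeholders, not recommendations.

```r
# Hypothetical encoding of the example policy; thresholds are placeholders.
# ooc:       logical, TRUE where the EWMA statistic is out of control
# slope:     calibration slope per period
# intercept: calibration intercept per period (logit scale)
alert_status <- function(ooc, slope, intercept,
                         slope_range = c(0.8, 1.2),
                         intercept_margin = 0.2) {
  stopifnot(length(ooc) == length(slope), length(ooc) == length(intercept))
  prev_ooc <- c(FALSE, head(ooc, -1))   # the first period has no predecessor
  is_action <- (ooc & prev_ooc) | abs(intercept) > intercept_margin
  is_warning <- ooc | slope < slope_range[1] | slope > slope_range[2]
  ifelse(is_action, "action", ifelse(is_warning, "warning", "in control"))
}

status <- alert_status(
  ooc = c(FALSE, TRUE, TRUE, FALSE),
  slope = c(1.00, 1.05, 0.95, 0.75),
  intercept = c(0.01, 0.05, 0.08, 0.02)
)
status
# → "in control" "warning" "action" "warning"
```

Action takes precedence over warning, so a period that meets both sets of conditions is reported once, at the higher level.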

4.3 Approved remediation options (pre-specify)

Remediation must be pre-specified to avoid “moving goalposts.”

Typical options:

  1. Intercept-only recalibration (fast, transparent)
  2. Full logistic recalibration (slope + intercept)
  3. Stratified recalibration (by site/service if drift is localized)
  4. Retraining (only if concept drift is demonstrated)
  5. Temporary suspension (if drift indicates harm risk)
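
The first two options amount to a fixed transformation of the existing predictions. A minimal sketch, assuming `alpha` and `beta` were estimated on a pre-specified recent window with the models from sections 2.3 and 2.4; the function names are hypothetical.

```r
# Option 1: intercept-only recalibration, shifting every prediction by alpha
# on the logit scale (alpha from the offset model in section 2.3).
recalibrate_intercept <- function(p_hat, alpha, eps = 1e-6) {
  p <- pmin(pmax(p_hat, eps), 1 - eps)   # keep the logit finite
  plogis(qlogis(p) + alpha)
}

# Option 2: full logistic recalibration, with alpha and beta from the
# slope + intercept model in section 2.4.
recalibrate_logistic <- function(p_hat, alpha, beta, eps = 1e-6) {
  p <- pmin(pmax(p_hat, eps), 1 - eps)
  plogis(alpha + beta * qlogis(p))
}

p_old <- c(0.05, 0.20, 0.60)
recalibrate_intercept(p_old, alpha = 0.5)              # every risk moves up
recalibrate_logistic(p_old, alpha = 0.1, beta = 0.8)   # risks pulled toward the middle
```

Storing `alpha` and `beta` alongside the model version keeps the recalibrated scores reproducible for audit.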

4.4 Documentation artifacts to retain

For each alert event:

  • calibration monitoring plot(s)
  • summary table of metrics
  • decision log (who decided what, when, why)
  • code snapshot used to compute metrics
  • any recalibration fit objects / parameters
  • post-fix validation results

Note: Where This Shows Up in AI/ML

Calibration is the most deployment-critical and least-reported metric in clinical AI: a model with 0.85 AUC that overstates probabilities by 30% will cause clinicians to over-triage low-risk patients and exhaust resources on false alarms. FDA’s SaMD guidance explicitly calls out calibration as a required performance metric alongside discrimination, yet the majority of published clinical prediction models report only AUC. DoDTR-based trauma models deployed via MAVEN must be recalibrated when moved to a new MTF population, because calibration is far more sensitive to population shift than AUC. Skipping the reliability diagram and Brier score decomposition before deployment is not a shortcut; it is a guarantee that the model’s probability outputs will be wrong in a direction you have not characterized.

Closing Notes

Calibration is where deployed models become, in the language of modern prediction-model governance, either operationally trustworthy or quietly unsafe (Steyerberg 2019; Osheroff et al. 2007).

A calibration toolkit is not just analytics. It is governance made measurable.


Series Callout

Note

This post is part of a broader Toolkit Series for Applied Statistics, AI, and Clinical Analytics:

  • Bayesian Workflow Toolkit
  • Calibration Toolkit
  • Missing Data Toolkit
  • Rare Events Toolkit
  • Causal Inference Toolkit
  • Survival Analysis Toolkit
  • Prediction Modeling Toolkit
  • Real-World Evidence Toolkit
  • OMOP and Interoperability Toolkit
  • Trauma Registry Analytics Toolkit

References

Harrell, Jr., Frank E. 2015. Regression Modeling Strategies. 2nd ed. Springer.
Osheroff, Jerome A., Jonathan M. Teich, Blackford Middleton, Eric B. Steen, Adam Wright, and Don E. Detmer. 2007. “A Roadmap for National Action on Clinical Decision Support.” Journal of the American Medical Informatics Association 14 (2): 141–45. https://doi.org/10.1197/jamia.M2334.
Steyerberg, Ewout W. 2019. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. 2nd ed. Springer.
Van Calster, Ben, Daan Nieboer, Yvonne Vergouwe, Ben De Cock, Michael J. Pencina, and Ewout W. Steyerberg. 2016. “A Calibration Hierarchy for Risk Models Was Defined: From Utopia to Empirical Data.” Journal of Clinical Epidemiology 74: 167–76. https://doi.org/10.1016/j.jclinepi.2015.12.005.