Calibration Toolkit (Slope, EWMA, Governance)

Toolkit

Calibration

A practical calibration toolkit for deployed clinical prediction models, including intercept and slope monitoring, EWMA-based drift detection, and governance checklists.

Published

March 8, 2026

Modified

June 9, 2026

Executive Summary

This appendix is a reusable calibration toolkit designed for deployed clinical prediction models.

It includes:

Calibration slope and intercept code (calibration-in-the-large and confidence drift)
EWMA threshold selection (how to set monitoring sensitivity without guesswork theater)
Governance checklist (what to document, log, and trigger when drift is detected)

Calibration is not a one-time evaluation. It is an operational obligation, especially for clinical prediction models that may drift over time or across settings (Steyerberg 2019; Van Calster et al. 2016).

1. Setup

library(dplyr)
library(tibble)
library(ggplot2)
library(qcc)

set.seed(20231101)

Assume you have production scoring logs (or evaluation data) with:

y (0/1 outcome)
p_hat predicted probability (0–1)
score_date date/time of prediction
optional grouping fields (site, service, unit, etc.)

# df <- readRDS("data_processed/prediction_log.rds")
# Required columns: y, p_hat, score_date
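If no production log is at hand, a simulated one lets the later functions be exercised end to end. The sketch below is a stand-in, not real data: the column names match the requirements above, while the sample size, date range, risk distribution, and the +0.4 logit shift injected from July 2025 onward are all arbitrary assumptions chosen for illustration.

```r
# Simulated prediction log (illustrative only; every number here is arbitrary).
set.seed(20231101)

n <- 12000
score_date <- sample(
  seq(as.Date("2024-01-01"), as.Date("2025-12-31"), by = "day"),
  size = n, replace = TRUE
)

# The model's linear predictor and predicted probability
lp_hat <- rnorm(n, mean = -2, sd = 1)
p_hat  <- plogis(lp_hat)

# True risk matches the model until mid-2025, then rises by 0.4 on the logit scale
shift <- ifelse(score_date >= as.Date("2025-07-01"), 0.4, 0)
y     <- rbinom(n, size = 1, prob = plogis(lp_hat + shift))

df <- data.frame(y = y, p_hat = p_hat, score_date = score_date)
str(df)
```

Because the drift is injected at a known date, this data set also gives a rough check on whether the monitoring code in later sections reacts when it should.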

2. Calibration Slope and Intercept Code

2.1 Why slope and intercept matter

A model can keep a stable AUROC while its calibration deteriorates, which is one reason discrimination alone is insufficient for deployment monitoring (Harrell 2015; Steyerberg 2019). Two failure modes are common:

Intercept drifts (systematically too high/low)
Slope drifts (overconfident or underconfident)

You should monitor both.

2.2 Helper: safe logit transform

safe_logit <- function(p, eps = 1e-6) {
  p2 <- pmin(pmax(p, eps), 1 - eps)
  qlogis(p2)
}
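A quick check of why the clipping is needed: the raw logit is infinite at exactly 0 or 1, and a single such value is enough to break the regression fits that follow. The helper is repeated here only so the snippet runs on its own; the test vector is made up.

```r
# Same helper as above, repeated so this snippet is self-contained
safe_logit <- function(p, eps = 1e-6) {
  p2 <- pmin(pmax(p, eps), 1 - eps)
  qlogis(p2)
}

p <- c(0, 0.02, 0.5, 1)
qlogis(p)       # -Inf and Inf at the boundaries
safe_logit(p)   # finite everywhere; boundary values are about -13.8 and 13.8
```

The choice of eps is a judgment call. If many predictions sit at the clipping bounds, the fitted slope and intercept can become sensitive to it, so it is worth recording alongside the metrics.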

2.3 Calibration-in-the-large (intercept only)

This estimates the additive shift needed to correct systematic bias while assuming the existing linear predictor is otherwise correct.

Model: \[ \text{logit}(P(Y=1)) = \alpha + \text{offset}(\text{logit}(\hat p)) \]

calibration_intercept <- function(y01, p_hat) {
  lp <- safe_logit(p_hat)
  fit <- glm(y01 ~ offset(lp), family = binomial())
  unname(coef(fit)[1])
}

Interpretation:

0 means calibrated-in-the-large
positive means observed risk exceeds predicted risk (the model under-predicts)
negative means predicted risk exceeds observed risk (the model over-predicts)
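To see the sign convention in action, the sketch below simulates a model whose log-odds are uniformly 0.5 too low and checks that the offset model recovers a value near +0.5. The sample size and risk distribution are assumptions made for the illustration.

```r
set.seed(1)
n      <- 20000
lp_hat <- rnorm(n, mean = -1.5, sd = 1)   # the model's linear predictor
p_hat  <- plogis(lp_hat)

# Truth: every patient's log-odds are 0.5 higher than the model believes
y <- rbinom(n, 1, plogis(lp_hat + 0.5))

# Intercept-only fit with the predicted log-odds held fixed as an offset
fit   <- glm(y ~ offset(qlogis(p_hat)), family = binomial())
alpha <- unname(coef(fit)[1])
alpha   # should land close to 0.5

# Under-prediction also shows up on the probability scale
c(observed = mean(y), predicted = mean(p_hat))
```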

2.4 Calibration slope + intercept (logistic recalibration)

Model: \[ \text{logit}(P(Y=1)) = \alpha + \beta \cdot \text{logit}(\hat p) \]

calibration_slope_intercept <- function(y01, p_hat) {
  lp <- safe_logit(p_hat)
  fit <- glm(y01 ~ lp, family = binomial())
  tibble(
    intercept = unname(coef(fit)[1]),
    slope = unname(coef(fit)[2])
  )
}

Interpretation of slope:

~1 is ideal
<1 means predictions are too extreme (overconfident)
>1 means predictions are too timid (underconfident)
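The slope interpretation can be checked the same way. In this sketch the model's log-odds are deliberately stretched away from the truth, so its predictions are too extreme and the fitted slope should come out near 0.6. The stretch factor and sample size are assumptions, and the Wald interval from `confint.default()` is one simple option for an uncertainty statement, not the only one.

```r
set.seed(2)
n       <- 20000
lp_true <- rnorm(n, mean = -1.5, sd = 1)
y       <- rbinom(n, 1, plogis(lp_true))

# An overconfident model: predicted log-odds are the true ones divided by 0.6
lp_hat <- lp_true / 0.6
p_hat  <- plogis(lp_hat)

fit <- glm(y ~ qlogis(p_hat), family = binomial())
est <- unname(coef(fit))
ci  <- confint.default(fit)   # Wald confidence intervals

est[2]    # slope, should be near 0.6
ci[2, ]   # interval should exclude 1
```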

2.5 Windowed calibration metrics (monthly / weekly monitoring)

This converts calibration into a time series that statistical process control (SPC) charts can monitor.

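calibration_by_period <- function(df, period = "month", min_n = 200) {
  # Assign each row to the start date of its period. Done outside the pipeline
  # so `period` clearly refers to the argument, not to a column of df.
  df$period <- as.Date(cut(df$score_date, breaks = period))

  df %>%
    # Keep only rows with both an observed outcome and a prediction
    filter(!is.na(y), !is.na(p_hat)) %>%
    group_by(period) %>%
    # Drop sparse periods before fitting, so glm() is never run on them
    filter(n() >= min_n) %>%
    summarise(
      n = n(),
      event_rate = mean(y),
      brier = mean((y - p_hat)^2),
      calib_int = calibration_intercept(y, p_hat),
      slope_int = list(calibration_slope_intercept(y, p_hat)),
      .groups = "drop"
    ) %>%
    # Expand the one-row tibble into `intercept` and `slope` columns
    tidyr::unnest_wider(slope_int)
}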

# usage
# cal_ts <- calibration_by_period(df, period = "month", min_n = 250)
# cal_ts
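The `min_n` argument matters because a calibration intercept estimated on a small window is very noisy. As a rough guide, the sketch below uses the approximation SE ≈ 1 / sqrt(n · p · (1 − p)), which holds for the intercept-only model under the simplifying assumption that every predicted risk equals the event rate p. The function name and the 10% event rate are illustrative choices.

```r
# Approximate SE of the calibration intercept for a window of n predictions,
# assuming (as a simplification) that every predicted risk equals event_rate.
approx_intercept_se <- function(n, event_rate) {
  1 / sqrt(n * event_rate * (1 - event_rate))
}

n_grid <- c(100, 250, 500, 1000, 5000)
data.frame(
  n  = n_grid,
  se = round(approx_intercept_se(n_grid, event_rate = 0.10), 3)
)
```

At a 10% event rate, a window of 100 predictions gives a standard error of about 0.33 on the logit scale, which is too noisy to chart usefully. With heterogeneous predicted risks the true standard error is somewhat larger than this approximation, so treat it as an optimistic bound when choosing `min_n`.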

3. EWMA Threshold Selection (Practical, Audit-Friendly)

3.1 Why EWMA

EWMA is suited for:

gradual drift
small persistent shifts
early detection without “false alarm storms”

It is especially useful when calibration drift is gradual rather than abrupt, which is common in production healthcare settings (Van Calster et al. 2016).

In calibration monitoring, EWMA is often preferable to Shewhart charts, because a Shewhart chart reacts only to the current point and is therefore slow to detect small sustained shifts.

3.2 What are we charting?

You can EWMA-chart any calibration metric; common choices:

calib_int (intercept drift)
slope (confidence drift)
brier (overall probabilistic degradation)

Start simple:

intercept EWMA is the cleanest operational signal

3.3 EWMA knobs you must justify

EWMA requires:

lambda (smoothing): typically 0.05–0.3
L (control limit width): typically 2.7–3.0

Audit-friendly framing:

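lambda controls memory: smaller values weight history more heavily and respond more slowly
L controls false alarms vs missed drift: wider limits mean fewer false alarms but slower detection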
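For auditors it can help to show exactly what the two knobs do. The sketch below hand-rolls the standard EWMA recursion, z_t = lambda · x_t + (1 − lambda) · z_(t−1) with z_0 set to the baseline mean, together with the usual time-varying limits. The function name `ewma_manual` and the toy input are assumptions made for this illustration; it is not a replacement for a tested SPC package.

```r
# Hand-rolled EWMA with time-varying control limits (base R, illustrative).
ewma_manual <- function(x, mu0, sd0, lambda = 0.2, L = 3) {
  n <- length(x)
  z <- numeric(n)
  prev <- mu0   # z_0 starts at the baseline mean
  for (t in seq_len(n)) {
    z[t] <- lambda * x[t] + (1 - lambda) * prev
    prev <- z[t]
  }
  # Half-width starts at L * lambda * sd0 and widens toward
  # L * sd0 * sqrt(lambda / (2 - lambda)) as t grows
  k <- seq_len(n)
  half <- L * sd0 * sqrt(lambda / (2 - lambda) * (1 - (1 - lambda)^(2 * k)))
  data.frame(t = k, x = x, z = z, lcl = mu0 - half, ucl = mu0 + half)
}

out <- ewma_manual(c(0, 0, 1, 1, 1), mu0 = 0, sd0 = 1, lambda = 0.2, L = 3)
out
```

In the toy series the raw value jumps from 0 to 1 at the third point, while the smoothed statistic climbs only gradually (0.2, 0.36, 0.488). That lag is the "memory" that lambda controls.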

3.4 Baseline phase approach (recommended)

Choose a “stable baseline” period (pre-deployment, or first 3–6 months of stable ops). Estimate mean and SD of the metric during baseline. Set EWMA limits relative to baseline variability.

ewma_fit_from_baseline <- function(x, baseline_idx, lambda = 0.2, L = 3) {
  xb <- x[baseline_idx]
  mu0 <- mean(xb, na.rm = TRUE)
  sd0 <- sd(xb, na.rm = TRUE)

  list(mu0 = mu0, sd0 = sd0, lambda = lambda, L = L)
}

3.5 Run EWMA and flag out-of-control points

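ewma_flag <- function(x, mu0, sd0, lambda = 0.2, L = 3) {
  # Passing center and std.dev fixes the limits at the baseline values
  # instead of letting qcc re-estimate them from x
  chart <- qcc::ewma(
    x,
    center = mu0,
    std.dev = sd0,
    lambda = lambda,
    nsigmas = L,
    plot = FALSE
  )

  # The limits apply to the smoothed EWMA statistic (chart$y), not to raw x,
  # so out-of-control points are identified by comparing chart$y to the limits
  z <- as.numeric(chart$y)
  lcl <- chart$limits[, 1]
  ucl <- chart$limits[, 2]
  ooc <- which(z > ucl | z < lcl)

  list(chart = chart, ewma = z, ooc = ooc, ucl = ucl, lcl = lcl)
}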

Usage:

# cal_ts <- calibration_by_period(df, "month", min_n = 250)
# x <- cal_ts$calib_int

# baseline is first K periods (document your rule)
# K <- 6
# base <- 1:K
# pars <- ewma_fit_from_baseline(x, baseline_idx = base, lambda = 0.2, L = 3)
# res <- ewma_flag(x, mu0 = pars$mu0, sd0 = pars$sd0, lambda = pars$lambda, L = pars$L)

# res$ooc

3.6 Plot with explicit triggers

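plot_ewma <- function(cal_ts, metric = "calib_int", ewma_res) {
  x <- cal_ts[[metric]]

  dfp <- tibble(
    period = cal_ts$period,
    x = x,                               # raw per-period metric
    z = as.numeric(ewma_res$chart$y),    # smoothed EWMA statistic
    ucl = ewma_res$ucl,
    lcl = ewma_res$lcl,
    flag = seq_along(x) %in% ewma_res$ooc
  )

  # The control limits belong to the EWMA statistic, so that is the line drawn
  # against them; raw values are shown in grey for context only
  ggplot(dfp, aes(x = period)) +
    geom_point(aes(y = x), colour = "grey60") +
    geom_line(aes(y = z)) +
    geom_point(aes(y = z, shape = flag)) +
    geom_line(aes(y = ucl), linetype = "dashed") +
    geom_line(aes(y = lcl), linetype = "dashed") +
    labs(
      title = paste0("EWMA Monitoring: ", metric),
      x = "Period",
      y = metric,
      shape = "Out of control"
    )
}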

# usage
# plot_ewma(cal_ts, "calib_int", res)

3.7 Selecting lambda and L (a defensible rule)

A simple, documentable approach:

Use lambda = 0.2 for monthly monitoring (moderate memory)
Use L = 3 to balance false alarms vs missed drift
Validate sensitivity via simulation: inject a known intercept shift and verify detection delay

This is not perfect. It is transparent and testable.
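The simulation check can be sketched in a few lines of base R. This version assumes the per-period intercept estimates are independent and normally distributed around the baseline with a known SD of 0.1, and that the shift is sustained from the first monitored period. Both are simplifications, and the function names, shift sizes, and 36-period horizon are illustrative.

```r
set.seed(3)

# First period at which the EWMA leaves its limits (NA if it never does)
first_signal <- function(x, mu0, sd0, lambda, L) {
  z <- mu0
  for (t in seq_along(x)) {
    z <- lambda * x[t] + (1 - lambda) * z
    half <- L * sd0 * sqrt(lambda / (2 - lambda) * (1 - (1 - lambda)^(2 * t)))
    if (abs(z - mu0) > half) return(t)
  }
  NA_integer_
}

# Simulate per-period intercept estimates: noise around 0 plus a sustained shift
simulate_delay <- function(shift, lambda, L, sd0 = 0.1,
                           n_periods = 36, n_sim = 2000) {
  delays <- replicate(n_sim, {
    x <- rnorm(n_periods, mean = shift, sd = sd0)
    first_signal(x, mu0 = 0, sd0 = sd0, lambda = lambda, L = L)
  })
  c(
    shift        = shift,
    signal_rate  = mean(!is.na(delays)),       # share of runs that ever signal
    median_delay = median(delays, na.rm = TRUE)
  )
}

delay_tab <- rbind(
  simulate_delay(shift = 0.0, lambda = 0.2, L = 3),  # false-alarm behaviour
  simulate_delay(shift = 0.1, lambda = 0.2, L = 3),  # shift of 1 baseline SD
  simulate_delay(shift = 0.2, lambda = 0.2, L = 3)   # shift of 2 baseline SDs
)
delay_tab
```

The first row estimates how often the chart signals over three years when nothing has changed; the other rows show how quickly real shifts are caught. Re-running the table for candidate lambda and L values, and keeping the output, gives a documented basis for the chosen settings.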

4. Governance Checklist (Calibration in Production)

4.1 What to log (minimum viable audit trail)

Per prediction batch / period:

model identifier (version / hash)
training data fingerprint
scoring data fingerprint
prediction timestamp range
number scored (N)
observed outcome N (when available)
calibration intercept and slope
Brier score (or other pre-agreed metric)
alert status (in control / warning / action)

4.2 Define action thresholds and escalation paths

Document:

what constitutes a warning vs action
who receives alerts
acceptable time-to-review
acceptable time-to-remediation

Example policy language:

Warning: EWMA out-of-control 1 period OR slope outside [0.8, 1.2]
Action: 2 consecutive OOC periods OR intercept drift beyond clinically relevant margin
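Policy language is easier to audit when it is also written as code. The sketch below encodes the example rules above, with one deliberate omission: the "clinically relevant margin" for the intercept is left out because its value has to come from clinical stakeholders. The function name `alert_status` and the input vectors are hypothetical.

```r
# Encode the example policy: one status per period.
# ooc:   logical, TRUE where the EWMA is outside its control limits
# slope: calibration slope estimated for each period
alert_status <- function(ooc, slope, slope_range = c(0.8, 1.2)) {
  prev_ooc  <- c(FALSE, head(ooc, -1))   # was the previous period OOC?
  slope_bad <- slope < slope_range[1] | slope > slope_range[2]

  status <- rep("in control", length(ooc))
  status[ooc | slope_bad] <- "warning"
  status[ooc & prev_ooc]  <- "action"    # two consecutive OOC periods
  status
}

status <- alert_status(
  ooc   = c(FALSE, TRUE, FALSE, TRUE, TRUE, FALSE),
  slope = c(1.00, 0.95, 0.75, 1.05, 1.10, 1.00)
)
status
```

In this made-up series, periods 2 to 4 are warnings (a single out-of-control point, a slope of 0.75, and another single out-of-control point), and period 5 escalates to action because it is the second consecutive out-of-control period.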

4.3 Approved remediation options (pre-specify)

Remediation must be pre-specified to avoid “moving goalposts.”

Typical options:

Intercept-only recalibration (fast, transparent)
Full logistic recalibration (slope + intercept)
Stratified recalibration (by site/service if drift is localized)
Retraining (only if concept drift is demonstrated)
Temporary suspension (if drift indicates harm risk)
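The first two options are small enough to show in full. The sketch below simulates a drifted population (the intercept and slope of the drift are assumptions), fits each update on one half of the data, and compares Brier scores on the other half. It illustrates the mechanics only; a real update would be fitted and validated on appropriately separated production data.

```r
set.seed(4)
n      <- 20000
lp_hat <- rnorm(n, mean = -1.5, sd = 1)            # deployed model's log-odds
y      <- rbinom(n, 1, plogis(0.4 + 0.7 * lp_hat)) # drifted truth (assumed)

# Fit the update on one half, check it on the other
idx_fit <- seq_len(n) <= n / 2
d_fit   <- data.frame(y = y[idx_fit],  lp = lp_hat[idx_fit])
d_new   <- data.frame(y = y[!idx_fit], lp = lp_hat[!idx_fit])

# Option 1: intercept-only update (slope held at 1 through the offset)
fit_int <- glm(y ~ offset(lp), family = binomial(), data = d_fit)
p_int   <- plogis(coef(fit_int)[1] + d_new$lp)

# Option 2: full logistic recalibration (intercept and slope)
fit_full <- glm(y ~ lp, family = binomial(), data = d_fit)
p_full   <- plogis(coef(fit_full)[1] + coef(fit_full)[2] * d_new$lp)

brier <- function(y, p) mean((y - p)^2)
brier_tab <- c(
  original       = brier(d_new$y, plogis(d_new$lp)),
  intercept_only = brier(d_new$y, p_int),
  full           = brier(d_new$y, p_full)
)
brier_tab
```

Whichever option is used, the fitted coefficients are the artifact to retain: they define the updated model and belong in the audit trail with the data window they were estimated on.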

4.4 Documentation artifacts to retain

For each alert event:

calibration monitoring plot(s)
summary table of metrics
decision log (who decided what, when, why)
code snapshot used to compute metrics
any recalibration fit objects / parameters
post-fix validation results

Where This Shows Up in AI/ML

Calibration is the most deployment-critical and least-reported metric in clinical AI. A model with 0.85 AUC that overstates probabilities by 30% will cause clinicians to over-triage low-risk patients and exhaust resources on false alarms. FDA’s SaMD guidance explicitly calls out calibration as a required performance metric alongside discrimination, yet the majority of published clinical prediction models report only AUC. DoDTR-based trauma models deployed via MAVEN must be recalibrated when moved to a new MTF population, because calibration is far more sensitive to population shift than AUC. Skipping the reliability diagram and Brier score decomposition before deployment is not a shortcut: it leaves the direction and size of any error in the model’s probability outputs uncharacterized.

Closing Notes

Calibration is where deployed models become, in the language of modern prediction-model governance, either operationally trustworthy or quietly unsafe (Steyerberg 2019; Osheroff et al. 2007).

A calibration toolkit is not just analytics. It is governance made measurable.

Series Callout

This post is part of a broader Toolkit Series for Applied Statistics, AI, and Clinical Analytics:

Bayesian Workflow Toolkit
Calibration Toolkit
Missing Data Toolkit
Rare Events Toolkit
Causal Inference Toolkit
Survival Analysis Toolkit
Prediction Modeling Toolkit
Real-World Evidence Toolkit
OMOP and Interoperability Toolkit
Trauma Registry Analytics Toolkit

References

Harrell, Jr., Frank E. 2015. Regression Modeling Strategies. 2nd ed. Springer.

Osheroff, Jerome A., Jonathan M. Teich, Blackford Middleton, Eric B. Steen, Adam Wright, and Don E. Detmer. 2007. “A Roadmap for National Action on Clinical Decision Support.” Journal of the American Medical Informatics Association 14 (2): 141–45. https://doi.org/10.1197/jamia.M2334.

Steyerberg, Ewout W. 2019. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. 2nd ed. Springer.

Van Calster, Ben, Daan Nieboer, Yvonne Vergouwe, Ben De Cock, Michael J. Pencina, and Ewout W. Steyerberg. 2016. “A Calibration Hierarchy for Risk Models Was Defined: From Utopia to Empirical Data.” Journal of Clinical Epidemiology 74: 167–76. https://doi.org/10.1016/j.jclinepi.2015.12.005.