Calibration Toolkit (Slope, EWMA, Governance)
Executive Summary
This appendix is a reusable calibration toolkit designed for deployed clinical prediction models.
It includes:
- Calibration slope and intercept code (calibration-in-the-large and confidence drift)
- EWMA threshold selection (how to set monitoring sensitivity without guesswork theater)
- Governance checklist (what to document, log, and trigger when drift is detected)
Calibration is not a one-time evaluation. It is an operational obligation, especially for clinical prediction models that may drift over time or across settings (Steyerberg 2019; Van Calster et al. 2016).
1. Setup
library(dplyr)
library(tibble)
library(ggplot2)
library(qcc)
set.seed(20231101)
Assume you have production scoring logs (or evaluation data) with:
- y: 0/1 outcome
- p_hat: predicted probability (0–1)
- score_date: date/time of prediction
- optional grouping fields (site, service, unit, etc.)
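If no production log is at hand, a synthetic one with the columns listed above can stand in while you try the toolkit. The sketch below is only an illustration: the name df_sim, the sample size, the date range, the site labels, and the 0.4 logit drift in the second year are all assumptions made for the example, not properties of any real system.

```r
# Synthetic stand-in for a prediction log; every value here is simulated.
set.seed(1)
n <- 6000

# Prediction dates spread over 24 months
all_days <- seq(as.Date("2022-01-01"), as.Date("2023-12-31"), by = "day")
score_date <- sample(all_days, n, replace = TRUE)

# Model output: predicted probabilities from a hypothetical linear predictor
lp <- rnorm(n, mean = -1.5, sd = 1)
p_hat <- plogis(lp)

# True risk matches p_hat in year one, then drifts up by 0.4 on the logit scale
drift <- ifelse(score_date >= as.Date("2023-01-01"), 0.4, 0)
y <- rbinom(n, size = 1, prob = plogis(lp + drift))

df_sim <- data.frame(
  y = y,
  p_hat = p_hat,
  score_date = score_date,
  site = sample(c("A", "B", "C"), n, replace = TRUE)  # optional grouping field
)
str(df_sim)
```

Because the drift is built in, the intercept-based monitoring described later should pick it up in the second year.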
# df <- readRDS("data_processed/prediction_log.rds")
# Required columns: y, p_hat, score_date
2. Calibration Slope and Intercept Code
2.1 Why slope and intercept matter
Discrimination alone is insufficient for deployment monitoring (Harrell 2015; Steyerberg 2019). A model can keep a stable AUROC while its calibration deteriorates in two distinct ways:
- Intercept drifts (systematically too high/low)
- Slope drifts (overconfident or underconfident)
You should monitor both.
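The point can be checked on simulated data. In the sketch below, the shift of 1 and the factor of 2 on the logit scale are arbitrary illustrative choices, and auroc() is a small rank-based helper written for this example. Because both distortions are monotone transformations of the same score, AUROC cannot change, while the mean prediction and Brier score do.

```r
set.seed(2)
n <- 5000
lp <- rnorm(n, -1, 1.2)
y <- rbinom(n, 1, plogis(lp))     # outcomes follow the undistorted predictions

p_ok      <- plogis(lp)           # calibrated by construction
p_shift   <- plogis(lp + 1)       # intercept drift: systematically too high
p_extreme <- plogis(2 * lp)       # slope drift: predictions too extreme

# Rank-based AUROC (Mann-Whitney form)
auroc <- function(y, p) {
  r <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

res <- data.frame(
  model     = c("calibrated", "intercept drift", "slope drift"),
  auroc     = c(auroc(y, p_ok), auroc(y, p_shift), auroc(y, p_extreme)),
  mean_pred = c(mean(p_ok), mean(p_shift), mean(p_extreme)),
  brier     = c(mean((y - p_ok)^2), mean((y - p_shift)^2), mean((y - p_extreme)^2))
)
res$event_rate <- mean(y)
print(res)
```

The auroc column is identical across rows, which is why a discrimination-only dashboard would show nothing wrong in either drift scenario.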
2.2 Helper: safe logit transform
safe_logit <- function(p, eps = 1e-6) {
p2 <- pmin(pmax(p, eps), 1 - eps)
qlogis(p2)
}
2.3 Calibration-in-the-large (intercept only)
This estimates the additive shift needed to correct systematic bias while assuming the existing linear predictor is otherwise correct.
Model: \[ \text{logit}(P(Y=1)) = \alpha + \text{offset}(\text{logit}(\hat p)) \]
calibration_intercept <- function(y01, p_hat) {
lp <- safe_logit(p_hat)
fit <- glm(y01 ~ offset(lp), family = binomial())
unname(coef(fit)[1])
}
Interpretation:
- 0 means calibrated-in-the-large
- positive means true risk > predicted
- negative means predicted risk too high
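A quick sanity check of the sign convention, as a sketch on simulated data: predictions are deliberately made 0.5 too low on the logit scale (an arbitrary choice for illustration), so the estimated intercept should come out near +0.5. The two helpers are repeated in compact form only so the block runs on its own.

```r
safe_logit <- function(p, eps = 1e-6) qlogis(pmin(pmax(p, eps), 1 - eps))

calibration_intercept <- function(y01, p_hat) {
  lp <- safe_logit(p_hat)
  fit <- glm(y01 ~ offset(lp), family = binomial())
  unname(coef(fit)[1])
}

set.seed(3)
n <- 20000
lp_true <- rnorm(n, -1, 1)
y <- rbinom(n, 1, plogis(lp_true))

# Predictions that under-state risk by 0.5 on the logit scale
p_low <- plogis(lp_true - 0.5)

a_hat <- calibration_intercept(y, p_low)
a_hat   # positive, close to 0.5: true risk is higher than predicted

# Adding the estimated shift back aligns mean prediction with the event rate
p_fixed <- plogis(safe_logit(p_low) + a_hat)
c(event_rate = mean(y), before = mean(p_low), after = mean(p_fixed))
```

After the shift, the mean prediction matches the observed event rate, which is exactly what "calibrated-in-the-large" means.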
2.4 Calibration slope + intercept (logistic recalibration)
Model: \[ \text{logit}(P(Y=1)) = \alpha + \beta \cdot \text{logit}(\hat p) \]
calibration_slope_intercept <- function(y01, p_hat) {
lp <- safe_logit(p_hat)
fit <- glm(y01 ~ lp, family = binomial())
tibble(
intercept = unname(coef(fit)[1]),
slope = unname(coef(fit)[2])
)
}
Interpretation of slope:
- ~1 is ideal
- <1 means predictions are too extreme (overconfident)
- >1 means predictions are too timid (underconfident)
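The slope interpretation can be verified the same way. This sketch doubles and halves the true logits (illustrative factors, not recommendations) and recovers slopes near 0.5 and 2. For self-containment the function here returns a named vector instead of a tibble, which is a simplification for this example only.

```r
safe_logit <- function(p, eps = 1e-6) qlogis(pmin(pmax(p, eps), 1 - eps))

# Same model as the toolkit function, returning a named vector (base R only)
calibration_slope_intercept <- function(y01, p_hat) {
  lp <- safe_logit(p_hat)
  fit <- glm(y01 ~ lp, family = binomial())
  c(intercept = unname(coef(fit)[1]), slope = unname(coef(fit)[2]))
}

set.seed(4)
n <- 20000
lp_true <- rnorm(n, -1, 1)
y <- rbinom(n, 1, plogis(lp_true))

p_over  <- plogis(2 * lp_true)     # logits doubled: predictions too extreme
p_under <- plogis(0.5 * lp_true)   # logits halved: predictions too timid

s_over  <- calibration_slope_intercept(y, p_over)    # slope near 0.5
s_under <- calibration_slope_intercept(y, p_under)   # slope near 2
rbind(overconfident = s_over, underconfident = s_under)
```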
2.5 Windowed calibration metrics (monthly / weekly monitoring)
This converts calibration into a time series that SPC can monitor.
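calibration_by_period <- function(df, period = "month", min_n = 200) {
  df %>%
    # keep only rows that can contribute to every metric
    filter(!is.na(y), !is.na(p_hat), !is.na(score_date)) %>%
    # .env$period is the function argument, even if df already has a `period` column
    mutate(period = as.Date(cut(score_date, .env$period))) %>%
    group_by(period) %>%
    # drop sparse periods before fitting so glm() never runs on tiny windows
    filter(n() >= min_n) %>%
    summarise(
      n = n(),
      event_rate = mean(y),
      brier = mean((y - p_hat)^2),
      calib_int = calibration_intercept(y, p_hat),
      slope_int = list(calibration_slope_intercept(y, p_hat)),
      .groups = "drop"
    ) %>%
    tidyr::unnest_wider(slope_int)
}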
# usage
# cal_ts <- calibration_by_period(df, period = "month", min_n = 250)
# cal_ts
3. EWMA Threshold Selection (Practical, Audit-Friendly)
3.1 Why EWMA
EWMA is suited for:
- gradual drift
- small persistent shifts
- early detection without “false alarm storms”
It is especially useful when calibration drift is gradual rather than abrupt, which is common in production healthcare settings (Van Calster et al. 2016).
In calibration monitoring, EWMA is often preferable to Shewhart charts, because it accumulates evidence across periods instead of judging each period in isolation.
3.2 What are we charting?
You can EWMA-chart any calibration metric; common choices:
- calib_int (intercept drift)
- slope (confidence drift)
- brier (overall probabilistic degradation)
Start simple:
- intercept EWMA is the cleanest operational signal
3.3 EWMA knobs you must justify
EWMA requires:
- lambda (smoothing): typically 0.05–0.3
- L (control limit width): typically 2.7–3.0
Audit-friendly framing:
- lambda controls memory
- L controls false alarms vs missed drift
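To make the two knobs concrete, here is a base-R sketch of the standard EWMA recursion and its time-varying control limits. The function name ewma_manual and the toy series are assumptions for illustration; in the toolkit itself the chart comes from qcc.

```r
# Standard EWMA chart quantities:
#   z_t   = lambda * x_t + (1 - lambda) * z_{t-1},  with z_0 = mu0
#   limit = mu0 +/- L * sd0 * sqrt(lambda / (2 - lambda) * (1 - (1 - lambda)^(2t)))
ewma_manual <- function(x, mu0, sd0, lambda = 0.2, L = 3) {
  z <- numeric(length(x))
  prev <- mu0                                  # z_0 starts at the baseline mean
  for (t in seq_along(x)) {
    z[t] <- lambda * x[t] + (1 - lambda) * prev
    prev <- z[t]
  }
  t_idx <- seq_along(x)
  half_width <- L * sd0 *
    sqrt(lambda / (2 - lambda) * (1 - (1 - lambda)^(2 * t_idx)))
  data.frame(t = t_idx, x = x, ewma = z,
             lcl = mu0 - half_width, ucl = mu0 + half_width)
}

x <- c(0, 0, 0, 1, 1, 1)                       # toy series with a step at t = 4
e_02 <- ewma_manual(x, mu0 = 0, sd0 = 1, lambda = 0.2)
e_10 <- ewma_manual(x, mu0 = 0, sd0 = 1, lambda = 1.0)

e_02$ewma   # 0.000 0.000 0.000 0.200 0.360 0.488
e_10$ewma   # lambda = 1 has no memory: reproduces x
```

With lambda = 1 the limits reduce to mu0 ± L * sd0, so the chart becomes a Shewhart chart. Smaller lambda gives longer memory and narrower limits, and L scales the limits directly.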
3.4 Baseline phase approach (recommended)
Choose a “stable baseline” period (pre-deployment, or first 3–6 months of stable ops). Estimate mean and SD of the metric during baseline. Set EWMA limits relative to baseline variability.
ewma_fit_from_baseline <- function(x, baseline_idx, lambda = 0.2, L = 3) {
xb <- x[baseline_idx]
mu0 <- mean(xb, na.rm = TRUE)
sd0 <- sd(xb, na.rm = TRUE)
list(mu0 = mu0, sd0 = sd0, lambda = lambda, L = L)
}
3.5 Run EWMA and flag out-of-control points
ewma_flag <- function(x, mu0, sd0, lambda = 0.2, L = 3) {
  # Supplying center and std.dev pins the limits to the baseline
  # instead of re-estimating them from the monitored data
  chart <- qcc::ewma(
    x,
    center = mu0,
    std.dev = sd0,
    lambda = lambda,
    nsigmas = L,
    plot = FALSE
  )
  # The limits apply to the smoothed EWMA statistic (chart$y), not the raw metric
  # (assumes the object layout of the CRAN qcc 2.x release)
  ucl <- chart$limits[, 2]
  lcl <- chart$limits[, 1]
  ooc <- which(chart$y > ucl | chart$y < lcl)
  list(chart = chart, ooc = ooc, ucl = ucl, lcl = lcl)
}
Usage:
# cal_ts <- calibration_by_period(df, "month", min_n = 250)
# x <- cal_ts$calib_int
# baseline is first K periods (document your rule)
# K <- 6
# base <- 1:K
# pars <- ewma_fit_from_baseline(x, baseline_idx = base, lambda = 0.2, L = 3)
# res <- ewma_flag(x, mu0 = pars$mu0, sd0 = pars$sd0, lambda = pars$lambda, L = pars$L)
# res$ooc
3.6 Plot with explicit triggers
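plot_ewma <- function(cal_ts, metric = "calib_int", ewma_res) {
  dfp <- tibble(
    period = cal_ts$period,
    x = cal_ts[[metric]],
    ewma = as.numeric(ewma_res$chart$y),  # smoothed statistic the limits apply to
    ucl = ewma_res$ucl,
    lcl = ewma_res$lcl,
    flag = seq_along(x) %in% ewma_res$ooc
  )
  ggplot(dfp, aes(x = period)) +
    geom_point(aes(y = x), alpha = 0.4) +        # raw per-period metric
    geom_line(aes(y = ewma)) +                   # EWMA statistic
    geom_point(aes(y = ewma, shape = flag)) +    # flagged periods
    geom_line(aes(y = ucl), linetype = "dashed") +
    geom_line(aes(y = lcl), linetype = "dashed") +
    labs(
      title = paste0("EWMA Monitoring: ", metric),
      x = "Period",
      y = metric
    )
}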
# usage
# plot_ewma(cal_ts, "calib_int", res)
3.7 Selecting lambda and L (a defensible rule)
A simple, documentable approach:
- Use lambda = 0.2 for monthly monitoring (moderate memory)
- Use L = 3 to balance false alarms vs missed drift
- Validate sensitivity via simulation: inject a known intercept shift and verify detection delay
This is not perfect. It is transparent and testable.
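The simulation check can be sketched in base R as follows. Everything numeric here is an assumption to be replaced with your own baseline: a period-to-period SD of 0.1 for the intercept, 12 stable periods followed by 24 shifted ones, and 2000 replicates. The helper names first_signal and simulate_delay are hypothetical, and the EWMA is computed by hand rather than through qcc so the block has no dependencies.

```r
set.seed(5)

# First period at which the EWMA statistic leaves its control limits (NA if never)
first_signal <- function(x, mu0, sd0, lambda, L) {
  z <- mu0
  for (t in seq_along(x)) {
    z <- lambda * x[t] + (1 - lambda) * z
    hw <- L * sd0 * sqrt(lambda / (2 - lambda) * (1 - (1 - lambda)^(2 * t)))
    if (abs(z - mu0) > hw) return(t)
  }
  NA_integer_
}

# Inject an intercept shift after n_pre stable periods and summarise detection
simulate_delay <- function(shift, n_pre = 12, n_post = 24, sd0 = 0.1,
                           lambda = 0.2, L = 3, n_sim = 2000) {
  delays <- replicate(n_sim, {
    x <- c(rnorm(n_pre, 0, sd0), rnorm(n_post, shift, sd0))
    s <- first_signal(x, mu0 = 0, sd0 = sd0, lambda = lambda, L = L)
    if (is.na(s)) NA_integer_ else s - n_pre   # <= 0 means alarm before the shift
  })
  hit <- !is.na(delays) & delays > 0
  c(
    p_false_alarm = mean(!is.na(delays) & delays <= 0),
    p_detected    = mean(hit),
    median_delay  = median(delays[hit])
  )
}

d2 <- simulate_delay(shift = 0.2)   # shift of 2 baseline SDs
d0 <- simulate_delay(shift = 0)     # no shift: signals after period 12 are false alarms
round(rbind(shift_2sd = d2, no_shift = d0), 3)
```

Rerunning this with candidate values of lambda and L, and recording the resulting false-alarm rate and median delay, gives a documented rationale for the chosen settings.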
4. Governance Checklist (Calibration in Production)
4.1 What to log (minimum viable audit trail)
Per prediction batch / period:
- model identifier (version / hash)
- training data fingerprint
- scoring data fingerprint
- prediction timestamp range
- number scored (N)
- observed outcome N (when available)
- calibration intercept and slope
- Brier score (or other pre-agreed metric)
- alert status (in control / warning / action)
4.2 Define action thresholds and escalation paths
Document:
- what constitutes a warning vs action
- who receives alerts
- acceptable time-to-review
- acceptable time-to-remediation
Example policy language:
- Warning: EWMA out-of-control 1 period OR slope outside [0.8, 1.2]
- Action: 2 consecutive OOC periods OR intercept drift beyond clinically relevant margin
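The example policy can be encoded so that alert status is computed the same way every period. This is a sketch of one possible encoding: classify_alerts is a hypothetical helper, and the intercept-margin rule is left out because the clinically relevant margin has to be set locally.

```r
# Map per-period flags to "in control" / "warning" / "action"
classify_alerts <- function(ooc_flag, slope, slope_range = c(0.8, 1.2)) {
  stopifnot(length(ooc_flag) == length(slope))
  # TRUE where this period and the previous one are both out of control
  prev_ooc <- c(FALSE, head(ooc_flag, -1))
  two_in_row <- ooc_flag & prev_ooc
  slope_out <- !is.na(slope) &
    (slope < slope_range[1] | slope > slope_range[2])

  status <- rep("in control", length(ooc_flag))
  status[ooc_flag | slope_out] <- "warning"
  status[two_in_row] <- "action"     # action overrides warning
  status
}

ooc   <- c(FALSE, FALSE, TRUE, TRUE, FALSE, FALSE)
slope <- c(1.00, 0.95, 0.90, 0.85, 0.75, 1.05)
status <- classify_alerts(ooc, slope)
status
# "in control" "in control" "warning" "action" "warning" "in control"
```

Here ooc would come from converting the flagged indices into a logical vector over periods, for example seq_along(x) %in% res$ooc.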
4.3 Approved remediation options (pre-specify)
Remediation must be pre-specified to avoid “moving goalposts.”
Typical options:
- Intercept-only recalibration (fast, transparent)
- Full logistic recalibration (slope + intercept)
- Stratified recalibration (by site/service if drift is localized)
- Retraining (only if concept drift is demonstrated)
- Temporary suspension (if drift indicates harm risk)
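The first two options amount to estimating alpha (and optionally beta) on a recent window and applying them to later predictions. The sketch below uses hypothetical helper names, fit_recalibration and apply_recalibration, and synthetic drifted predictions. For brevity it scores on the window it was fitted on, whereas a real deployment would validate on later data.

```r
safe_logit <- function(p, eps = 1e-6) qlogis(pmin(pmax(p, eps), 1 - eps))

# Estimate recalibration parameters on a recent window
fit_recalibration <- function(y01, p_hat, type = c("intercept", "logistic")) {
  type <- match.arg(type)
  lp <- safe_logit(p_hat)
  if (type == "intercept") {
    fit <- glm(y01 ~ offset(lp), family = binomial())
    c(alpha = unname(coef(fit)[1]), beta = 1)        # slope held at 1
  } else {
    fit <- glm(y01 ~ lp, family = binomial())
    c(alpha = unname(coef(fit)[1]), beta = unname(coef(fit)[2]))
  }
}

# Apply stored parameters to new predictions
apply_recalibration <- function(p_hat, pars) {
  plogis(pars[["alpha"]] + pars[["beta"]] * safe_logit(p_hat))
}

set.seed(6)
n <- 20000
lp_true <- rnorm(n, -1, 1)
y <- rbinom(n, 1, plogis(lp_true))
p_drifted <- plogis(-0.3 + 1.5 * lp_true)   # synthetic: shifted and too extreme

pars <- fit_recalibration(y, p_drifted, type = "logistic")
p_recal <- apply_recalibration(p_drifted, pars)

pars   # beta near 2/3 undoes the 1.5x inflation of the logits
c(brier_before = mean((y - p_drifted)^2),
  brier_after  = mean((y - p_recal)^2))
```

Storing pars alongside the model version gives the audit trail a compact, reproducible record of what was changed.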
4.4 Documentation artifacts to retain
For each alert event:
- calibration monitoring plot(s)
- summary table of metrics
- decision log (who decided what, when, why)
- code snapshot used to compute metrics
- any recalibration fit objects / parameters
- post-fix validation results
Calibration is the most deployment-critical and least-reported metric in clinical AI — a model with 0.85 AUC that overstates probabilities by 30% will cause clinicians to over-triage low-risk patients and exhaust resources on false alarms. FDA’s SaMD guidance explicitly calls out calibration as a required performance metric alongside discrimination, yet the majority of published clinical prediction models report only AUC. DoDTR-based trauma models deployed via MAVEN must be recalibrated when moved to a new MTF population, because calibration is far more sensitive to population shift than AUC. Skipping the reliability diagram and Brier score decomposition before deployment is not a shortcut — it is a guarantee that the model’s probability outputs will be wrong in a direction you have not characterized.
Closing Notes
Calibration is where deployed models become, in the language of modern prediction-model governance, either operationally trustworthy or quietly unsafe (Steyerberg 2019; Osheroff et al. 2007).
A calibration toolkit is not just analytics. It is governance made measurable.
Series Callout
This post is part of a broader Toolkit Series for Applied Statistics, AI, and Clinical Analytics:
- Bayesian Workflow Toolkit
- Calibration Toolkit
- Missing Data Toolkit
- Rare Events Toolkit
- Causal Inference Toolkit
- Survival Analysis Toolkit
- Prediction Modeling Toolkit
- Real-World Evidence Toolkit
- OMOP and Interoperability Toolkit
- Trauma Registry Analytics Toolkit