Missing Data Toolkit (Patterns, Sensitivity Grids, Reviewer Language)
Executive Summary
This toolkit supports audit-ready missing data work by making three things explicit:
- What is missing, where, and for whom (pattern tables)
- How conclusions change under reasonable alternative assumptions (sensitivity grids)
- How to communicate limitations without hand-waving (reviewer-facing language)
The goal is not to eliminate missingness. The goal is to make missingness legible through transparent description, principled assumptions, and sensitivity analysis (Little and Rubin 2019; van Buuren 2018; Carpenter et al. 2021).
1. Setup and Conventions
Assume an analysis dataset named data with a binary outcome column named outcome (0/1 or factor), plus predictors.
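If no dataset is at hand, a small simulated one is enough to exercise the helpers in this toolkit. The sketch below is purely illustrative: the name toy_data, the column names (site, age, severity, lactate, sbp), and every parameter value are assumptions, chosen only so that lactate is missing more often at one site.

```r
# Hypothetical toy dataset; all names and parameters are assumptions
set.seed(1)
n <- 500
toy_data <- data.frame(
  site     = sample(c("A", "B", "C"), n, replace = TRUE),
  age      = round(rnorm(n, mean = 55, sd = 15)),
  severity = rpois(n, lambda = 3),
  lactate  = round(rlnorm(n, meanlog = 0.6, sdlog = 0.5), 1),
  sbp      = round(rnorm(n, mean = 120, sd = 20))
)

# Binary outcome that depends on age, severity, and lactate
lp <- -4 + 0.02 * toy_data$age + 0.3 * toy_data$severity + 0.4 * toy_data$lactate
toy_data$outcome <- rbinom(n, size = 1, prob = plogis(lp))

# Workflow-driven missingness: lactate is skipped more often at site "C"
p_miss <- ifelse(toy_data$site == "C", 0.40, 0.10)
toy_data$lactate[runif(n) < p_miss] <- NA

# A small amount of unrelated missingness in sbp
toy_data$sbp[runif(n) < 0.05] <- NA

# To run the later examples on the toy data:
# data <- toy_data
```

Because the missingness here is simulated, it says nothing about real mechanisms; it only gives the pattern tables something to display.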
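# Example:
# data <- readRDS("data_processed/analysis_df.rds")
# exists("data") is always TRUE because utils::data() is a function,
# so check that data is actually a data frame
stopifnot(is.data.frame(data))
# Standardize outcome to 0/1 numeric where needed
# (a factor must have levels labeled "0" and "1"; other labels become NA)
y01 <- function(x) {
  if (is.factor(x)) return(as.integer(as.character(x)))
  as.integer(x)
}
2. Missingness Pattern Tables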
2.1 Variable-level missingness summary
library(dplyr)
library(tidyr)
library(tibble)
missingness_summary <- function(df) {
  tibble::tibble(
    variable = names(df),
    n = nrow(df),
    n_missing = vapply(df, function(x) sum(is.na(x)), integer(1)),
    pct_missing = 100 * n_missing / n
  ) %>%
    arrange(desc(pct_missing))
}
miss_tbl <- missingness_summary(data)
miss_tbl
Optional: focus on variables above a missingness threshold.
miss_tbl %>%
  filter(pct_missing >= 5)
2.2 Missingness by subgroup (site/era/service)
This catches workflow-driven missingness, which is often more informative than overall percentages alone in clinical and registry data (Little and Rubin 2019; Carpenter et al. 2021).
missingness_by_group <- function(df, group_var, vars) {
  df %>%
    group_by(.data[[group_var]]) %>%
    summarise(
      n = n(),
      across(all_of(vars), ~ mean(is.na(.x)), .names = "miss_{.col}"),
      .groups = "drop"
    ) %>%
    mutate(across(starts_with("miss_"), ~ round(100 * .x, 1)))
}
# Example usage:
# missingness_by_group(data, group_var = "site", vars = c("lactate", "sbp", "gcs"))
2.3 Pairwise missingness (co-missingness heat map)
Useful for diagnosing “blocks” of documentation.
library(ggplot2)
co_missing_matrix <- function(df, vars) {
  m <- as.data.frame(lapply(df[vars], is.na))
  cm <- cor(as.matrix(m), use = "pairwise.complete.obs")
  cm
}
plot_co_missing <- function(cm) {
  dfm <- as.data.frame(as.table(cm))
  names(dfm) <- c("var1", "var2", "cor_missing")
  ggplot(dfm, aes(x = var1, y = var2, fill = cor_missing)) +
    geom_tile() +
    labs(title = "Co-missingness correlation", x = NULL, y = NULL) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
}
# Example usage:
# vars <- c("lactate","sbp","dbp","gcs_total","base_deficit")
# cm <- co_missing_matrix(data, vars)
# plot_co_missing(cm)
2.4 Missingness pattern table (top patterns)
This is the “what combinations are we actually seeing?” table.
missingness_patterns <- function(df, vars, top_n = 15) {
  pat <- df %>%
    transmute(across(all_of(vars), ~ as.integer(is.na(.x)))) %>%
    unite("pattern", everything(), sep = "") %>%
    count(pattern, sort = TRUE) %>%
    mutate(pct = round(100 * n / sum(n), 2)) %>%
    slice_head(n = top_n)
  pat
}
# Example usage:
# missingness_patterns(data, vars = c("lactate","base_deficit","sbp","gcs_total"))
3. Example Sensitivity Grids
3.1 Sensitivity grid: complete-case vs missingness-indicator vs multiple imputation
This grid forces you to answer: “Are my conclusions stable across reasonable strategies?” That framing is consistent with modern recommendations to compare assumptions rather than treat one missing-data strategy as automatically correct (van Buuren 2018; Sterne et al. 2009).
Model family: logistic regression example (adjust as needed).
library(broom)
# Complete-case fit: drop rows missing any required variable
fit_cc <- function(df, formula, vars_required) {
  df_cc <- df %>% tidyr::drop_na(all_of(vars_required))
  glm(formula, data = df_cc, family = binomial())
}
# Missing-indicator fit: median-fill one variable and add a missingness flag
fit_missing_indicator <- function(df, formula_base, var_with_missing) {
  df2 <- df %>%
    mutate(
      miss_ind = is.na(.data[[var_with_missing]]),
      var_imp = ifelse(is.na(.data[[var_with_missing]]),
                       median(.data[[var_with_missing]], na.rm = TRUE),
                       .data[[var_with_missing]])
    )
  # formula_base must reference var_imp and miss_ind, not the original variable;
  # otherwise glm() silently drops the rows with missing values
  glm(formula_base, data = df2, family = binomial())
}
summarize_fit <- function(fit, label) {
  broom::tidy(fit) %>%
    mutate(model = label)
}
Example grid run (edit variable names/formulas to your dataset):
# Define your model and key variables
# outcome must be 0/1 numeric or factor convertible
# Example:
# formula_base <- outcome ~ age + severity + var_imp + miss_ind
# For complete-case:
# formula_cc <- outcome ~ age + severity + lactate
# vars_required <- c("outcome","age","severity","lactate")
# Uncomment and tailor:
# f1 <- fit_cc(data, formula_cc, vars_required)
# f2 <- fit_missing_indicator(data, formula_base, var_with_missing = "lactate")
# bind_rows(
#   summarize_fit(f1, "complete_case"),
#   summarize_fit(f2, "missing_indicator")
# )
3.2 Multiple imputation sensitivity grid (MI settings)
Instead of a single MI run, vary key knobs:
- number of imputations m
- method (e.g., pmm, logreg)
- inclusion/exclusion of certain predictors in the imputation model
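One way to keep these knobs explicit is to enumerate them as a table before any imputation is run, so the set of specifications is fixed up front. The sketch below uses only base R; the particular values of m, the method names, and the include_site toggle are illustrative assumptions, not recommendations.

```r
# Hypothetical grid of MI settings; every value here is an assumption
mi_settings <- expand.grid(
  m            = c(5, 20, 50),
  method       = c("pmm", "norm"),
  include_site = c(TRUE, FALSE),  # hypothetical auxiliary-predictor toggle
  stringsAsFactors = FALSE
)
mi_settings$run_id <- seq_len(nrow(mi_settings))

# 3 values of m x 2 methods x 2 predictor sets = 12 specifications
nrow(mi_settings)

# Each row can then drive one imputation run, for example by passing
# mi_settings$m[i] and mi_settings$method[i] to mice::mice()
```

Writing the grid down first also makes it easy to report exactly which specifications were tried.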
# Optional template (requires mice)
# library(mice)
mi_run <- function(df, m = 5, method = "pmm", seed = 20260125) {
  mice::mice(df, m = m, method = method, seed = seed, printFlag = FALSE)
}
mi_fit_pool <- function(imp, formula) {
  # Fit on each completed dataset with data passed explicitly. A formula
  # object handed to with(imp, glm(formula, ...)) is resolved in the
  # environment where it was created, not in the imputed data.
  fits <- lapply(
    mice::complete(imp, action = "all"),
    function(d) glm(formula, data = d, family = binomial())
  )
  mice::pool(mice::as.mira(fits))
}
# Example usage:
# imp1 <- mi_run(data, m = 5, method = "pmm")
# imp2 <- mi_run(data, m = 20, method = "pmm")
# imp3 <- mi_run(data, m = 20, method = "norm") # if appropriate for numeric covariates
#
# pool1 <- mi_fit_pool(imp1, outcome ~ age + severity + lactate)
# summary(pool1)
3.3 MNAR “delta adjustment” sensitivity (simple, reviewer-friendly)
This is a practical pattern for “what if missing values are systematically higher/lower?”
For a continuous variable x with missingness, define:
- imputed baseline (e.g., median or MI mean)
- then shift missing values by delta (in clinically meaningful units)
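As a quick worked example of the shift itself, consider a hypothetical vector of five lactate values with two missing. The values and the choice of delta = 1 are assumptions for illustration only.

```r
# Hypothetical lactate values (mmol/L); NA marks a missing measurement
x <- c(1.0, NA, 3.0, NA, 2.0)

x_med <- median(x, na.rm = TRUE)  # baseline fill value: median of 1, 3, 2 is 2
delta <- 1                        # assume missing values run 1 unit higher

# Observed values are kept; missing values become baseline + delta
x_adj <- ifelse(is.na(x), x_med + delta, x)
x_adj
# 1 3 3 3 2
```

Setting delta = 0 recovers plain median fill, which is why 0 belongs in every delta grid as the reference point.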
delta_adjustment_grid <- function(df, outcome, x, covars, deltas) {
  # df: data.frame
  # outcome: string
  # x: string for a numeric covariate with missingness
  # covars: character vector of other covariates
  # deltas: numeric vector (e.g., c(-2, -1, 0, 1, 2))
  stopifnot(x %in% names(df), outcome %in% names(df))
  x_med <- median(df[[x]], na.rm = TRUE)
  res <- purrr::map_dfr(deltas, function(d) {
    df2 <- df %>%
      mutate(
        x_adj = ifelse(is.na(.data[[x]]), x_med + d, .data[[x]])
      )
    fml <- as.formula(
      paste(outcome, "~", paste(c(covars, "x_adj"), collapse = " + "))
    )
    fit <- glm(fml, data = df2, family = binomial())
    broom::tidy(fit) %>%
      filter(term == "x_adj") %>%
      mutate(delta = d)
  })
  res
}
# Example usage:
# delta_adjustment_grid(
#   data,
#   outcome = "outcome",
#   x = "lactate",
#   covars = c("age", "severity"),
#   deltas = c(-2, -1, 0, 1, 2)
# )
This yields a compact story:
- if the effect direction/magnitude flips under small deltas, your conclusion is fragile
- if it’s stable, you’ve earned more confidence
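The fragile-versus-stable reading can be made mechanical with a small summary over the grid output. The sketch below is one possible approach, not a standard procedure: summarize_delta_stability is a hypothetical helper, and res_example is a made-up table that only mimics the delta and estimate columns a delta grid would return.

```r
# Made-up numbers for illustration only; not results from any real analysis
res_example <- data.frame(
  delta    = c(-2, -1, 0, 1, 2),
  estimate = c(-0.10, 0.05, 0.20, 0.30, 0.38)
)

# Hypothetical helper: compare each estimate's sign to the reference delta
summarize_delta_stability <- function(res, reference_delta = 0) {
  ref_est <- res$estimate[res$delta == reference_delta]
  stopifnot(length(ref_est) == 1)
  flipped <- sign(res$estimate) != sign(ref_est)
  list(
    reference_estimate = ref_est,
    any_sign_flip = any(flipped),
    # delta closest to the reference at which the direction differs
    tipping_delta = if (any(flipped)) {
      d <- res$delta[flipped]
      d[which.min(abs(d - reference_delta))]
    } else {
      NA_real_
    },
    estimate_range = range(res$estimate)
  )
}

stab <- summarize_delta_stability(res_example)
stab$any_sign_flip  # TRUE
stab$tipping_delta  # -2
```

Whether a tipping delta of a given size counts as "small" is a clinical judgment about plausible values, not something the code can decide.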
4. Reviewer-Facing Language for Limitations Sections
4.1 Short template (drop-in paragraph)
Missing data were not treated as a nuisance to be removed. We first characterized missingness patterns across variables and subgroups to distinguish structural from workflow-driven gaps. Our primary analysis avoided silent deletion and reported the effective sample size used for each model. Because missingness mechanisms cannot be fully verified, we performed prespecified sensitivity analyses comparing complete-case results with alternative strategies (missingness indicators and multiple imputation where appropriate). Conclusions were evaluated for stability across these scenarios; where results were sensitive to missingness assumptions, we report this explicitly and interpret effect estimates with appropriate caution.
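The effective sample size mentioned in the template can be read off each fitted model rather than assumed from the dataset size. The sketch below is a minimal illustration using base R: effective_n is a hypothetical helper and toy is simulated data with 30 values deliberately set to missing.

```r
# Hypothetical toy data with missing values in one covariate
set.seed(42)
toy <- data.frame(
  y  = rbinom(200, 1, 0.4),
  x1 = rnorm(200),
  x2 = rnorm(200)
)
toy$x2[sample(200, 30)] <- NA  # exactly 30 distinct rows lose x2

# glm() drops incomplete rows by default (na.action = na.omit)
fit <- glm(y ~ x1 + x2, data = toy, family = binomial())

# Hypothetical helper: rows available vs rows actually used by the model
effective_n <- function(fit, df) {
  data.frame(
    n_total   = nrow(df),
    n_used    = stats::nobs(fit),
    n_dropped = nrow(df) - stats::nobs(fit)
  )
}

effective_n(fit, toy)
# n_total = 200, n_used = 170, n_dropped = 30
```

Reporting n_used next to each model makes any silent deletion visible to a reviewer.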
4.2 MAR language (when MI is used)
Multiple imputation was performed under a Missing At Random (MAR) assumption conditional on observed covariates included in the imputation model. This assumption is not empirically testable; therefore, we evaluated robustness by varying the imputation specification and comparing results to complete-case and missingness-indicator analyses. Consistency across approaches supports interpretability; divergence indicates sensitivity to missingness assumptions and is reported as a limitation (Rubin 1987; van Buuren 2018; Sterne et al. 2009).
4.3 MNAR language (when you suspect severity-driven missingness)
We expect Missing Not At Random (MNAR) mechanisms for select variables (e.g., measurements omitted during time-critical escalation). Because MNAR cannot be resolved by standard imputation alone, we performed sensitivity analyses that explicitly vary plausible values for missing measurements (delta-adjustment). These analyses quantify how conclusions would change under clinically plausible departures from MAR and are used to bound interpretation rather than claim certainty (Little 1993; Carpenter et al. 2021).
Closing Notes
A defensible missing-data workflow does not promise certainty.
It delivers:
- clear missingness characterization,
- prespecified sensitivity checks,
- and language that makes assumptions reviewable.
That is what audit-ready looks like.
Series Callout
This post is part of a broader Toolkit Series for Applied Statistics, AI, and Clinical Analytics:
- Bayesian Workflow Toolkit
- Calibration Toolkit
- Missing Data Toolkit
- Rare Events Toolkit
- Causal Inference Toolkit
- Survival Analysis Toolkit
- Prediction Modeling Toolkit
- Real-World Evidence Toolkit
- OMOP and Interoperability Toolkit
- Trauma Registry Analytics Toolkit