Missing Data Is the Real Model: Practical Strategies in R

Trauma Registry and Other Topics
Missing Data
R
How to think honestly about missing data in clinical and registry analyses, with practical strategies in R for visualization, imputation, and sensitivity analysis.
Published: February 1, 2024

Modified: June 9, 2026

Executive Summary

Most applied models fail before modeling even begins.

They fail because missing data is treated as:

  • an inconvenience,
  • a technical nuisance,
  • or something to be “cleaned away.”

In reality, missing data often encodes:

  • workflow constraints,
  • illness severity,
  • documentation bias,
  • access to care,
  • and system design.

Ignoring missingness does not simplify the problem.
It changes the estimand—often without you realizing it (Little and Rubin 2019; Sterne et al. 2009; Buuren 2018).


1. The Dangerous Comfort of Complete-Case Analysis

The most common missing-data strategy in applied work is also the most damaging:

“Just drop rows with missing values.”

This approach assumes:

  • missingness is completely random (MCAR),
  • removed observations are interchangeable with retained ones,
  • and conclusions remain valid.

In clinical and operational data, these assumptions are almost never true (Sterne et al. 2009; Little and Rubin 2019).
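
A quick way to test the "interchangeable" assumption is to compare the rows that complete-case analysis keeps with the rows it drops. The sketch below uses simulated data only; the variable names (`iss`, `lactate`, `died`) and every coefficient are assumptions chosen for illustration, not values from any registry.

```r
# Simulated illustration: sicker patients are less likely to have lactate recorded
set.seed(1)
n <- 5000
iss  <- round(pmin(75, pmax(1, rnorm(n, mean = 16, sd = 10))))  # simulated injury severity
died <- rbinom(n, 1, plogis(-4 + 0.08 * iss))
lactate <- rnorm(n, mean = 2 + 0.05 * iss, sd = 0.8)
lactate[runif(n) < plogis(-2.5 + 0.1 * iss)] <- NA              # missingness rises with severity

dat <- data.frame(iss, died, lactate)
retained <- complete.cases(dat)                                  # rows na.omit() would keep

# Compare retained and dropped rows on variables that are always observed
comparison <- data.frame(
  group     = c("retained", "dropped"),
  n         = c(sum(retained), sum(!retained)),
  mean_iss  = c(mean(dat$iss[retained]), mean(dat$iss[!retained])),
  mortality = c(mean(dat$died[retained]), mean(dat$died[!retained]))
)
comparison
```

If the two rows of `comparison` differ noticeably on severity or outcome, the dropped patients were not interchangeable with the retained ones, and a complete-case estimate describes a different population.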


2. Missingness Is Information, Not Noise

In real clinical systems, data is missing for reasons.

Examples:

  • Labs missing because the patient was too unstable
  • Vitals missing because care escalated
  • Documentation missing because time ran out
  • Fields missing because they were irrelevant in that context

Each of these tells a different story.


3. The Three Missingness Mechanisms (Plain English)

You don’t need equations—just clarity.

3.1 MCAR: Missing Completely at Random

Rare in practice.

“The data is missing for no reason related to the patient or outcome.”

3.2 MAR: Missing at Random

Common, manageable, but still dangerous if ignored.

“Missingness depends only on data you observed: given what was recorded, it is unrelated to the missing value itself.”

3.3 MNAR: Missing Not at Random

Common in clinical data—and the hardest case.

“Missingness depends on the unrecorded value itself, or on unmeasured factors such as severity, even after accounting for everything observed.”

Most real-world datasets contain all three simultaneously, which is why mechanism should be argued from workflow and subject-matter knowledge rather than assumed from convenience alone (Little and Rubin 2019; Carpenter et al. 2021).
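
The three mechanisms are easier to tell apart when you can generate them yourself. The following is a minimal simulation under assumed, made-up parameters: `age` is always observed, `lactate` is the variable with gaps, and the three `miss_*` vectors are hypothetical missingness rules.

```r
set.seed(2)
n <- 10000
age     <- rnorm(n, mean = 50, sd = 15)               # always observed
lactate <- 2 + 0.04 * (age - 50) + rnorm(n, sd = 1)   # true values, before any gaps

miss_mcar <- runif(n) < 0.30                           # MCAR: same chance for everyone
miss_mar  <- runif(n) < plogis(-1 + 0.08 * (age - 50)) # MAR: depends only on observed age
miss_mnar <- runif(n) < plogis(-1 + 1.0 * (lactate - 2)) # MNAR: depends on lactate itself

observed_mean <- function(x, miss) mean(x[!miss])
means <- c(
  truth = mean(lactate),
  mcar  = observed_mean(lactate, miss_mcar),
  mar   = observed_mean(lactate, miss_mar),
  mnar  = observed_mean(lactate, miss_mnar)
)
round(means, 2)

# Under MAR, conditioning on the observed driver of missingness recovers the mean
fit_mar      <- lm(lactate ~ age, subset = !miss_mar)
mar_adjusted <- mean(predict(fit_mar, newdata = data.frame(age = age)))
round(mar_adjusted, 2)
```

In this setup the naive observed mean is close to the truth only under MCAR. Under MAR it is off, but adjusting for `age` repairs it; under MNAR no observed variable can repair it, which is why that case needs sensitivity analysis rather than a better imputation model.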


4. First Rule: Visualize Missingness Before Modeling

Before fitting anything, ask:

  • What’s missing?
  • Where?
  • For whom?
  • When?
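
library(naniar)   # vis_miss() and related missingness plots
library(ggplot2)  # vis_miss() returns a ggplot object

# Heatmap of missing vs. observed cells across the whole data frame
vis_miss(data)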

This plot alone often reveals:

  • documentation patterns,
  • site-level differences,
  • time-dependent missingness,
  • variables that should never be missing (but are).
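
The "for whom" and "when" questions usually need grouped summaries in addition to a single heatmap. Here is a base R sketch on a hypothetical registry extract; the columns `site`, `month`, `lactate`, and `gcs` and the injected gaps are assumptions made for the example.

```r
# Hypothetical registry extract with site-dependent gaps injected on purpose
set.seed(3)
n <- 600
registry <- data.frame(
  site    = sample(c("A", "B", "C"), n, replace = TRUE),
  month   = factor(sample(month.abb[1:6], n, replace = TRUE), levels = month.abb[1:6]),
  lactate = rnorm(n, mean = 2),
  gcs     = sample(3:15, n, replace = TRUE)
)
registry$lactate[registry$site == "C" & runif(n) < 0.5] <- NA
registry$gcs[runif(n) < 0.05] <- NA

# What is missing? Proportion missing per variable
prop_missing <- colMeans(is.na(registry))

# For whom / where? Proportion of missing lactate by site
by_site <- tapply(is.na(registry$lactate), registry$site, mean)

# When? Proportion of missing lactate by month, in calendar order
by_month <- tapply(is.na(registry$lactate), registry$month, mean)

round(prop_missing, 3)
round(by_site, 3)
round(by_month, 3)
```

A site or month whose missing proportion stands far apart from the others is a workflow question to take back to the people who collect the data, before any model is fitted.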

5. Missingness as a Predictor (Yes, Really)

One of the most overlooked techniques:

Explicitly model missingness.

data <- data |>
  dplyr::mutate(
    lactate_missing = is.na(lactate)
  )

This acknowledges:

  • missingness may be prognostic,
  • absence of measurement is itself information.

This is not a hack. It’s honesty.
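
To see what the indicator buys you, the sketch below fits a logistic regression in which the only lactate information is whether it was measured. The data are simulated, and `severity`, the coefficients, and the missingness rule are all assumptions for illustration.

```r
set.seed(4)
n <- 4000
age      <- rnorm(n, mean = 45, sd = 15)
severity <- rnorm(n)                                   # never recorded directly
died     <- rbinom(n, 1, plogis(-2.5 + 0.02 * (age - 45) + 1.2 * severity))
lactate  <- 2 + 0.8 * severity + rnorm(n, sd = 0.5)
lactate[runif(n) < plogis(-1 + 1.5 * severity)] <- NA  # sicker patients measured less often

dat <- data.frame(died, age, lactate)
dat$lactate_missing <- is.na(dat$lactate)

# The indicator alone carries prognostic signal, even after adjusting for age
fit <- glm(died ~ age + lactate_missing, data = dat, family = binomial())
odds_ratio_missing <- exp(coef(fit)[["lactate_missingTRUE"]])
odds_ratio_missing
```

An odds ratio above 1 here means "lactate not measured" marks higher-risk patients in this simulated workflow. This use is mainly suited to prediction; in etiologic analyses, combining an indicator with filled-in values can bias the other coefficients, so it is not a substitute for a principled missing-data method.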


6. Why “Mean Imputation” Is Usually Wrong

Simple imputation strategies:

  • distort distributions,
  • shrink variance,
  • bias associations,
  • inflate confidence.

They create fictional certainty.

If you impute, do it deliberately—or not at all.
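
The distortions listed above are easy to reproduce. This is a small simulation, with made-up parameters, in which 40% of `y` is removed completely at random and then replaced by the observed mean.

```r
set.seed(5)
n <- 2000
x <- rnorm(n)
y <- 0.7 * x + rnorm(n, sd = sqrt(1 - 0.49))   # true correlation with x is about 0.7

y_obs <- y
y_obs[runif(n) < 0.4] <- NA                     # 40% missing completely at random

# Mean imputation: every gap receives the same value
y_mean_imp <- ifelse(is.na(y_obs), mean(y_obs, na.rm = TRUE), y_obs)

round(c(
  sd_true      = sd(y),
  sd_mean_imp  = sd(y_mean_imp),
  cor_true     = cor(x, y),
  cor_mean_imp = cor(x, y_mean_imp)
), 3)

# Naive standard errors: the imputed version counts invented values as real data
se_observed <- sd(y_obs, na.rm = TRUE) / sqrt(sum(!is.na(y_obs)))
se_mean_imp <- sd(y_mean_imp) / sqrt(n)
round(c(se_observed = se_observed, se_mean_imp = se_mean_imp), 4)
```

Even in this best case, where the data really are MCAR, the imputed variable has a smaller standard deviation, a weaker correlation with `x`, and a smaller standard error than the observed data can support.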


7. Multiple Imputation (When It Helps—and When It Doesn’t)

Multiple imputation can be powerful if used correctly (Sterne et al. 2009; Buuren 2018).

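library(mice)

# Predictive mean matching ("pmm") fills each gap with an observed donor value
imp <- mice(
  data,
  m = 5,            # number of imputed datasets
  method = "pmm",
  seed = 20260125   # fixed seed so the imputations are reproducible
)

# Stacked long format is convenient for inspecting imputed values.
# For inference, fit the model within each imputation using with(imp, ...)
# and combine the fits with pool() rather than analysing the stack as one dataset.
completed <- mice::complete(imp, action = "long")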

When it helps:

  • MAR is plausible
  • Covariates explain missingness
  • You propagate uncertainty

When it fails:

  • MNAR dominates
  • Imputation model is misspecified
  • Reviewers don’t know what you assumed
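
"Propagating uncertainty" has a concrete meaning: the per-imputation results are combined with Rubin's rules, so that disagreement between imputations widens the final standard error. In mice this is what `pool()` does after `with()`. The base R function below, `pool_rubin()`, is a hypothetical helper written only to make the arithmetic visible for a single coefficient, and the input numbers are made up to exercise it.

```r
# Rubin's rules for one scalar parameter.
# est: point estimates from each of the m completed datasets
# se:  their standard errors
pool_rubin <- function(est, se) {
  m <- length(est)
  stopifnot(m >= 2, length(se) == m)
  q_bar   <- mean(est)                       # pooled estimate
  within  <- mean(se^2)                      # average within-imputation variance
  between <- var(est)                        # between-imputation variance
  total   <- within + (1 + 1 / m) * between  # total variance
  c(estimate = q_bar, se = sqrt(total), within = within, between = between)
}

# Made-up per-imputation results, purely to exercise the function
pooled <- pool_rubin(est = c(0.50, 0.62, 0.55, 0.47, 0.58),
                     se  = c(0.10, 0.11, 0.10, 0.12, 0.10))
round(pooled, 4)
```

The pooled standard error is always at least as large as the typical within-imputation one. Analysing a single completed dataset, or the stacked long data as if it were one dataset, skips the between-imputation term and understates uncertainty.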

8. Sensitivity Analysis Is Not Optional

An audit-ready analysis asks:

“If our missingness assumption is wrong, how wrong could we be?”

Practical sensitivity approaches:

  • compare complete-case vs imputed
  • vary imputation models
  • include missingness indicators
  • bound estimates under worst cases

You don’t need many. You need the right ones.
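
Worst-case bounds are the simplest of these checks to compute for a binary outcome: assume every unrecorded outcome went one way, then the other. The counts below are hypothetical.

```r
# Hypothetical outcome vector: 1 = died, 0 = survived, NA = not recorded
died <- c(rep(1, 40), rep(0, 410), rep(NA, 50))

n_total   <- length(died)
n_missing <- sum(is.na(died))
n_events  <- sum(died, na.rm = TRUE)

complete_case <- n_events / (n_total - n_missing)   # ignores the 50 unknowns
lower_bound   <- n_events / n_total                 # every unknown survived
upper_bound   <- (n_events + n_missing) / n_total   # every unknown died

round(c(lower = lower_bound, complete_case = complete_case, upper = upper_bound), 3)
```

With these made-up counts the complete-case mortality is 0.089, while the data are compatible with anything from 0.080 to 0.180. If a conclusion survives across that whole interval, the missingness assumption does not matter much; if it flips, the assumption is doing the work and has to be defended.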


9. Bayesian Models Handle Missingness Naturally

Bayesian models shine here because they can represent uncertainty jointly rather than treating imputation as a disconnected preprocessing step (Gelman et al. 2013; Carpenter et al. 2021).

In practice, that means:

  • uncertainty is explicit,
  • missing values can be modeled jointly,
  • inference remains coherent.
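
library(brms)

# Each sub-model needs its own family: supplying bernoulli() once in brm()
# would also apply it to the continuous lactate sub-model.
fit <- brm(
  bf(outcome ~ predictor + mi(lactate), family = bernoulli()) +
    bf(lactate | mi() ~ predictor, family = gaussian()) +
    set_rescor(FALSE),  # no residual correlation across mixed families
  data = data
)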

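This does not “fill in” missing data once and move on. Each missing lactate value is treated as an unknown and estimated jointly with the outcome model, so its uncertainty carries through to the final inference, conditional on the lactate sub-model being reasonable.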


10. Missing Data Changes the Question You’re Answering

This is the most important conceptual point.

When you drop missing data, you are no longer answering:

“What is the effect in the population?”

You are answering:

“What is the effect among people with complete documentation under this workflow?”

That may be fine—but it must be explicit.


11. Documentation Is Part of the Model

Audit-ready missing data handling includes:

  • stated assumptions (MCAR/MAR/MNAR)
  • justification for chosen strategy
  • counts of excluded observations
  • sensitivity results
  • acknowledgment of limitations

# Per-column counts of missing (TRUE) and observed (FALSE) values
summary(is.na(data))

Transparency beats false precision every time.
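
The counts in that checklist can be produced as a small table that goes straight into a methods section or an appendix. This is a sketch on a toy data frame; `dat` and its columns are hypothetical.

```r
# Hypothetical analysis dataset
dat <- data.frame(
  age     = c(34, 51, NA, 47, 29, 63),
  lactate = c(2.1, NA, NA, 3.4, 1.8, NA),
  died    = c(0, 0, 1, 0, 0, 1)
)

# One row per variable: how many values are missing, and what share
missing_report <- data.frame(
  variable    = names(dat),
  n_missing   = colSums(is.na(dat)),
  pct_missing = round(100 * colMeans(is.na(dat)), 1),
  row.names   = NULL
)

# How many rows a complete-case analysis would silently remove
n_rows     <- nrow(dat)
n_complete <- sum(complete.cases(dat))
n_excluded <- n_rows - n_complete

missing_report
sprintf("Complete-case analysis would exclude %d of %d rows", n_excluded, n_rows)
```

Reporting `n_excluded` next to the per-variable table makes the cost of deletion visible: here half of the rows disappear, almost entirely because of one variable.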


12. A Practical Missing-Data Workflow

Before modeling

  • Visualize missingness
  • Identify structural vs accidental gaps

During modeling

  • Avoid silent deletion
  • Model missingness when plausible
  • Propagate uncertainty

After modeling

  • Run sensitivity checks
  • Document assumptions clearly
  • Report who was excluded and why

Note: Where This Shows Up in AI/ML

Imputation strategy is a modeling decision that propagates through every downstream analysis. A DoDTR-based mortality model trained with mean imputation and validated on multiply-imputed data can show misleading, often inflated, performance, because the validation set has better effective data quality than the training set and neither matches what the model will see in use. Production trauma AI models deployed in MAVEN must use the same imputation pipeline at training and inference time, or the deployment environment will present a different data distribution than the one the model was validated on. This is especially acute in forward operating environments, where prehospital documentation is sparse and missingness patterns differ sharply from garrison-based registry data. A model that cannot handle its own missing data pattern in production is not a deployed model; it is a liability.
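
One way to keep training and inference aligned is to treat the imputer as a fitted object: learn its parameters from training data once, store them, and apply exactly those stored values everywhere else. The sketch below uses median fill plus a missingness flag as a deliberately simple stand-in; `fit_imputer()`, `apply_imputer()`, and the toy data are hypothetical, and this single-value approach does not propagate imputation uncertainty.

```r
# Learn fill values from training data only
fit_imputer <- function(train, vars) {
  list(vars = vars,
       fill = vapply(train[vars], median, numeric(1), na.rm = TRUE))
}

# Apply the stored values to any data, adding a missingness flag per variable
apply_imputer <- function(imputer, newdata) {
  for (v in imputer$vars) {
    miss <- is.na(newdata[[v]])
    newdata[[paste0(v, "_missing")]] <- miss
    newdata[[v]][miss] <- imputer$fill[[v]]
  }
  newdata
}

train <- data.frame(lactate = c(1.5, 2.0, NA, 4.5), sbp = c(120, NA, 90, 110))
field <- data.frame(lactate = c(NA, 3.0), sbp = c(NA, NA))   # sparse prehospital record

imputer     <- fit_imputer(train, vars = c("lactate", "sbp"))
train_ready <- apply_imputer(imputer, train)
field_ready <- apply_imputer(imputer, field)   # same stored values, never refitted
field_ready
```

The design point is that `field` never influences the fill values. Refitting the imputer on deployment data, or on the validation set, would quietly change the inputs the model was trained to expect.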

Closing: The Real Model Is the System

Missing data reflects:

  • how care is delivered,
  • how clinicians document,
  • how systems are stressed,
  • and who gets measured.

Treating missingness as a nuisance hides the system.

Treating it as part of the model reveals reality.


Tip 📚 Go Deeper: Missing Data Toolkit

This post is part of the Missing Data Toolkit — a companion reference with missingness pattern diagnostics, multiple imputation templates, and sensitivity analysis grids.

→ Open the Missing Data Toolkit


Series Callout

Note

This post is part of a broader Trauma Registry and Other Topics Series:

  • Why Most Clinical Models Fail in the Real World (and How to Fix Them in R)
  • Audit-Ready Applied Statistics: How to Make Your R Analysis Defensible
  • Bayesian Models for Clinicians Who Hate Math (But Love Good Decisions)
  • Missing Data Is the Real Model: Practical Strategies in R
  • From Registry to Knowledge: How to Analyze Messy Trauma Data Without Lying to Yourself
  • Why Statistical Significance Is a Terrible Stopping Rule
  • Hierarchical Models Are Not Optional in Healthcare (Here’s Why)
  • Prediction ≠ Causation: How to Use Each Correctly in Applied Statistics
  • How to Evaluate Models When the Outcome Is Rare (and Lives Are at Stake)
  • Building Clinical Decision Support That Doesn’t Collapse Under Scrutiny
  • Rare Event Modeling in Clinical Prediction: Why 1% Outcomes Break Your Model (And What to Do in R)
  • Calibration Under Drift: How Clinical Models Become Confident and Wrong (And How to Monitor It in R)
  • Audit-Ready Bayesian Workflows: Why Transparency Is a Process, Not a Model Feature
  • Missing Data in Hierarchical Clinical Models: Why Structure Changes the Problem
  • MNAR Sensitivity Analysis for Applied Work: What to Do When Missingness Depends on Reality

References

Buuren, Stef van. 2018. Flexible Imputation of Missing Data. 2nd ed. Chapman and Hall/CRC.
Carpenter, James R., Melanie Smuk, Tim P. Morris, and Michael G. Kenward. 2021. “Missing Data: A Statistical Framework for Practice.” Biometrical Journal 63 (5): 915–47. https://doi.org/10.1002/bimj.202000196.
Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2013. Bayesian Data Analysis. 3rd ed. Chapman and Hall/CRC.
Little, Roderick J. A., and Donald B. Rubin. 2019. Statistical Analysis with Missing Data. 3rd ed. Wiley.
Sterne, Jonathan A. C., Ian R. White, John B. Carlin, et al. 2009. “Multiple Imputation for Missing Data in Epidemiological and Clinical Research: Potential and Pitfalls.” BMJ 338: b2393. https://doi.org/10.1136/bmj.b2393.