Decoding Missing Data: MCAR vs. MAR vs. MNAR in Real-World AI

Advanced Statistics
Missing Data
An applied introduction to MCAR, MAR, and MNAR, with simulations, diagnostics, and practical missing-data reasoning for AI and healthcare analytics.
Published

September 15, 2025

Modified

June 9, 2026

Executive Summary

Missing data is one of the most common and most consequential problems in applied statistics and machine learning (Rubin 1976; Little and Rubin 2019).

It is rarely just a technical nuisance.

Missingness changes:

  • who gets represented,
  • which patterns the model learns,
  • how uncertainty should be expressed,
  • and whether downstream conclusions can be trusted.

That is why the mechanism behind missingness matters so much.

The classical framework distinguishes among three broad mechanisms first formalized in the modern missing-data literature (Rubin 1976):

  • MCAR — Missing Completely at Random
  • MAR — Missing at Random
  • MNAR — Missing Not at Random

These labels are easy to memorize and easy to misuse.

In practice, they are not just definitions. They are assumptions about how the missingness process relates to the data.

This post introduces:

  • the difference between MCAR, MAR, and MNAR,
  • why those distinctions matter for inference and machine learning,
  • how to simulate missingness scenarios,
  • how to use diagnostics such as Little’s test,
  • and how to think more clearly about missingness in predictive healthcare analytics.

Missing data is not only about absent values. It is about the process that made those values absent in the first place. ## Missing Data Is a Modeling Problem, Not Just a Cleaning Problem

A common mistake is to treat missing data as a preprocessing annoyance:

  • drop incomplete rows,
  • fill in a few values,
  • move on.

That approach can be dangerously naïve.

Missingness is often informative. It may reflect:

  • disease severity,
  • workflow breakdown,
  • dropout,
  • transfer of care,
  • differential follow-up,
  • system overload,
  • or selective measurement.

This means missingness can carry signal about both the outcome and the data-generating process.

That is why missing data should be thought of as part of the modeling problem, not just part of the cleaning pipeline. ## The Key Question Is: Why Is the Data Missing?

The most important question in missing-data analysis is not:

how many values are missing?

It is:

why are they missing?

That is the question the MCAR/MAR/MNAR framework is trying to answer.

The same percentage of missingness can have very different implications depending on the mechanism.

For example:

  • 20% missing completely at random may be mostly a power problem
  • 20% missing driven by patient deterioration may be a major bias problem
  • 20% missing driven by the unobserved value itself may make standard imputation assumptions untenable

This is why mechanism matters more than the raw percentage alone. ## MCAR Means Missingness Is Unrelated to the Data

Missing Completely at Random (MCAR) means the probability that a value is missing does not depend on:

  • the observed data,
  • or the unobserved value itself.

In other words, the missingness process is unrelated to the substantive variables in the dataset.

This is the cleanest and strongest assumption.

Under MCAR:

  • complete cases are, in a probabilistic sense, a random subset of all cases
  • complete-case analysis may remain unbiased, though less efficient
  • missingness mainly costs information rather than distorting structure

But true MCAR is often uncommon in real clinical or operational data (Little and Rubin 2019).

It is possible, but usually the most optimistic assumption. ## MAR Means Missingness Depends on Observed Data

Missing at Random (MAR) means the probability of missingness may depend on observed variables, but not on the missing value itself after conditioning on those observed variables.

That is a more realistic scenario in many applied settings.

For example:

  • older patients may be more likely to miss follow-up,
  • more severe patients may be less likely to have certain labs repeated,
  • certain hospitals may record fields less consistently than others.

Under MAR, missingness is not random in the everyday sense. It is systematic. But it is systematic in a way that can be modeled using observed information.

This is why many modern imputation methods rely on MAR as the working assumption. ## MNAR Means Missingness Depends on the Unobserved Value Itself

Missing Not at Random (MNAR) means the missingness process depends on the missing value itself, even after accounting for the observed data.

This is the hardest case.

For example:

  • patients with the worst symptom burden may be least likely to complete follow-up,
  • those with extreme lab values may be selectively unmeasured,
  • patients who are doing poorly may drop out because of the outcome being studied.

MNAR is difficult because the missingness mechanism depends on information we do not fully observe.

That means the data alone usually cannot fully identify the problem without stronger assumptions or sensitivity analysis.

This is one reason MNAR is both important and frustrating. ## A Trial Dropout Example Makes the Distinctions Concrete

To make the mechanisms more intuitive, we will simulate a clinical-trial style dataset with a continuous patient-reported outcome and then impose different missingness mechanisms.

library(dplyr)
library(tibble)
library(ggplot2)

n <- 500

trial_df <- tibble::tibble(
  id = 1:n,
  age = rnorm(n, mean = 58, sd = 12),
  baseline_severity = rnorm(n, mean = 0, sd = 1),
  treatment = sample(c("Control", "Treatment"), size = n, replace = TRUE)
) |>
  dplyr::mutate(
    outcome = 50 - 4 * baseline_severity + 2 * (treatment == "Treatment") + rnorm(n, 0, 5)
  )

trial_df |>
  dplyr::summarise(
    mean_outcome = mean(outcome),
    sd_outcome = sd(outcome)
  )
# A tibble: 1 × 2
  mean_outcome sd_outcome
         <dbl>      <dbl>
1         51.0       6.52

Think of outcome as a follow-up score or symptom burden measure that is fully observed in the ideal dataset. ## Simulating MCAR Shows Pure Random Missingness

First, let us generate MCAR missingness by removing outcome values completely at random.

trial_mcar <- trial_df |>
  dplyr::mutate(
    outcome_mcar = if_else(runif(n) < 0.25, NA_real_, outcome)
  )

trial_mcar |>
  dplyr::summarise(
    pct_missing = mean(is.na(outcome_mcar))
  )
# A tibble: 1 × 1
  pct_missing
        <dbl>
1       0.238

Under this setup, the missingness probability is unrelated to any variable in the dataset.

That is the defining feature of MCAR. ## Simulating MAR Shows Missingness Driven by Observed Variables

Now let us create MAR missingness by making higher baseline severity increase the chance of missing follow-up.

mar_prob <- plogis(-1.3 + 1.0 * trial_df$baseline_severity)

trial_mar <- trial_df |>
  dplyr::mutate(
    mar_missing = rbinom(n, size = 1, prob = mar_prob),
    outcome_mar = if_else(mar_missing == 1, NA_real_, outcome)
  )

trial_mar |>
  dplyr::summarise(
    pct_missing = mean(is.na(outcome_mar))
  )
# A tibble: 1 × 1
  pct_missing
        <dbl>
1       0.236

Here, missingness is related to an observed variable, baseline_severity.

That makes it MAR, not MCAR.

The important point is that the missingness is systematic, but potentially modelable. ## Simulating MNAR Shows Missingness Driven by the Unobserved Outcome

Now let us create an MNAR mechanism by making missingness more likely when the outcome itself is poor.

mnar_prob <- plogis(-7 + 0.12 * trial_df$outcome)

trial_mnar <- trial_df |>
  dplyr::mutate(
    mnar_missing = rbinom(n, size = 1, prob = mnar_prob),
    outcome_mnar = if_else(mnar_missing == 1, NA_real_, outcome)
  )

trial_mnar |>
  dplyr::summarise(
    pct_missing = mean(is.na(outcome_mnar))
  )
# A tibble: 1 × 1
  pct_missing
        <dbl>
1       0.308

This is the difficult case.

The missingness now depends on the value that becomes missing.

That makes standard “observed-data-only” corrections much harder to justify. ## The Same Amount of Missingness Can Produce Different Bias

A useful way to compare the mechanisms is to see how the observed complete-case outcome mean differs from the true full-data mean.

true_mean <- mean(trial_df$outcome)

compare_tbl <- tibble::tibble(
  mechanism = c("Full Data", "MCAR", "MAR", "MNAR"),
  mean_estimate = c(
    true_mean,
    mean(trial_mcar$outcome_mcar, na.rm = TRUE),
    mean(trial_mar$outcome_mar, na.rm = TRUE),
    mean(trial_mnar$outcome_mnar, na.rm = TRUE)
  )
) |>
  dplyr::mutate(
    bias = mean_estimate - true_mean
  )

compare_tbl
# A tibble: 4 × 3
  mechanism mean_estimate   bias
  <chr>             <dbl>  <dbl>
1 Full Data          51.0  0    
2 MCAR               51.2  0.193
3 MAR                52.0  0.960
4 MNAR               49.7 -1.33 

This is one of the most important lessons in missing-data work (Little and Rubin 2019; Schafer 1997):

  • the mechanism affects bias,
  • not just the amount of missingness.

MCAR often mainly reduces efficiency. MAR and especially MNAR can distort the estimand itself. ## Visual Diagnostics Often Reveal Missingness Patterns

Before formal testing, one of the best first steps is to look at missingness patterns graphically.

For example, under MAR in our simulated data, missingness depends on baseline severity.

plot_df <- trial_mar |>
  dplyr::mutate(
    missing = is.na(outcome_mar)
  )

ggplot2::ggplot(plot_df, ggplot2::aes(x = baseline_severity, fill = missing)) +
  ggplot2::geom_histogram(bins = 30, position = "identity", alpha = 0.6) +
  ggplot2::labs(
    title = "Missingness Pattern by Baseline Severity",
    x = "Baseline Severity",
    y = "Count"
  ) +
  ggplot2::theme_minimal()

This kind of plot often gives an immediate clue that the missingness is not MCAR.

It may not prove MAR versus MNAR, but it can show clear dependence on observed variables. ## Little’s Test Is a Classical Diagnostic for MCAR

One of the best-known formal tests for MCAR is Little’s MCAR test (Little 1988).

The null hypothesis is that the data are consistent with MCAR.

This means:

  • a non-small p-value does not prove MCAR,
  • but a small p-value suggests the MCAR assumption is implausible.

In R, this is often available through packages such as BaylorEdPsych or other missing-data toolkits.

required_pkgs <- c("BaylorEdPsych")
missing_pkgs <- required_pkgs[
  !vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
]

if (length(missing_pkgs) > 0) {
  stop("Missing packages: ", paste(missing_pkgs, collapse = ", "))
}

test_df <- trial_mar |>
  dplyr::select(age, baseline_severity, outcome_mar)

BaylorEdPsych::LittleMCAR(test_df)

Little’s test can be useful, but it should be interpreted as one diagnostic among many, not as a final oracle. ## Logistic Missingness Models Are Often More Informative Than One Test Alone

Another practical strategy is to model the missingness indicator itself.

For example, define:

  • 1 = outcome missing
  • 0 = outcome observed

and regress that missingness indicator on observed variables.

miss_model_df <- trial_mar |>
  dplyr::mutate(
    missing = as.integer(is.na(outcome_mar))
  )

fit_missing <- glm(
  missing ~ age + baseline_severity + treatment,
  data = miss_model_df,
  family = binomial()
)

summary(fit_missing)

Call:
glm(formula = missing ~ age + baseline_severity + treatment, 
    family = binomial(), data = miss_model_df)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)        -0.945059   0.583478  -1.620    0.105    
age                -0.007575   0.009783  -0.774    0.439    
baseline_severity   1.097176   0.138195   7.939 2.03e-15 ***
treatmentTreatment -0.051768   0.231286  -0.224    0.823    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 546.43  on 499  degrees of freedom
Residual deviance: 463.77  on 496  degrees of freedom
AIC: 471.77

Number of Fisher Scoring iterations: 5

If observed variables strongly predict missingness, MCAR becomes less plausible.

This does not prove MAR, but it gives a more grounded sense of what the missingness process may depend on. ## MCAR, MAR, and MNAR Are Assumptions About the Data-Generating Process

One of the most important conceptual points is that these categories are not always directly testable from the observed data alone.

MCAR can sometimes be challenged empirically. MAR and MNAR are more difficult.

In practice:

  • MAR is often a working assumption,
  • MNAR is often a sensitivity concern,
  • and the real mechanism is rarely known with certainty.

This means missing-data analysis requires both:

  • diagnostics,
  • and subject-matter reasoning.

A purely mechanical classification approach is usually too simplistic. ## Missingness Mechanisms Matter for Imputation Strategy

Why does all this matter so much in AI/ML workflows?

Because the mechanism affects what imputation or modeling strategy is defensible.

For example:

Under MCAR

Simple methods may sometimes be less harmful, though efficiency is still lost.

Under MAR

Methods like multiple imputation or model-based imputation can be appropriate if the observed predictors of missingness are included (Rubin 1987; Buuren 2018).

Under MNAR

Standard imputation under MAR may still be biased, and sensitivity analysis becomes much more important.

This is why mechanism and imputation cannot be separated. The imputation model is only as good as the missingness assumptions behind it. ## Complete-Case Analysis Is Often Easy, but Often Too Optimistic

A common default is complete-case analysis: analyze only rows without missing data.

This can be acceptable under true MCAR, but it can be problematic under MAR or MNAR.

Problems include:

  • biased estimates,
  • distorted subgroup representation,
  • reduced power,
  • and unfair exclusion of under-measured populations.

This is especially important in predictive healthcare analytics, where complete-case filtering can quietly remove exactly the patients with the most complex or clinically important trajectories.

So complete-case analysis should be viewed as a modeling choice with assumptions, not as a neutral default. ## Missingness Indicators Can Sometimes Add Predictive Value

In predictive modeling, it is sometimes useful to include an indicator that a value was missing.

For example:

trial_mar |>
  dplyr::mutate(
    outcome_missing = is.na(outcome_mar)
  ) |>
  dplyr::count(outcome_missing)
# A tibble: 2 × 2
  outcome_missing     n
  <lgl>           <int>
1 FALSE             382
2 TRUE              118

A missingness indicator does not solve the missing-data problem by itself. But it can sometimes help the model learn that the fact of missingness carries signal.

This is especially relevant in ML pipelines where missingness may reflect workflow, severity, or system context.

That said, missingness indicators should complement, not replace, careful thinking about mechanism. ## Python and R Both Support Practical Missingness Diagnostics

The missing-data workflow is not language-specific.

In R, analysts often use:

  • naniar
  • VIM
  • mice
  • BaylorEdPsych
  • Amelia

In Python, comparable workflows often involve:

  • pandas for missingness summaries
  • missingno for visualization
  • statsmodels or sklearn for missingness modeling
  • custom simulation and imputation pipelines

The core statistical questions remain the same in both ecosystems:

  • what is missing?
  • why is it missing?
  • and what assumptions are being made when we fill or ignore it? ## A Practical Checklist for Applied Work

Before choosing a missing-data strategy, ask:

  • Which variables are missing, and how often?
  • Do observed variables predict missingness?
  • Is MCAR plausible, or clearly implausible?
  • Is MAR a defensible working assumption?
  • Could the missingness plausibly depend on the unobserved value itself?
  • What would complete-case analysis exclude?
  • Does the imputation model include variables related to both missingness and the outcome?
  • Do I need sensitivity analysis for possible MNAR behavior?

These questions usually improve missing-data work more than rushing straight to imputation.

NoteWhere This Shows Up in AI/ML

In DoDTR-based mortality models, prehospital vital signs and mechanism-of-injury fields are frequently absent not at random but because they were never recorded in the chaos of point-of-injury care — a textbook MNAR scenario. A model trained without accounting for this mechanism can learn that missing prehospital data predicts survival, inverting the true clinical signal. Epic’s deterioration index faces the same problem: lab values are missing because the clinician didn’t order them, often because the patient appeared stable, creating a systematic association between missingness and perceived low risk. When the missingness mechanism is misclassified as MCAR and treated as ignorable, every downstream imputation, weighting, and inference step is built on a false foundation.

Closing: Missing Data Is About Mechanism, Not Just Percentage

Missing data analysis starts to improve when we stop asking only how much data are missing and start asking why the values are missing.

MCAR, MAR, and MNAR are not just vocabulary terms. They are different assumptions about the missingness process, and those assumptions shape what kind of inference or prediction is defensible.

In real-world AI and healthcare analytics, this matters because missingness is often informative. It reflects people, processes, systems, and outcomes — not just blank cells in a table.

Missing-data mechanisms matter because absent values still carry information, and the process that creates missingness can bias a model just as much as the values themselves.

Tip📚 Go Deeper: Missing Data Toolkit

This post is part of the Missing Data Toolkit — a companion reference with missingness mechanism diagnostics, Little’s MCAR test templates, and workflow scaffolds for handling MCAR, MAR, and MNAR in clinical data.

→ Open the Missing Data Toolkit

References

Buuren, Stef van. 2018. Flexible Imputation of Missing Data. 2nd ed. Chapman; Hall/CRC. https://doi.org/10.1201/9780429492259.
Little, Roderick J. A. 1988. “A Test of Missing Completely at Random for Multivariate Data with Missing Values.” Journal of the American Statistical Association 83 (404): 1198–202. https://doi.org/10.1080/01621459.1988.10478722.
Little, Roderick J. A., and Donald B. Rubin. 2019. Statistical Analysis with Missing Data. 3rd ed. Wiley. https://doi.org/10.1002/9781119482260.
Rubin, Donald B. 1976. “Inference and Missing Data.” Biometrika 63 (3): 581–92. https://doi.org/10.1093/biomet/63.3.581.
Rubin, Donald B. 1987. Multiple Imputation for Nonresponse in Surveys. Wiley. https://doi.org/10.1002/9780470316696.
Schafer, Joseph L. 1997. Analysis of Incomplete Multivariate Data. Chapman; Hall/CRC.