Causal Inference Toolkit (DAGs, Propensity Scores, Sensitivity)

A practical toolkit for observational causal analysis, including estimand framing, DAG-guided adjustment, propensity score diagnostics, standardized balance checks, and sensitivity analysis prompts.
Published: April 1, 2026
Modified: June 9, 2026

Executive Summary

This toolkit is a reusable framework for causal analysis in observational data.

It includes:

  • estimand framing
  • DAG-oriented covariate selection prompts
  • propensity score workflow templates
  • balance diagnostics
  • weighting and matching scaffolds
  • sensitivity-analysis prompts
  • reviewer-facing language for transparent reporting

The goal is not to make observational data behave like randomized data by assertion. The goal is to make assumptions explicit, adjustment strategies legible, and conclusions more defensible (Hernán and Robins 2020; Pearl 2009; Rosenbaum and Rubin 1983).


1. Start With the Causal Question

Before fitting any model, define the causal question in plain language.

A useful minimum structure is:

  • treatment/exposure: what is being compared?
  • outcome: what endpoint matters?
  • target population: to whom does the effect apply?
  • time zero: when does follow-up begin?
  • estimand: what causal contrast is intended?

Without that structure, many “causal” analyses are really predictive or descriptive analyses wearing causal language (Hernán and Robins 2020).
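
One way to make this structure hard to skip is to record it as a small object and check it before any modeling starts. The sketch below is only an illustration: `check_causal_question()` is a hypothetical helper, not part of any package, and the example values are placeholders rather than a real study.

```r
# Hypothetical helper: record the causal question and flag missing elements.
check_causal_question <- function(question) {
  required <- c("treatment", "outcome", "target_population", "time_zero", "estimand")

  # A field counts as blank if it is absent, NA, or an empty string.
  is_blank <- vapply(required, function(field) {
    value <- question[[field]]
    is.null(value) || length(value) == 0 || is.na(value[1]) ||
      !nzchar(trimws(as.character(value[1])))
  }, logical(1))

  missing_fields <- required[is_blank]
  if (length(missing_fields) > 0) {
    stop("Causal question is incomplete; missing: ",
         paste(missing_fields, collapse = ", "))
  }
  invisible(question)
}

# Placeholder values for illustration only.
example_question <- list(
  treatment = "intervention A vs. intervention B",
  outcome = "30-day mortality",
  target_population = "adults meeting the eligibility criteria",
  time_zero = "time of eligibility and treatment decision",
  estimand = "ATE on the risk difference scale"
)

check_causal_question(example_question)
```

The check is deliberately shallow. It cannot tell whether the answers are good, only whether they were written down.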


2. Define the Estimand Explicitly

For example, a simple average treatment effect can be written as:

\[ ATE = E\{Y(1) - Y(0)\} \]

An average treatment effect in the treated can be written as:

\[ ATT = E\{Y(1) - Y(0) \mid A=1\} \]

The first question in a toolkit workflow is not “which estimator should I use?” but “which estimand am I targeting?” (Hernán and Robins 2020; Rubin 1974).
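
A small simulation can show why the choice matters. The sketch below assumes a made-up data-generating process in which the treatment effect grows with a covariate `x` and treatment is more likely when `x` is high. Both potential outcomes are visible only because the data are simulated; all variable names are illustrative.

```r
set.seed(2026)
n <- 10000

x  <- stats::rnorm(n)                        # baseline covariate
a  <- stats::rbinom(n, 1, stats::plogis(x))  # treatment is more likely when x is high
y0 <- x + stats::rnorm(n)                    # potential outcome without treatment
y1 <- y0 + 1 + x                             # individual effect is 1 + x (heterogeneous)

ate <- mean(y1 - y0)             # average over the whole population
att <- mean((y1 - y0)[a == 1])   # average over the treated only

c(ATE = ate, ATT = att)
```

Because treated units tend to have higher `x`, and the effect increases with `x`, the ATT is larger than the ATE here. An estimator that targets one will not recover the other unless the effect is homogeneous.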


3. DAG-Oriented Covariate Selection Prompts

A directed acyclic graph is a thinking tool for deciding which variables are plausible confounders, mediators, colliders, or proxies (Pearl 2009).

Before adjustment, document each candidate variable under one of these labels:

  • likely confounder
  • likely mediator
  • possible collider
  • exposure proxy
  • outcome proxy
  • precision variable only

A simple worksheet table is often enough.

dag_adjustment_table <- tibble::tribble(
  ~variable, ~role, ~include_in_primary_adjustment, ~rationale,
  "age", "likely confounder", TRUE, "Affects both treatment assignment and outcome risk",
  "injury_severity", "likely confounder", TRUE, "Strong driver of treatment and outcome",
  "post_treatment_lab", "likely mediator", FALSE, "Measured after treatment initiation",
  "consult_service", "possible collider", FALSE, "May reflect downstream care pathway"
)

dag_adjustment_table
# A tibble: 4 × 4
  variable           role              include_in_primary_adjustment rationale  
  <chr>              <chr>             <lgl>                         <chr>      
1 age                likely confounder TRUE                          Affects bo…
2 injury_severity    likely confounder TRUE                          Strong dri…
3 post_treatment_lab likely mediator   FALSE                         Measured a…
4 consult_service    possible collider FALSE                         May reflec…
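
The worksheet can also drive the model formula, so that the adjustment set used in code cannot drift from the one documented. A minimal sketch, using a base data frame copy of the table above so that it runs without tibble; the response name `treatment` is an assumption that matches the template in Section 4.

```r
# Self-contained copy of the worksheet as a base data frame.
worksheet <- data.frame(
  variable = c("age", "injury_severity", "post_treatment_lab", "consult_service"),
  role = c("likely confounder", "likely confounder",
           "likely mediator", "possible collider"),
  include_in_primary_adjustment = c(TRUE, TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Guard: stop if a mediator or collider has been flagged for adjustment.
flagged <- worksheet$include_in_primary_adjustment &
  worksheet$role %in% c("likely mediator", "possible collider")
if (any(flagged)) {
  stop("Review before adjusting: ",
       paste(worksheet$variable[flagged], collapse = ", "))
}

adjustment_set <- worksheet$variable[worksheet$include_in_primary_adjustment]

# Build the propensity score formula from the documented adjustment set.
ps_formula <- stats::reformulate(adjustment_set, response = "treatment")
ps_formula
```

The guard encodes only the two roles this worksheet treats as exclusions. A real project would extend it to whatever role labels the team uses.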

4. Baseline Propensity Score Template

Propensity scores estimate:

\[ e(X) = P(A = 1 \mid X) \]

where \(A\) is treatment and \(X\) is the covariate vector (Rosenbaum and Rubin 1983).

# Example:
# ps_fit <- stats::glm(
#   treatment ~ age + severity + comorbidity + mechanism,
#   data = data,
#   family = stats::binomial()
# )
#
# data$ps <- stats::predict(ps_fit, type = "response")

The purpose of the propensity score is not prediction quality by itself. Its purpose is to help achieve covariate balance relevant to the causal question (Austin 2011).
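
Before using the scores, it helps to look at how they are distributed in each treatment group. The sketch below runs on simulated data with made-up coefficients; `sim`, `age`, and `severity` are illustrative stand-ins for a real analytic data set.

```r
set.seed(2026)
n <- 2000

# Simulated data: treatment depends on age and severity (coefficients are invented).
sim <- data.frame(
  age = stats::rnorm(n, mean = 45, sd = 12),
  severity = stats::rnorm(n, mean = 20, sd = 8)
)
sim$treatment <- stats::rbinom(
  n, 1, stats::plogis(-3 + 0.02 * sim$age + 0.1 * sim$severity)
)

ps_fit <- stats::glm(treatment ~ age + severity, data = sim,
                     family = stats::binomial())
sim$ps <- stats::predict(ps_fit, type = "response")

# Propensity score quantiles within each treatment group.
ps_by_group <- tapply(sim$ps, sim$treatment, stats::quantile,
                      probs = c(0, 0.05, 0.5, 0.95, 1))
ps_by_group

# Range of scores observed in both groups, and the share of units outside it.
is_treated <- sim$treatment == 1
common_support <- c(
  lower = max(min(sim$ps[is_treated]), min(sim$ps[!is_treated])),
  upper = min(max(sim$ps[is_treated]), max(sim$ps[!is_treated]))
)
outside <- mean(sim$ps < common_support["lower"] | sim$ps > common_support["upper"])
outside
```

A large share of units outside the common range, or group distributions that barely overlap, suggests that the comparison relies on extrapolation rather than on comparable patients.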


5. Standardized Mean Difference Diagnostics

Covariate balance should be evaluated with standardized differences rather than significance tests (Austin 2009).

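# Standardized mean difference for a numeric covariate x across a binary
# treatment indicator z (coded 0/1).
std_diff_numeric <- function(x, z) {
  m1 <- mean(x[z == 1], na.rm = TRUE)
  m0 <- mean(x[z == 0], na.rm = TRUE)
  # Group variances; the denominator below is the pooled standard deviation.
  v1 <- stats::var(x[z == 1], na.rm = TRUE)
  v0 <- stats::var(x[z == 0], na.rm = TRUE)
  (m1 - m0) / sqrt((v1 + v0) / 2)
}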

# Example:
# std_diff_numeric(data$age, data$treatment)

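A standardized difference near zero is better evidence of balance than a non-significant p-value, especially in large or uneven samples, because it does not depend on sample size. An absolute standardized difference below 0.1 is often taken to indicate negligible imbalance (Austin 2009).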
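
The same diagnostic can be computed after weighting. The sketch below is one convention among several: the numerator uses weighted means, while the denominator keeps the unweighted pooled standard deviation so that the before and after values are on the same scale. `std_diff_weighted()` and the simulated data are illustrative, not part of any package.

```r
# Weighted standardized mean difference (hypothetical helper).
# With w = 1 for everyone it reduces to the unweighted version.
std_diff_weighted <- function(x, z, w = rep(1, length(x))) {
  keep <- !is.na(x) & !is.na(z) & !is.na(w)
  x <- x[keep]; z <- z[keep]; w <- w[keep]
  m1 <- stats::weighted.mean(x[z == 1], w[z == 1])
  m0 <- stats::weighted.mean(x[z == 0], w[z == 0])
  v1 <- stats::var(x[z == 1])  # unweighted group variances
  v0 <- stats::var(x[z == 0])
  (m1 - m0) / sqrt((v1 + v0) / 2)
}

# Simulated example with invented coefficients.
set.seed(2026)
n <- 5000
age <- stats::rnorm(n, mean = 45, sd = 12)
severity <- stats::rnorm(n, mean = 20, sd = 8)
treatment <- stats::rbinom(n, 1, stats::plogis(-3 + 0.02 * age + 0.1 * severity))

ps <- stats::predict(
  stats::glm(treatment ~ age + severity, family = stats::binomial()),
  type = "response"
)
w <- ifelse(treatment == 1, 1 / ps, 1 / (1 - ps))

balance <- data.frame(
  covariate = c("age", "severity"),
  smd_unweighted = c(std_diff_weighted(age, treatment),
                     std_diff_weighted(severity, treatment)),
  smd_weighted = c(std_diff_weighted(age, treatment, w),
                   std_diff_weighted(severity, treatment, w))
)
balance
```

Reporting both columns side by side makes it easy for a reviewer to see which covariates the weights actually balanced.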


6. Inverse Probability Weighting Template

For an ATE-style analysis, common unstabilized weights are:

\[ w_i = \frac{A_i}{e(X_i)} + \frac{1 - A_i}{1 - e(X_i)} \]

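# Unstabilized ATE weights: 1 / e(X) for treated, 1 / (1 - e(X)) for untreated.
# `trim` is a tail probability (e.g. 0.01). Weights beyond the `trim` and
# `1 - trim` quantiles are truncated to those quantiles, not dropped.
make_ate_weights <- function(treatment, ps, trim = NULL) {
  # Scores of exactly 0 or 1 give infinite weights and signal a positivity problem.
  stopifnot(all(ps > 0 & ps < 1, na.rm = TRUE))

  w <- ifelse(treatment == 1, 1 / ps, 1 / (1 - ps))

  if (!is.null(trim)) {
    lo <- stats::quantile(w, probs = trim, na.rm = TRUE)
    hi <- stats::quantile(w, probs = 1 - trim, na.rm = TRUE)
    w <- pmin(pmax(w, lo), hi)
  }

  w
}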

# Example:
# data$w_ate <- make_ate_weights(data$treatment, data$ps, trim = 0.01)

Weights should always be examined for positivity problems and extreme values before outcome modeling (Hernán and Robins 2020).
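
Two quick summaries are often useful at this point: stabilized weights, which multiply each weight by the marginal probability of the treatment actually received, and the effective sample size, which shows how much information is lost to variable weights. The sketch below uses hypothetical helper names and simulated data. It plugs in the true propensity score only to keep the example short; in practice the score would be estimated.

```r
# Stabilized ATE weights: P(A = a) / P(A = a | X).
make_stabilized_ate_weights <- function(treatment, ps) {
  p_treat <- mean(treatment == 1)
  ifelse(treatment == 1, p_treat / ps, (1 - p_treat) / (1 - ps))
}

# Kish effective sample size: (sum of weights)^2 / (sum of squared weights).
effective_sample_size <- function(w) {
  sum(w)^2 / sum(w^2)
}

set.seed(2026)
n <- 2000
x <- stats::rnorm(n)
ps <- stats::plogis(0.8 * x)            # true score, known only because data are simulated
treatment <- stats::rbinom(n, 1, ps)

sw <- make_stabilized_ate_weights(treatment, ps)

summary(sw)                                  # the mean should be close to 1
group_n   <- tapply(sw, treatment, length)
group_ess <- tapply(sw, treatment, effective_sample_size)
rbind(n = group_n, ess = group_ess)
```

A mean stabilized weight far from 1, or an effective sample size much smaller than the group size, is a prompt to revisit overlap and the propensity model before fitting any outcome model.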


7. Outcome Modeling After Weighting

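# Example:
# quasibinomial() gives the same point estimates as binomial() but avoids the
# "non-integer #successes" warning that non-integer weights trigger.
# fit_weighted <- stats::glm(
#   outcome ~ treatment,
#   data = data,
#   family = stats::quasibinomial(),
#   weights = w_ate
# )
#
# summary(fit_weighted)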

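This simple model is not the whole causal argument. It is the last step in a workflow that depends on correct time alignment, reasonable covariate selection, adequate overlap, and diagnostic transparency. The model-based standard errors that summary() reports treat the weights as fixed and known, so robust (sandwich) or bootstrap standard errors are generally preferred for inference.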
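
For a binary outcome, the weighted contrast can also be computed directly as a difference in weighted risks. The sketch below, on simulated data with invented coefficients, pairs that estimate with a percentile bootstrap that refits the propensity model in every resample, which is one common way to let the interval reflect that the weights were estimated. `ipw_risk_difference()` is a hypothetical helper.

```r
set.seed(2026)
n <- 2000

# Simulated data: severity confounds treatment and outcome (coefficients are invented).
sim <- data.frame(severity = stats::rnorm(n))
sim$treatment <- stats::rbinom(n, 1, stats::plogis(0.8 * sim$severity))
sim$outcome <- stats::rbinom(
  n, 1, stats::plogis(-1 + 0.7 * sim$severity - 0.5 * sim$treatment)
)

# Weighted risk difference with normalized (Hajek-style) weights.
ipw_risk_difference <- function(d) {
  ps_fit <- stats::glm(treatment ~ severity, data = d, family = stats::binomial())
  ps <- stats::predict(ps_fit, type = "response")
  w <- ifelse(d$treatment == 1, 1 / ps, 1 / (1 - ps))
  is_treated <- d$treatment == 1
  risk1 <- stats::weighted.mean(d$outcome[is_treated], w[is_treated])
  risk0 <- stats::weighted.mean(d$outcome[!is_treated], w[!is_treated])
  risk1 - risk0
}

estimate <- ipw_risk_difference(sim)

# Percentile bootstrap: resample rows, then repeat the whole procedure.
boot_estimates <- replicate(
  200,
  ipw_risk_difference(sim[sample.int(n, replace = TRUE), ])
)
ci <- stats::quantile(boot_estimates, probs = c(0.025, 0.975))

c(estimate = estimate, ci)
```

Two hundred resamples keep the example fast; a real analysis would normally use more.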


8. Matching Template

Some applications are better served by matching than by weighting.

required_pkgs <- c("MatchIt")
missing_pkgs <- required_pkgs[
  !vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
]
if (length(missing_pkgs) > 0) {
  stop("Missing packages: ", paste(missing_pkgs, collapse = ", "))
}

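# Nearest-neighbor matching on a logistic-regression propensity score.
# MatchIt targets the ATT by default; the estimand is stated explicitly here
# so that the matching choice and the target estimand stay aligned.
m_out <- MatchIt::matchit(
  treatment ~ age + severity + comorbidity + mechanism,
  data = data,
  method = "nearest",
  distance = "glm",
  estimand = "ATT"
)

# Matched sample, including matching weights and subclass identifiers.
matched_data <- MatchIt::match.data(m_out)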

The right matching strategy depends on the estimand, overlap, and tolerance for bias-variance tradeoffs.
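
To make the mechanics concrete, here is a base R sketch of greedy 1:1 nearest-neighbor matching without replacement on the logit of the propensity score. It is an illustration of the general idea only, not MatchIt's implementation. The function name, the processing order, and the caliper of 0.2 standard deviations are assumptions chosen for the example.

```r
# Hypothetical helper: greedy 1:1 matching on the logit propensity score.
greedy_match <- function(treatment, ps, caliper = 0.2) {
  lps <- stats::qlogis(ps)
  max_dist <- caliper * stats::sd(lps)

  treated <- which(treatment == 1)
  controls <- which(treatment == 0)

  # Process treated units from the highest score down.
  treated <- treated[order(lps[treated], decreasing = TRUE)]
  matched_control <- rep(NA_integer_, length(treated))

  for (k in seq_along(treated)) {
    if (length(controls) == 0) break
    d <- abs(lps[controls] - lps[treated[k]])
    j <- which.min(d)
    if (d[j] <= max_dist) {
      matched_control[k] <- controls[j]
      controls <- controls[-j]  # without replacement
    }
  }

  keep <- !is.na(matched_control)
  data.frame(treated = treated[keep], control = matched_control[keep])
}

# Simulated example with an invented treatment model.
set.seed(2026)
n <- 1000
x <- stats::rnorm(n)
treatment <- stats::rbinom(n, 1, stats::plogis(-1 + 0.8 * x))
ps <- stats::predict(
  stats::glm(treatment ~ x, family = stats::binomial()),
  type = "response"
)

matched_pairs <- greedy_match(treatment, ps)

raw_diff <- mean(x[treatment == 1]) - mean(x[treatment == 0])
matched_diff <- mean(x[matched_pairs$treated]) - mean(x[matched_pairs$control])
c(n_treated = sum(treatment), n_pairs = nrow(matched_pairs),
  raw_diff = raw_diff, matched_diff = matched_diff)
```

Treated units with no control inside the caliper are dropped. That improves balance but changes the population the estimate describes, so the result is no longer the ATT for the full treated group.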


9. Sensitivity Analysis Prompts

A causal workflow should always state what could still go wrong after adjustment.

Minimum prompts:

  • Could an important confounder remain unmeasured?
  • Could treatment timing create immortal time bias?
  • Could post-treatment variables have entered the adjustment set?
  • Could selection into the analytic sample distort the estimand?
  • Is the effect likely transportable to the target population?

When appropriate, an E-value can be used as one sensitivity summary for unmeasured confounding (VanderWeele and Ding 2017).
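
For a risk ratio \(RR \geq 1\), the E-value is \(RR + \sqrt{RR(RR - 1)}\); for a protective estimate, the reciprocal is taken first (VanderWeele and Ding 2017). A minimal sketch follows, with hypothetical function names. It applies to risk ratios only; other effect measures need to be converted to an approximate risk ratio beforehand.

```r
# E-value for a point estimate on the risk ratio scale.
e_value <- function(rr) {
  stopifnot(is.numeric(rr), all(rr > 0))
  rr <- ifelse(rr < 1, 1 / rr, rr)  # protective estimates: use the reciprocal
  rr + sqrt(rr * (rr - 1))
}

# E-value for a confidence interval: use the limit closest to the null.
# If the interval includes 1, the E-value is 1.
e_value_ci <- function(lower, upper) {
  stopifnot(lower > 0, upper >= lower)
  if (lower <= 1 && upper >= 1) {
    return(1)
  }
  e_value(if (lower > 1) lower else upper)
}

e_value(2)            # 2 + sqrt(2)
e_value(0.5)          # same as e_value(2)
e_value_ci(1.2, 3.4)  # based on the lower limit, 1.2
e_value_ci(0.8, 1.5)  # interval includes 1
```

The E-value summarizes how strong an unmeasured confounder would need to be, on the risk ratio scale with both treatment and outcome, to fully explain away the estimate. It does not say whether such a confounder is plausible; that judgment still rests on subject-matter knowledge.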


10. Reviewer-Facing Reporting Language

Use language like this in methods or appendices:

The primary analysis was designed to estimate a causal contrast from observational data rather than to optimize prediction. We first defined the target estimand and time zero, then selected adjustment variables based on subject-matter reasoning and DAG-oriented thinking. Propensity score methods were used to improve measured covariate balance, and balance was assessed with standardized differences rather than significance tests (Hernán and Robins 2020; Rosenbaum and Rubin 1983; Austin 2009).


11. Minimum Reporting Checklist

  • treatment, outcome, target population, and time zero defined
  • estimand stated explicitly
  • covariate roles documented
  • positivity and overlap considered
  • balance diagnostics shown
  • weighting or matching decisions justified
  • residual bias and sensitivity limits discussed
  • predictive claims separated from causal claims

Note: Where This Shows Up in AI/ML

The single most important methodological distinction in military health AI is between a model built to predict outcomes and a model built to inform interventions: the former requires only predictive validity, while the latter requires causal identification. When MAVEN decision support tools recommend treatment protocols based on predicted outcome improvements, those recommendations are implicit causal claims, and deploying them without causal validation is the intervention fallacy at scale. The target trial emulation framework, applied to DoDTR linked with MHS GENESIS, is the rigorous path to intervention-informing evidence from observational military health data. A prediction model embedded in a clinical workflow without causal grounding will optimize for the observable patterns in its training data, including patterns driven by physician judgment, resource availability, and documentation practice, and not for the treatment effects that actually improve outcomes.

12. Closing

Causal inference in observational data is not a matter of choosing a fashionable estimator. It is a matter of disciplined design logic, transparent assumptions, and careful diagnostics.

A good causal toolkit does not promise certainty. It makes the causal argument easier to inspect.

Series Callout

Note

This post is part of a broader Toolkit Series for Applied Statistics, AI, and Clinical Analytics:

  • Bayesian Workflow Toolkit
  • Calibration Toolkit
  • Missing Data Toolkit
  • Rare Events Toolkit
  • Causal Inference Toolkit
  • Survival Analysis Toolkit
  • Prediction Modeling Toolkit
  • Real-World Evidence Toolkit
  • OMOP and Interoperability Toolkit
  • Trauma Registry Analytics Toolkit

References

Austin, Peter C. 2009. “Balance Diagnostics for Comparing the Distribution of Baseline Covariates Between Treatment Groups in Propensity-Score Matched Samples.” Statistics in Medicine 28 (25): 3083–107. https://doi.org/10.1002/sim.3697.
Austin, Peter C. 2011. “An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies.” Multivariate Behavioral Research 46 (3): 399–424. https://doi.org/10.1080/00273171.2011.568786.
Hernán, Miguel A., and James M. Robins. 2020. Causal Inference: What If. Chapman & Hall/CRC.
Pearl, Judea. 2009. Causality: Models, Reasoning, and Inference. 2nd ed. Cambridge University Press.
Rosenbaum, Paul R., and Donald B. Rubin. 1983. “The Central Role of the Propensity Score in Observational Studies for Causal Effects.” Biometrika 70 (1): 41–55. https://doi.org/10.1093/biomet/70.1.41.
Rubin, Donald B. 1974. “Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies.” Journal of Educational Psychology 66 (5): 688–701. https://doi.org/10.1037/h0037350.
VanderWeele, Tyler J., and Peng Ding. 2017. “Sensitivity Analysis in Observational Research: Introducing the E-Value.” Annals of Internal Medicine 167 (4): 268–74. https://doi.org/10.7326/M16-2607.