A practical toolkit for observational causal analysis, including estimand framing, DAG-guided adjustment, propensity score diagnostics, standardized balance checks, and sensitivity analysis prompts.
Published: April 1, 2026
Modified: June 9, 2026
Executive Summary
This toolkit is a reusable framework for causal analysis in observational data.
It includes:
estimand framing
DAG-oriented covariate selection prompts
propensity score workflow templates
balance diagnostics
weighting and matching scaffolds
sensitivity-analysis prompts
reviewer-facing language for transparent reporting
The goal is not to make observational data behave like randomized data by assertion. The goal is to make assumptions explicit, adjustment strategies legible, and conclusions more defensible (Hernán and Robins 2020; Pearl 2009; Rosenbaum and Rubin 1983).
1. Start With the Causal Question
Before fitting any model, define the causal question in plain language.
A useful minimum structure is:
treatment/exposure: what is being compared?
outcome: what endpoint matters?
target population: to whom does the effect apply?
time zero: when does follow-up begin?
estimand: what causal contrast is intended?
Without that structure, many “causal” analyses are really predictive or descriptive analyses wearing causal language (Hernán and Robins 2020).
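One way to enforce that minimum structure is to write it down as a small object and check it before any modeling starts. The sketch below is only an illustration: the field values and the helper `missing_fields()` are hypothetical and not part of the original toolkit.

```r
# Hypothetical specification of the causal question; all values are illustrative.
causal_question <- list(
  treatment         = "early intervention vs. delayed intervention",
  outcome           = "30-day mortality",
  target_population = "adults meeting eligibility criteria at admission",
  time_zero         = "time of admission",
  estimand          = "ATE"
)

# Return the names of required fields that are missing or blank.
missing_fields <- function(spec,
                           required = c("treatment", "outcome",
                                        "target_population", "time_zero",
                                        "estimand")) {
  is_blank <- vapply(required, function(field) {
    value <- spec[[field]]
    is.null(value) || length(value) == 0 ||
      is.na(value[1]) || !nzchar(trimws(value[1]))
  }, logical(1))
  required[is_blank]
}

# An empty result means every element of the question has been stated.
missing_fields(causal_question)
```

An analysis script could stop early when `missing_fields()` returns anything, which keeps the causal question from being filled in after the results are known.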
2. Define the Estimand Explicitly
For example, a simple average treatment effect can be written as:
\[
ATE = E\{Y(1) - Y(0)\}
\]
An average treatment effect in the treated can be written as:
\[
ATT = E\{Y(1) - Y(0) \mid A=1\}
\]
The first question in a toolkit workflow is not “which estimator should I use?” but “which estimand am I targeting?” (Hernán and Robins 2020; Rubin 1974).
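The difference between the two estimands is easiest to see in a simulation where both potential outcomes are known. The sketch below uses made-up data in which the treatment effect is larger for units with `x = 1` and those units are also more likely to be treated. All variable names and parameter values are assumptions chosen for illustration.

```r
set.seed(2026)
n <- 10000

# Binary covariate that modifies the effect and drives treatment assignment
x  <- rbinom(n, size = 1, prob = 0.5)

# Potential outcomes: the individual effect is 1 when x = 0 and 3 when x = 1
y0 <- rnorm(n)
y1 <- y0 + 1 + 2 * x

# Units with x = 1 are treated with probability 0.8, others with probability 0.2
a  <- rbinom(n, size = 1, prob = 0.2 + 0.6 * x)

# With both potential outcomes observed (only possible in simulation),
# the estimands are simple averages of the individual effects.
ate <- mean(y1 - y0)
att <- mean((y1 - y0)[a == 1])

c(ATE = ate, ATT = att)
```

By construction the true ATE is 2 and the true ATT is 2.6, because 80% of treated units have `x = 1`. An estimator that targets one of these quantities is not a noisy estimate of the other, which is why the estimand has to be chosen first.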
3. DAG-Oriented Covariate Selection Prompts
A directed acyclic graph is a thinking tool for deciding which variables are plausible confounders, mediators, colliders, or proxies (Pearl 2009).
Before adjustment, document each candidate variable under one of these labels:
likely confounder
likely mediator
possible collider
exposure proxy
outcome proxy
precision variable only
A simple worksheet table is often enough.
dag_adjustment_table <- tibble::tribble(
  ~variable,            ~role,               ~include_in_primary_adjustment, ~rationale,
  "age",                "likely confounder", TRUE,  "Affects both treatment assignment and outcome risk",
  "injury_severity",    "likely confounder", TRUE,  "Strong driver of treatment and outcome",
  "post_treatment_lab", "likely mediator",   FALSE, "Measured after treatment initiation",
  "consult_service",    "possible collider", FALSE, "May reflect downstream care pathway"
)

dag_adjustment_table

# A tibble: 4 × 4
  variable           role              include_in_primary_adjustment rationale
  <chr>              <chr>             <lgl>                         <chr>
1 age                likely confounder TRUE                          Affects bo…
2 injury_severity    likely confounder TRUE                          Strong dri…
3 post_treatment_lab likely mediator   FALSE                         Measured a…
4 consult_service    possible collider FALSE                         May reflec…
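A worksheet like this can also drive the adjustment set directly, so the model formula cannot drift away from the documented covariate roles. The sketch below rebuilds the same worksheet as a base R data frame and derives a treatment-model formula from it. The object names `dag_roles`, `adjustment_set`, and `ps_formula` are illustrative, and `treatment` is an assumed column name.

```r
# Same worksheet content as above, as a base R data frame (no tibble needed)
dag_roles <- data.frame(
  variable = c("age", "injury_severity", "post_treatment_lab", "consult_service"),
  role = c("likely confounder", "likely confounder",
           "likely mediator", "possible collider"),
  include_in_primary_adjustment = c(TRUE, TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Keep only the variables flagged for the primary adjustment set
adjustment_set <- dag_roles$variable[dag_roles$include_in_primary_adjustment]

# Build the treatment-model formula from the documented set
ps_formula <- stats::reformulate(adjustment_set, response = "treatment")
ps_formula
```

Deriving the formula this way means that moving a variable from "likely confounder" to "likely mediator" in the worksheet changes the model, and the change is visible in one place.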
4. Propensity Score Workflow Template
# Example:
# ps_fit <- stats::glm(
#   treatment ~ age + severity + comorbidity + mechanism,
#   data = data,
#   family = stats::binomial()
# )
#
# data$ps <- stats::predict(ps_fit, type = "response")
The purpose of the propensity score is not prediction quality by itself. Its purpose is to help achieve covariate balance relevant to the causal question (Austin 2011).
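Because the score is a means to balance and overlap rather than an end in itself, a natural first diagnostic is the distribution of the estimated score in each arm. The following sketch runs on simulated data. The data-generating values and the names `sim`, `overlap`, `common_lo`, `common_hi`, and `outside` are assumptions made for illustration.

```r
set.seed(1)
n <- 2000

# Simulated covariates and a treatment that depends on both of them
sim <- data.frame(
  age      = rnorm(n, mean = 50, sd = 10),
  severity = rnorm(n)
)
sim$treatment <- rbinom(
  n, size = 1,
  prob = plogis(-0.5 + 0.03 * (sim$age - 50) + 0.8 * sim$severity)
)

# Logistic propensity score model
ps_fit <- stats::glm(treatment ~ age + severity,
                     data = sim, family = stats::binomial())
sim$ps <- stats::predict(ps_fit, type = "response")

# Quantiles of the estimated score within each arm (rows: arm 0 and arm 1)
overlap <- t(sapply(split(sim$ps, sim$treatment), stats::quantile,
                    probs = c(0, 0.05, 0.5, 0.95, 1)))
round(overlap, 3)

# Share of units whose score lies outside the range shared by both arms
common_lo <- max(tapply(sim$ps, sim$treatment, min))
common_hi <- min(tapply(sim$ps, sim$treatment, max))
outside   <- mean(sim$ps < common_lo | sim$ps > common_hi)
outside
```

A large share of units outside the common range, or quantiles that barely overlap between arms, points to a positivity problem that no amount of model tuning will fix.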
5. Standardized Mean Difference Diagnostics
Covariate balance should be evaluated with standardized differences rather than significance tests (Austin 2009).
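For a continuous covariate, the standardized difference is the difference in group means divided by a pooled standard deviation:

\[
d = \frac{\bar{x}_{1} - \bar{x}_{0}}{\sqrt{(s_{1}^{2} + s_{0}^{2})/2}}
\]

A minimal sketch of that calculation follows. The function name `smd()` and its arguments are illustrative, and keeping the denominator unweighted is one common convention rather than the only option.

```r
# Standardized mean difference for one numeric covariate.
# `w` is an optional weight vector; when NULL, all units get weight 1.
smd <- function(x, treatment, w = NULL) {
  if (is.null(w)) w <- rep(1, length(x))
  t1 <- treatment == 1
  t0 <- treatment == 0

  # Group means, weighted when weights are supplied
  m1 <- stats::weighted.mean(x[t1], w[t1])
  m0 <- stats::weighted.mean(x[t0], w[t0])

  # Pooled SD from the unweighted groups, so the scale is the same
  # before and after adjustment
  s_pooled <- sqrt((stats::var(x[t1]) + stats::var(x[t0])) / 2)

  (m1 - m0) / s_pooled
}

# Toy check: means 4 and 2, both group variances equal to 1
smd(x = c(1, 2, 3, 3, 4, 5), treatment = c(0, 0, 0, 1, 1, 1))
# 2
```

Calling the function once without weights and once with the analysis weights gives a before-and-after comparison for each covariate. An absolute value below 0.1 is a commonly cited rule of thumb for acceptable balance, not a formal test.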
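6. Weighting Template
# Inverse probability weights for the ATE, with optional quantile-based truncation.
make_ate_weights <- function(treatment, ps, trim = NULL) {
  w <- ifelse(treatment == 1, 1 / ps, 1 / (1 - ps))
  if (!is.null(trim)) {
    # Truncate weights at the `trim` and `1 - trim` quantiles
    lo <- stats::quantile(w, probs = trim, na.rm = TRUE)
    hi <- stats::quantile(w, probs = 1 - trim, na.rm = TRUE)
    w <- pmin(pmax(w, lo), hi)
  }
  w
}

# Example:
# data$w_ate <- make_ate_weights(data$treatment, data$ps, trim = 0.01)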
Weights should always be examined for positivity problems and extreme values before outcome modeling (Hernán and Robins 2020).
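Two quick summaries cover most of that examination: the largest weight in each arm and the Kish effective sample size, \((\sum w)^2 / \sum w^2\). The helper below is a sketch with an illustrative name, and the four-unit example uses made-up propensity scores.

```r
# Per-arm weight summary: group size, largest weight, effective sample size
weight_diagnostics <- function(w, treatment) {
  by_arm <- split(w, treatment)
  data.frame(
    arm        = names(by_arm),
    n          = vapply(by_arm, length, integer(1)),
    max_weight = vapply(by_arm, max, numeric(1)),
    ess        = vapply(by_arm, function(v) sum(v)^2 / sum(v^2), numeric(1)),
    row.names  = NULL
  )
}

# Tiny made-up example: two treated and two control units
treatment <- c(1, 1, 0, 0)
ps        <- c(0.50, 0.25, 0.50, 0.20)
w         <- ifelse(treatment == 1, 1 / ps, 1 / (1 - ps))

weight_diagnostics(w, treatment)
```

An effective sample size far below the raw group size, or a handful of very large weights, suggests that a few units dominate the estimate. That is a prompt to revisit overlap, the propensity model, or the target population before any outcome model is fitted.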
7. Outcome Modeling After Weighting
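# Example:
# fit_weighted <- stats::glm(
#   outcome ~ treatment,
#   data = data,
#   family = stats::binomial(),
#   weights = w_ate
# )
#
# summary(fit_weighted)
#
# Note: the standard errors printed by summary() treat the weights as fixed
# and known. Use robust or bootstrap standard errors for inference.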
This simple model is not the whole causal argument. It is the last step in a workflow that depends on correct time alignment, reasonable covariate selection, adequate overlap, and diagnostic transparency.
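To make the inference step concrete, the sketch below estimates a weighted difference in means on simulated data and bootstraps the whole procedure, including the propensity model, so the interval reflects the fact that the weights are estimated. It uses a continuous outcome with a known true effect of 1 so the result can be checked. The data-generating values and the helper `ipw_ate()` are assumptions for illustration.

```r
set.seed(42)
n <- 5000

# Simulated data: x confounds treatment a and outcome y; true effect of a is 1
x <- rnorm(n)
a <- rbinom(n, size = 1, prob = plogis(0.8 * x))
y <- 1 * a + x + rnorm(n)
dat <- data.frame(x, a, y)

# Weighted difference in means, re-fitting the propensity model on the data given
ipw_ate <- function(d) {
  fit <- stats::glm(a ~ x, data = d, family = stats::binomial())
  ps  <- stats::predict(fit, type = "response")
  w   <- ifelse(d$a == 1, 1 / ps, 1 / (1 - ps))
  stats::weighted.mean(d$y[d$a == 1], w[d$a == 1]) -
    stats::weighted.mean(d$y[d$a == 0], w[d$a == 0])
}

# Unadjusted contrast for comparison (biased upward by confounding through x)
naive    <- mean(dat$y[dat$a == 1]) - mean(dat$y[dat$a == 0])
estimate <- ipw_ate(dat)

# Nonparametric bootstrap over rows; 200 replicates keeps the example fast
boot <- replicate(200, ipw_ate(dat[sample.int(n, replace = TRUE), ]))
ci   <- stats::quantile(boot, probs = c(0.025, 0.975))

c(naive = naive, ipw = estimate, lower = ci[[1]], upper = ci[[2]])
```

In this setup the unadjusted contrast sits well above 1 while the weighted estimate lands near it. That only shows the estimator behaves as intended when the propensity model is correct and overlap is adequate, which is exactly what the earlier diagnostics are meant to probe.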
8. Matching Template
Some applications are better served by matching than by weighting.
The right matching strategy depends on the estimand, overlap, and tolerance for bias-variance tradeoffs.
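As a concrete starting point, the sketch below implements greedy 1:1 nearest-neighbor matching without replacement on the logit of the propensity score, with a caliper. It is a base R illustration under stated assumptions, not a replacement for a dedicated matching package: the function name `greedy_match()` and the default caliper of 0.2 standard deviations are choices made for this example.

```r
# Greedy 1:1 nearest-neighbor matching on the logit propensity score.
# Returns a two-column matrix of row indices (treated, control).
greedy_match <- function(ps, treatment, caliper_sd = 0.2) {
  lp        <- stats::qlogis(ps)
  caliper   <- caliper_sd * stats::sd(lp)
  treated   <- which(treatment == 1)
  available <- which(treatment == 0)
  pairs     <- list()

  for (i in treated) {
    if (length(available) == 0) break
    d <- abs(lp[available] - lp[i])
    j <- which.min(d)
    # Accept the closest control only if it falls within the caliper
    if (d[j] <= caliper) {
      pairs[[length(pairs) + 1]] <- c(treated = i, control = available[j])
      available <- available[-j]  # without replacement
    }
  }

  if (length(pairs) == 0) {
    return(matrix(integer(0), ncol = 2,
                  dimnames = list(NULL, c("treated", "control"))))
  }
  do.call(rbind, pairs)
}

# Made-up example: two treated units, three controls, one control far away
ps        <- c(0.30, 0.70, 0.31, 0.69, 0.05)
treatment <- c(1, 1, 0, 0, 0)
greedy_match(ps, treatment)
```

Matching controls to treated units targets an ATT-type contrast, and any treated unit dropped by the caliper shifts the estimand toward the matched subset. That is the bias-variance tradeoff in practice: a tighter caliper improves balance but discards units and changes who the estimate describes.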
9. Sensitivity Analysis Prompts
A causal workflow should always state what could still go wrong after adjustment.
Minimum prompts:
Could an important confounder remain unmeasured?
Could treatment timing create immortal time bias?
Could post-treatment variables have entered the adjustment set?
Could selection into the analytic sample distort the estimand?
Is the effect likely transportable to the target population?
When appropriate, an E-value can be used as one sensitivity summary for unmeasured confounding (VanderWeele and Ding 2017).
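For a risk ratio estimate, the E-value has a closed form:

\[
E = RR + \sqrt{RR \times (RR - 1)}
\]

where a protective estimate (\(RR < 1\)) is first inverted. A minimal sketch, with an illustrative function name:

```r
# E-value for a risk ratio point estimate.
# Estimates below 1 are inverted first so the formula applies to RR >= 1.
e_value <- function(rr) {
  stopifnot(is.numeric(rr), all(rr > 0))
  rr_star <- ifelse(rr < 1, 1 / rr, rr)
  rr_star + sqrt(rr_star * (rr_star - 1))
}

e_value(2)    # 2 + sqrt(2), about 3.41
e_value(0.5)  # same value, because 0.5 is inverted to 2
```

The result is read as the minimum strength of association, on the risk ratio scale, that an unmeasured confounder would need with both treatment and outcome to fully explain away the estimate. It summarizes one source of bias only and says nothing about selection or measurement problems.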
10. Reviewer-Facing Reporting Language
Use language like this in methods or appendices:
The primary analysis was designed to estimate a causal contrast from observational data rather than to optimize prediction. We first defined the target estimand and time zero, then selected adjustment variables based on subject-matter reasoning and DAG-oriented thinking. Propensity score methods were used to improve measured covariate balance, and balance was assessed with standardized differences rather than significance tests (Hernán and Robins 2020; Rosenbaum and Rubin 1983; Austin 2009).
11. Minimum Reporting Checklist
treatment, outcome, target population, and time zero defined
estimand stated explicitly
covariate roles documented
positivity and overlap considered
balance diagnostics shown
weighting or matching decisions justified
residual bias and sensitivity limits discussed
predictive claims separated from causal claims
Note
Where This Shows Up in AI/ML
The single most important methodological distinction in military health AI is between a model built to predict outcomes and a model built to inform interventions — the former requires only predictive validity, the latter requires causal identification. MAVEN decision support tools that recommend treatment protocols based on predicted outcome improvements are implicitly causal claims, and deploying them without causal validation is the intervention fallacy at scale. The target trial emulation framework, applied to DoDTR linked with MHS GENESIS, is the rigorous path to intervention-informing evidence from observational military health data. A prediction model embedded in a clinical workflow without causal grounding will optimize for the observable patterns in its training data — including patterns driven by physician judgment, resource availability, and documentation practice — not for the treatment effects that actually improve outcomes.
12. Closing
Causal inference in observational data is not a matter of choosing a fashionable estimator. It is a matter of disciplined design logic, transparent assumptions, and careful diagnostics.
A good causal toolkit does not promise certainty. It makes the causal argument easier to inspect.
Series Callout
Note
This post is part of a broader Toolkit Series for Applied Statistics, AI, and Clinical Analytics.
References
Austin, Peter C. 2009. “Balance Diagnostics for Comparing the Distribution of Baseline Covariates Between Treatment Groups in Propensity-Score Matched Samples.” Statistics in Medicine 28 (25): 3083–107. https://doi.org/10.1002/sim.3697.
Austin, Peter C. 2011. “An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies.” Multivariate Behavioral Research 46 (3): 399–424. https://doi.org/10.1080/00273171.2011.568786.
Hernán, Miguel A., and James M. Robins. 2020. Causal Inference: What If. Chapman & Hall/CRC.
Pearl, Judea. 2009. Causality: Models, Reasoning, and Inference. 2nd ed. Cambridge University Press.
Rosenbaum, Paul R., and Donald B. Rubin. 1983. “The Central Role of the Propensity Score in Observational Studies for Causal Effects.” Biometrika 70 (1): 41–55. https://doi.org/10.1093/biomet/70.1.41.
Rubin, Donald B. 1974. “Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies.” Journal of Educational Psychology 66 (5): 688–701. https://doi.org/10.1037/h0037350.
VanderWeele, Tyler J., and Peng Ding. 2017. “Sensitivity Analysis in Observational Research: Introducing the E-Value.” Annals of Internal Medicine 167 (4): 268–74. https://doi.org/10.7326/M16-2607.