Causal Inference: Potential Outcomes & Propensity Scores

Advanced Statistics for AI & Clinical Decision-Making — Lecture 2 of 4

Jonathan D. Stallings, PhD, MS

Data InDeed | dataindeed.org

2026-01-01

Correlation tells you what happened together. Causation tells you what would happen if you intervened.

What You’ll Learn Today

Post 04 — Causal Inference Foundations

  • The fundamental problem of causal inference
  • Rubin’s potential outcomes framework
  • ATE, ATT — what are we actually estimating?
  • Why observational data breaks the easy comparison
  • Confounding as assignment mechanism bias

Post 05 — Propensity Scores

  • The propensity score theorem
  • Overlap and positivity
  • Matching vs. weighting vs. stratification
  • Balance diagnostics (SMD, Love plots)
  • Limitations: unmeasured confounding

Part 1

The Potential Outcomes Framework

Every causal question is a counterfactual question

The Fundamental Problem of Causal Inference

For each patient \(i\), define:

  • \(Y_i(1)\) = outcome if treated
  • \(Y_i(0)\) = outcome if untreated

Individual causal effect: \(\tau_i = Y_i(1) - Y_i(0)\)

We observe only one. The other is the counterfactual.
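A toy illustration of the missing-data structure may help here. The sketch below is not from the lecture: the seed, the sample size, and the constant effect of -2 are assumptions chosen only to make the table easy to read.

```r
set.seed(1)  # arbitrary seed, used only so this sketch is repeatable
n  <- 6
y0 <- round(rnorm(n, 10, 2), 1)   # Y_i(0): outcome if untreated
y1 <- y0 - 2                      # Y_i(1): outcome if treated (assumed constant effect of -2)
a  <- rbinom(n, 1, 0.5)           # treatment actually received

# Consistency: the observed outcome is the potential outcome
# under the treatment that was actually received
y_obs <- ifelse(a == 1, y1, y0)

po <- data.frame(
  id    = 1:n,
  a     = a,
  y0    = ifelse(a == 0, y0, NA),  # Y(0) is unobserved for treated patients
  y1    = ifelse(a == 1, y1, NA),  # Y(1) is unobserved for untreated patients
  y_obs = y_obs
)
print(po)
```

Every row has exactly one NA among `y0` and `y1`, so the individual effect \(\tau_i\) can never be computed from observed data alone.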

What we want:

\[\text{ATE} = E[Y(1) - Y(0)] = E[Y(1)] - E[Y(0)]\]

What observational data gives us:

\[E[Y \mid A=1] - E[Y \mid A=0]\]

These differ whenever treatment assignment is related to the potential outcomes, for example when sicker patients are more likely to be treated.

Trauma example:

Damage control surgery (DCS) vs. definitive surgery.

Sicker patients receive DCS. Naive comparison: DCS patients have worse outcomes.

Is that DCS’s effect — or just selection of sicker patients into DCS?
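The gap between the naive comparison and a causal effect can be written out explicitly. Assuming consistency, \(Y = A\,Y(1) + (1-A)\,Y(0)\), adding and subtracting \(E[Y(0) \mid A=1]\) gives:

\[E[Y \mid A=1] - E[Y \mid A=0] = \underbrace{E[Y(1) - Y(0) \mid A=1]}_{\text{effect in the treated}} + \underbrace{E[Y(0) \mid A=1] - E[Y(0) \mid A=0]}_{\text{selection bias}}\]

In the trauma example, the second term is nonzero whenever patients selected for DCS would have had worse outcomes than the definitive-surgery group even without DCS.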

Randomization Solves the Problem

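library(tidyverse)  # tibble(), mutate(), pivot_longer(), ggplot()

# No seed is set here, so a rerun will not reproduce the printed table exactly
n <- 500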
# Simulated confounder: severity
severity <- rnorm(n, 0, 1)

# RCT: treatment assigned independently
trt_rct  <- rbinom(n, 1, 0.5)
y_rct    <- 10 - 2*trt_rct + 1.5*severity + rnorm(n, 0, 2)

# Observational: treatment correlated with severity
trt_obs  <- rbinom(n, 1, plogis(-1 + 2*severity))
y_obs    <- 10 - 2*trt_obs + 1.5*severity + rnorm(n, 0, 2)

# Naive estimates
naive_rct <- coef(lm(y_rct ~ trt_rct))["trt_rct"]
naive_obs <- coef(lm(y_obs ~ trt_obs))["trt_obs"]
adj_obs   <- coef(lm(y_obs ~ trt_obs + severity))["trt_obs"]

tibble(
  Estimator    = c("RCT (naive)", "Observational (naive)", "Observational (adjusted)"),
  Estimate     = round(c(naive_rct, naive_obs, adj_obs), 3),
  `True ATE`   = -2
)
# A tibble: 3 × 3
  Estimator                Estimate `True ATE`
  <chr>                       <dbl>      <dbl>
1 RCT (naive)                -2.04          -2
2 Observational (naive)      -0.165         -2
3 Observational (adjusted)   -1.83          -2

Randomization makes \(E[Y(a) \mid A=1] = E[Y(a) \mid A=0]\) for \(a \in \{0, 1\}\): in expectation, the groups are comparable on everything, observed and unobserved. Observational data breaks this. Adjustment recovers comparability, but only for measured confounders.
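A single simulated dataset can land close to or far from the truth by chance. The sketch below repeats the same data-generating process many times and averages each estimator; the seed, the number of replications, and the helper name `one_rep` are assumptions introduced for illustration.

```r
set.seed(2026)  # arbitrary seed for this sketch

# One replication of the RCT and observational designs (true effect = -2)
one_rep <- function(n = 500) {
  severity <- rnorm(n)
  trt_rct  <- rbinom(n, 1, 0.5)                         # independent of severity
  trt_obs  <- rbinom(n, 1, plogis(-1 + 2 * severity))   # depends on severity
  y_rct <- 10 - 2 * trt_rct + 1.5 * severity + rnorm(n, 0, 2)
  y_obs <- 10 - 2 * trt_obs + 1.5 * severity + rnorm(n, 0, 2)
  c(rct       = unname(coef(lm(y_rct ~ trt_rct))["trt_rct"]),
    obs_naive = unname(coef(lm(y_obs ~ trt_obs))["trt_obs"]),
    obs_adj   = unname(coef(lm(y_obs ~ trt_obs + severity))["trt_obs"]))
}

sims <- replicate(200, one_rep())   # 3 x 200 matrix, one column per replication
avg  <- rowMeans(sims)              # average estimate per estimator
round(avg, 2)
```

Averaged over replications, the RCT and adjusted estimates should sit near -2, while the naive observational estimate stays biased toward zero no matter how many replications are run.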

The Three Estimands

Estimand | Definition | Clinical translation
---------|------------|---------------------
ATE | \(E[Y(1) - Y(0)]\) | Population-average effect if everyone were treated vs. no one
ATT | \(E[Y(1) - Y(0) \mid A=1]\) | Effect in those who actually received treatment
ATU | \(E[Y(1) - Y(0) \mid A=0]\) | Effect if untreated patients had been treated

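Which estimand to use: For policy decisions (“should we change the protocol?”) → ATE. For understanding current practice (“does DCS help the patients who get it?”) → ATT. Propensity score matching that finds controls for each treated patient typically estimates the ATT. Inverse probability of treatment weighting (IPTW), with inverse-probability weights applied to both arms, estimates the ATE.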
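The three estimands only differ when the treatment effect varies across patients and treatment is selective. The sketch below simulates both potential outcomes directly, which is possible only in a simulation; the effect function `tau` and all coefficients are hypothetical.

```r
set.seed(11)  # arbitrary seed for this sketch
n        <- 1e5
severity <- rnorm(n)
a        <- rbinom(n, 1, plogis(-1 + 2 * severity))  # sicker patients treated more often
y0       <- 10 + 1.5 * severity + rnorm(n, 0, 2)     # outcome if untreated
tau      <- -2 - 1 * severity    # hypothetical: effect is larger in magnitude for sicker patients
y1       <- y0 + tau             # outcome if treated

ate <- mean(y1 - y0)             # everyone
att <- mean((y1 - y0)[a == 1])   # those actually treated
atu <- mean((y1 - y0)[a == 0])   # those actually untreated
p_treated <- mean(a)

round(c(ATE = ate, ATT = att, ATU = atu), 2)

# The ATE is the treated-share-weighted average of ATT and ATU
all.equal(ate, p_treated * att + (1 - p_treated) * atu)
```

Because treated patients are sicker on average and the assumed effect grows with severity, the ATT here is more negative than the ATE, and the ATU less so. Which number answers the clinical question depends on which population the decision applies to.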

Part 2

Propensity Scores

Balancing treatment groups on observed covariates

The Propensity Score Theorem

\[e(X) = P(A=1 \mid X)\]

Rosenbaum & Rubin (1983): If treatment is unconfounded given \(X\), it is also unconfounded given \(e(X)\) alone.

Dimension reduction: balance on 1 score instead of all \(p\) covariates.
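Stated in symbols, the result has two parts. The propensity score is a balancing score, and unconfoundedness given \(X\) carries over to unconfoundedness given \(e(X)\):

\[X \perp A \mid e(X)\]

\[Y(a) \perp A \mid X \;\Longrightarrow\; Y(a) \perp A \mid e(X)\]

The first line is what balance diagnostics check empirically: among patients with the same propensity score, covariate distributions should look alike in the two treatment groups.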

Key assumptions required:

  1. Unconfoundedness (ignorability): \(Y(a) \perp A \mid X\), i.e., no unmeasured confounders
  2. Positivity (overlap): \(0 < e(X) < 1\), i.e., every patient could plausibly receive either treatment
  3. SUTVA (stable unit treatment value assumption): one version of treatment, no interference between patients

Estimating the Propensity Score

n <- 600
df_ps <- tibble(
  age      = rnorm(n, 38, 14),
  iss      = rnorm(n, 28, 12),
  sbp      = rnorm(n, 108, 24),
  mechanism = rbinom(n, 1, 0.45)  # penetrating vs blunt
) |> mutate(
  # Treatment (DCS) more likely for high ISS, low SBP
  ps_true  = plogis(-1.5 + 0.06*iss - 0.015*sbp + 0.4*mechanism),
  treated  = rbinom(n, 1, ps_true),
  ps_hat   = predict(glm(treated ~ age + iss + sbp + mechanism,
                         family=binomial), type="response")
)

# Overlap plot
df_ps |>
  mutate(Group = ifelse(treated==1, "Treated (DCS)", "Control (Definitive)")) |>
  ggplot(aes(ps_hat, fill=Group, color=Group)) +
  geom_density(alpha=0.4, linewidth=0.8) +
  scale_fill_manual(values=c("#2563eb","#e63946")) +
  scale_color_manual(values=c("#2563eb","#e63946")) +
  labs(title="Propensity score overlap — are distributions compatible?",
       x="Estimated propensity score P(DCS | X)", y="Density", fill=NULL, color=NULL) +
  theme_di()  # custom ggplot2 theme, not defined in this deck; theme_minimal() is a stand-in

Good overlap: the treated and control distributions cover much the same range of scores. Poor overlap: score regions where only one group has appreciable density → positivity is violated in practice there, and estimates in the non-overlapping region rest on extrapolation.
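A density plot can be complemented with a numeric check. The sketch below computes a simple common-support interval and the share of patients outside it; the two-covariate model, the seed, and the min/max rule are assumptions made for illustration, not the lecture's procedure.

```r
set.seed(5)  # arbitrary seed for this sketch
n   <- 600
iss <- rnorm(n, 28, 12)
sbp <- rnorm(n, 108, 24)
treated <- rbinom(n, 1, plogis(-1.5 + 0.06 * iss - 0.015 * sbp))
ps_hat  <- fitted(glm(treated ~ iss + sbp, family = binomial))

# Range of estimated scores within each arm ("0" = control, "1" = treated)
rng <- lapply(split(ps_hat, treated), range)

# Common support: larger of the two minima to smaller of the two maxima
lo <- max(rng[["0"]][1], rng[["1"]][1])
hi <- min(rng[["0"]][2], rng[["1"]][2])

outside <- ps_hat < lo | ps_hat > hi   # patients with no counterpart in the other arm
c(lower = lo, upper = hi, prop_outside = mean(outside))
```

A large `prop_outside` would suggest that the comparison is being made for a narrower population than the full registry, which is worth stating when results are reported.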

Balance Diagnostics: SMD Before vs. After Weighting

# IPTW weights
df_ps <- df_ps |> mutate(
  iptw = ifelse(treated==1, 1/ps_hat, 1/(1-ps_hat))
)

# Compute SMD before and after
smd <- function(x, trt, w=NULL) {
  if (is.null(w)) w <- rep(1, length(x))
  m1 <- weighted.mean(x[trt==1], w[trt==1])
  m0 <- weighted.mean(x[trt==0], w[trt==0])
  # Pooled SD from the unweighted sample, so Before and After share one denominator
  s  <- sqrt((var(x[trt==1]) + var(x[trt==0]))/2)
  abs(m1 - m0) / s
}

vars <- c("age","iss","sbp","mechanism")
before <- sapply(vars, function(v) smd(df_ps[[v]], df_ps$treated))
after  <- sapply(vars, function(v) smd(df_ps[[v]], df_ps$treated, df_ps$iptw))

smd_wide <- tibble(Variable=vars, Before=before, After=after)
smd_wide |>
  pivot_longer(-Variable) |>
  left_join(select(smd_wide, Variable, Before), by="Variable") |>
  ggplot(aes(value, reorder(Variable, -Before), color=name, shape=name)) +
  geom_point(size=4) +
  geom_vline(xintercept=0.1, linetype=2, color="#e63946") +
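  # Name the colors: unnamed values follow alphabetical order (After, Before),
  # which would swap the colors relative to the plot title
  scale_color_manual(values=c(Before="#94a3b8", After="#22d3ee")) +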
  labs(title="Love plot: SMD before (gray) and after IPTW (teal) — target < 0.10",
       x="Standardized Mean Difference", y=NULL, color=NULL, shape=NULL) +
  theme_di()  # custom ggplot2 theme, not defined in this deck; theme_minimal() is a stand-in

SMD < 0.1 is the conventional balance threshold. After IPTW, all variables are well-balanced — the groups are now comparable on measured confounders.
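Balance is a means to an end: once the weights balance the covariates, they can be used to estimate the treatment effect. The sketch below is a minimal weighted difference in means on simulated data with a known effect of -2. The single-confounder data-generating process, its coefficients, and the seed are assumptions, and the effective sample size line is an added diagnostic rather than something taken from the lecture.

```r
set.seed(42)  # arbitrary seed for this sketch
n        <- 2000
severity <- rnorm(n)
trt      <- rbinom(n, 1, plogis(-0.5 + severity))   # sicker patients treated more often
y        <- 10 - 2 * trt + 1.5 * severity + rnorm(n, 0, 2)

# Estimated propensity score and ATE-targeting IPTW weights
ps_hat <- fitted(glm(trt ~ severity, family = binomial))
w      <- ifelse(trt == 1, 1 / ps_hat, 1 / (1 - ps_hat))

naive <- mean(y[trt == 1]) - mean(y[trt == 0])
iptw  <- weighted.mean(y[trt == 1], w[trt == 1]) -
         weighted.mean(y[trt == 0], w[trt == 0])

# Kish effective sample size: how many equally weighted patients the weighted sample is worth
ess <- sum(w)^2 / sum(w^2)

round(c(naive = naive, iptw = iptw, ess = ess), 2)
```

The point estimate alone is not enough for inference: standard errors should account for the weights, for example with a robust (sandwich) variance or a bootstrap that re-estimates the propensity score.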

Matching vs. Weighting vs. Stratification

Method | Estimand | Strengths | Weaknesses
-------|----------|-----------|-----------
Matching | ATT | Transparent, interpretable | Loses unmatched controls
IPTW | ATE | Uses all data | Extreme weights → variance inflation
Trimmed IPTW | Approximate ATE | Stabilizes variance | Some estimand ambiguity
Stratification | ATE | Simple, robust | Residual confounding within strata
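Of the methods in the table, stratification is the easiest to write out by hand. The sketch below uses five strata defined by quintiles of the estimated score, then averages the within-stratum differences weighted by stratum size. The data-generating process, the seed, and the choice of five strata are assumptions for illustration.

```r
set.seed(3)  # arbitrary seed for this sketch
n        <- 5000
severity <- rnorm(n)
trt      <- rbinom(n, 1, plogis(-0.5 + severity))
y        <- 10 - 2 * trt + 1.5 * severity + rnorm(n, 0, 2)   # true effect = -2
ps_hat   <- fitted(glm(trt ~ severity, family = binomial))

# Assign each patient to a quintile of the estimated propensity score
stratum <- cut(ps_hat, quantile(ps_hat, 0:5 / 5),
               include.lowest = TRUE, labels = FALSE)

# Difference in mean outcome within each stratum
diffs <- sapply(split(data.frame(y, trt), stratum),
                function(d) mean(d$y[d$trt == 1]) - mean(d$y[d$trt == 0]))

# Weight strata by size to target the ATE
sizes     <- as.numeric(table(stratum))
strat_ate <- sum(diffs * sizes) / sum(sizes)
naive     <- mean(y[trt == 1]) - mean(y[trt == 0])

round(c(naive = naive, stratified = strat_ate), 2)
```

The stratified estimate usually lands much closer to -2 than the naive one but not exactly on it, which is the residual within-stratum confounding listed as the method's weakness.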

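Registry practice: IPTW with stabilized weights is the workhorse for comparative effectiveness in trauma registries. Stabilized weights multiply each inverse-probability weight by the marginal probability of the treatment actually received: \(\frac{P(A=1)}{e(X)}\) for treated patients and \(\frac{P(A=0)}{1 - e(X)}\) for controls. This rescales the weights to average about 1 and reduces their variability, but it does not cap extreme values; truncation does that. Truncate weights above the 99th percentile and always check sensitivity to the truncation threshold.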
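The weighting steps described above can be sketched as follows. The simulated data and seed are assumptions, and the sketch caps weights at the 99th percentile rather than dropping patients, which is one common reading of trimming; the variable names are illustrative.

```r
set.seed(7)  # arbitrary seed for this sketch
n        <- 2000
severity <- rnorm(n)
trt      <- rbinom(n, 1, plogis(-1 + 2 * severity))
ps_hat   <- fitted(glm(trt ~ severity, family = binomial))

p_trt <- mean(trt)   # marginal probability of treatment, P(A = 1)

# Unstabilized and stabilized IPTW weights
w  <- ifelse(trt == 1, 1 / ps_hat, 1 / (1 - ps_hat))
sw <- ifelse(trt == 1, p_trt / ps_hat, (1 - p_trt) / (1 - ps_hat))

# Cap stabilized weights at their 99th percentile
cap      <- quantile(sw, 0.99)
sw_trunc <- pmin(sw, cap)

# Compare the mean and maximum of each set of weights
sapply(list(unstabilized = w, stabilized = sw, truncated = sw_trunc),
       function(x) round(c(mean = mean(x), max = max(x)), 2))
```

Rerunning the last steps with other caps, such as the 95th or 99.5th percentile, and comparing the resulting effect estimates is a simple form of the sensitivity check.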

The Limitation That Cannot Be Engineered Away

Propensity scores only balance measured confounders.

If an unmeasured variable determines both treatment and outcome, propensity score methods cannot remove the bias — no matter how well the model fits.

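E-value approach (VanderWeele & Ding, 2017): For an observed risk ratio \(\text{RR} \ge 1\) (for \(\text{RR} < 1\), first replace RR with \(1/\text{RR}\)), the minimum strength of association, on the risk-ratio scale, that an unmeasured confounder would need to have with both treatment and outcome to fully explain away the result is: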

\[\text{E-value} = \text{RR} + \sqrt{\text{RR} \cdot (\text{RR} - 1)}\]

A large E-value means strong robustness. A small E-value means a modest unmeasured confounder could explain the entire result.
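The formula is simple enough to wrap in a small helper. The function name `e_value` is hypothetical, and the sketch covers only the point estimate of a risk ratio, not the confidence limit.

```r
# E-value for an observed risk ratio (point estimate only)
e_value <- function(rr) {
  stopifnot(is.numeric(rr), all(rr > 0))
  rr <- ifelse(rr < 1, 1 / rr, rr)   # protective effects: work with 1/RR
  rr + sqrt(rr * (rr - 1))
}

e_value(2)     # 2 + sqrt(2), about 3.41
e_value(0.5)   # same value as e_value(2)
e_value(1)     # 1: a null result needs no confounding to explain it
```

For RR = 2, an unmeasured confounder would need a risk ratio of about 3.41 with both treatment and outcome to move the estimate to the null; weaker confounding could not do so on its own.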

Lecture 2 — Key Takeaways

Potential Outcomes

  • Every causal question is a counterfactual question
  • ATE vs. ATT — choose before analysis, not after
  • Randomization solves the fundamental problem; observational data requires assumptions
  • Adjustment works only for measured confounders

Propensity Scores

  • \(e(X) = P(A=1 \mid X)\) — logistic regression is standard
  • Matching → ATT; IPTW → ATE

Diagnostics That Cannot Be Skipped

  • Overlap plot — positivity check
  • Love plot (SMD) — balance check
  • SMD < 0.10 after weighting is the target
  • Sensitivity to weight trimming

The meta-lesson: Propensity scores are a tool for making observational data less wrong — not for making it equivalent to a randomized trial. Know the assumptions. Report them. Quantify sensitivity.

Coming Up: Lecture 3

Identification Strategies — When Standard Adjustment Fails

Posts 06, 07 & 08:

  • Instrumental variables — causal ID under unmeasured confounding
  • DAGs & confounding — graphical causal reasoning, E-values
  • Target trial emulation — designing observational studies like trials