Propensity Scores: Balancing the Scales in Causal AI

A practical guide to propensity scores, overlap, matching, weighting, and balance diagnostics for causal analyses in observational data.

Published

December 1, 2025

Modified

June 9, 2026

Executive Summary

Observational data are tempting because they are abundant, practical, and often closer to real-world decision settings than tightly controlled experiments.

But they come with a major problem:

treatment groups are usually not comparable at baseline.

Patients who receive one therapy rather than another may differ in:

severity,
age,
comorbidity burden,
access to care,
clinician preference,
or site-level practice patterns.

Those differences can confound the treatment-outcome relationship.

This is where propensity score methods become useful.

A propensity score is the probability of receiving treatment given observed covariates. It compresses many confounders into a single balancing score that can be used to:

match treated and untreated patients,
weight observations,
or stratify the sample.

This post introduces:

estimating propensity scores,
matching,
weighting,
stratification,
overlap,
and covariate balance diagnostics,

using a healthcare-style comparative effectiveness example.

Propensity scores matter because in observational data, treatment groups rarely begin on equal footing, and causal analysis needs a way to rebalance the comparison.

## Propensity Scores Begin with the Confounding Problem

In a randomized experiment, treatment assignment is designed to be unrelated to the patient’s potential outcomes, at least in expectation.

In observational data, treatment assignment is usually not random.

It is shaped by real processes such as:

case mix,
severity,
access,
preference,
and clinician judgment.

That means the crude treatment comparison may reflect both:

the causal effect of treatment,
and baseline differences between who did and did not receive it.
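One way to make that split explicit is the standard potential-outcomes decomposition. The notation \(Y(1)\) and \(Y(0)\) for the outcomes a patient would have with and without treatment is an assumption of this sketch, as is the usual consistency condition that the observed outcome equals the potential outcome under the treatment actually received.

```latex
\[
E[Y \mid T = 1] - E[Y \mid T = 0]
  = \underbrace{E[Y(1) - Y(0) \mid T = 1]}_{\text{effect among the treated}}
  + \underbrace{E[Y(0) \mid T = 1] - E[Y(0) \mid T = 0]}_{\text{baseline difference}}
\]
```

The second term is zero when the groups would have had the same outcomes without treatment, which is what randomization delivers in expectation.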

Propensity score methods are designed to reduce that imbalance using observed covariates.

They do not solve unmeasured confounding, but they can help create a more comparable treated and untreated sample with respect to measured variables.

## The Propensity Score Is a Treatment Probability

Formally, the propensity score is:

\[ e(X) = P(T = 1 \mid X) \]

where:

\(T\) is treatment assignment (1 = treated, 0 = untreated),
\(X\) is the vector of observed covariates.

This quantity matters because of a key result:

if treatment assignment is ignorable given the covariates, then it is also ignorable given the propensity score.

That is why the propensity score is called a balancing score (Rosenbaum and Rubin 1983).
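Written out, the two results from Rosenbaum and Rubin (1983) behind this statement are roughly the following. The potential-outcome symbols \(Y(0)\) and \(Y(1)\) are assumed notation, and \(\perp\) denotes independence.

```latex
% Balancing property: within levels of e(X), covariates are unrelated to treatment
\[ X \perp T \mid e(X) \]

% Ignorability carries over from X to e(X), provided overlap holds
\[
\bigl(Y(0), Y(1)\bigr) \perp T \mid X
\ \text{ and } \ 0 < e(X) < 1
\;\Longrightarrow\;
\bigl(Y(0), Y(1)\bigr) \perp T \mid e(X)
\]
```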

Rather than matching or adjusting on many covariates directly, the analyst can work through the probability of treatment conditional on those covariates.

That is the central simplification.

## A Healthcare-Style Example Makes the Problem Concrete

To illustrate, we will simulate an observational comparative effectiveness dataset in which sicker patients are more likely to receive treatment.

This is exactly the kind of setting where naïve comparisons can be misleading.

library(dplyr)
library(tibble)
library(ggplot2)

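# No seed is set, so a rerun will not reproduce the printed outputs exactly;
# call set.seed() first if reproducibility matters.
n <- 800

# Four baseline covariates, drawn independently
ps_df <- tibble::tibble(
  age = rnorm(n, mean = 62, sd = 12),
  severity = rnorm(n, mean = 0, sd = 1),
  comorbidity = rnorm(n, mean = 0, sd = 1),
  prior_utilization = rnorm(n, mean = 0, sd = 1)
) |>
  dplyr::mutate(
    # True treatment probability rises with age, severity, comorbidity, and utilization
    treat_prob = plogis(-0.8 + 0.03 * age + 1.2 * severity + 0.7 * comorbidity + 0.5 * prior_utilization),
    treatment = rbinom(n, size = 1, prob = treat_prob),
    # Potential outcomes: y0 if untreated, y1 if treated (constant effect of +3)
    y0 = 50 - 0.2 * age - 4 * severity - 2.5 * comorbidity - 1.5 * prior_utilization + rnorm(n, 0, 4),
    y1 = y0 + 3,
    # Only the potential outcome under the received treatment is observed
    outcome = if_else(treatment == 1, y1, y0)
  )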

ps_df |>
  dplyr::summarise(
    treatment_rate = mean(treatment)
  )

# A tibble: 1 × 1
  treatment_rate
           <dbl>
1          0.694

Here, treated patients are more likely to have higher baseline severity and comorbidity burden.

That creates confounding by indication.

## The Naïve Treatment Comparison Is Usually Biased

Before using propensity scores, it helps to see the crude observational comparison.

crude_tbl <- ps_df |>
  dplyr::group_by(treatment) |>
  dplyr::summarise(
    mean_outcome = mean(outcome),
    .groups = "drop"
  ) |>
  tidyr::pivot_wider(
    names_from = treatment,
    values_from = mean_outcome
  ) |>
  dplyr::mutate(
    crude_difference = `1` - `0`
  )

crude_tbl

# A tibble: 1 × 3
    `0`   `1` crude_difference
  <dbl> <dbl>            <dbl>
1  41.4  38.5            -2.92

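Treatment is beneficial by construction: the data-generating process sets y1 = y0 + 3, so the true effect is +3 for every patient. The crude difference is -2.92, which is not merely biased downward but has the wrong sign, because the treated patients were sicker to begin with.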
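Because the data are simulated, both potential outcomes are available and the size of the bias can be measured directly. The sketch below re-creates the simulation in base R. The seed and the larger sample size are assumptions, so its numbers will not match the printed output.

```r
# Base-R re-creation of the post's simulation (assumed seed and sample size)
set.seed(2025)
n <- 2000
age <- rnorm(n, mean = 62, sd = 12)
severity <- rnorm(n)
comorbidity <- rnorm(n)
prior_utilization <- rnorm(n)
treat_prob <- plogis(-0.8 + 0.03 * age + 1.2 * severity +
                       0.7 * comorbidity + 0.5 * prior_utilization)
treatment <- rbinom(n, size = 1, prob = treat_prob)
y0 <- 50 - 0.2 * age - 4 * severity - 2.5 * comorbidity -
  1.5 * prior_utilization + rnorm(n, 0, 4)
y1 <- y0 + 3
outcome <- ifelse(treatment == 1, y1, y0)

# True average effect: knowable only because both potential outcomes are simulated
true_ate <- mean(y1 - y0)

# Crude comparison of observed group means
crude <- mean(outcome[treatment == 1]) - mean(outcome[treatment == 0])

# Gap in untreated potential outcomes between the two groups
baseline_gap <- mean(y0[treatment == 1]) - mean(y0[treatment == 0])

round(c(true_ate = true_ate, crude = crude,
        bias = crude - true_ate, baseline_gap = baseline_gap), 2)
```

With a constant treatment effect, the bias of the crude estimate equals the baseline gap exactly: the treated group would have had worse outcomes even without treatment.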

This is the confounding problem propensity score methods are trying to address.

## Logistic Regression Is a Common Way to Estimate Propensity Scores

A standard way to estimate propensity scores is logistic regression.

The treatment indicator is the outcome, and baseline covariates are the predictors.

ps_model <- glm(
  treatment ~ age + severity + comorbidity + prior_utilization,
  data = ps_df,
  family = binomial()
)

ps_df <- ps_df |>
  dplyr::mutate(
    ps = predict(ps_model, type = "response")
  )

ps_df |>
  dplyr::summarise(
    min_ps = min(ps),
    mean_ps = mean(ps),
    max_ps = max(ps)
  )

# A tibble: 1 × 3
   min_ps mean_ps max_ps
    <dbl>   <dbl>  <dbl>
1 0.00778   0.694  0.999

This gives each observation a predicted probability of treatment given the measured covariates.
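To see what `predict(..., type = "response")` returns, the following sketch rebuilds the scores by hand on a small toy dataset. The ten values are hypothetical and chosen only for illustration.

```r
# Toy data (hypothetical values)
toy <- data.frame(
  treatment = c(0, 0, 1, 0, 1, 1, 0, 1, 1, 1),
  severity  = c(-1.2, -0.5, -0.3, 0.1, 0.2, 0.6, 0.8, 1.0, 1.4, 1.9)
)

fit <- glm(treatment ~ severity, data = toy, family = binomial())

# Propensity scores as returned by predict()
ps_predict <- predict(fit, type = "response")

# The same scores by hand: inverse logit of the linear predictor
design <- model.matrix(fit)
ps_manual <- plogis(drop(design %*% coef(fit)))

all.equal(unname(ps_predict), unname(ps_manual))

# For a logistic model with an intercept, the mean fitted score equals
# the observed treatment rate
c(mean_ps = mean(ps_predict), treatment_rate = mean(toy$treatment))
```

The last line reflects a general property of maximum-likelihood logistic regression with an intercept, which is why `mean_ps` in the output above coincides with the treatment rate of 0.694.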

That probability becomes the basis for matching, weighting, or stratification.

## Overlap Is One of the Most Important Diagnostics

A crucial assumption in propensity score work is overlap, also called positivity.

This means that across the relevant covariate space, units should have a nonzero probability of receiving either treatment.

If treated and untreated groups occupy very different parts of the covariate space, then causal comparison becomes unstable or impossible in those regions.

A simple way to inspect overlap is to compare the propensity score distributions by treatment group.

ggplot2::ggplot(ps_df, ggplot2::aes(x = ps, fill = factor(treatment))) +
  ggplot2::geom_density(alpha = 0.4) +
  ggplot2::labs(
    title = "Propensity Score Distributions by Treatment Group",
    x = "Estimated Propensity Score",
    y = "Density",
    fill = "Treatment"
  ) +
  ggplot2::theme_minimal()
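A density plot can be complemented with a few numbers. The sketch below re-simulates the covariates in base R (assumed seed and sample size) and reports the score range in each group, the region of common support, and how many patients fall outside it. The 0.05 and 0.95 cutoffs in the last line are an arbitrary illustration, not a recommendation from the post.

```r
# Assumed seed and sample size; results will differ from the post's figures
set.seed(2025)
n <- 2000
age <- rnorm(n, mean = 62, sd = 12)
severity <- rnorm(n)
comorbidity <- rnorm(n)
prior_utilization <- rnorm(n)
treatment <- rbinom(n, 1, plogis(-0.8 + 0.03 * age + 1.2 * severity +
                                   0.7 * comorbidity + 0.5 * prior_utilization))

ps <- fitted(glm(treatment ~ age + severity + comorbidity + prior_utilization,
                 family = binomial()))

# Minimum (row 1) and maximum (row 2) score within each treatment group
ps_range <- sapply(split(ps, treatment), range)

# Common support: scores that both groups actually reach
lower <- max(ps_range[1, ])
upper <- min(ps_range[2, ])
outside <- ps < lower | ps > upper

ps_range
table(treatment = treatment, outside_support = outside)

# Share of patients with scores close to 0 or 1
mean(ps < 0.05 | ps > 0.95)
```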

Good overlap does not prove validity, but poor overlap is a strong warning sign.

## Matching Tries to Build Comparable Treated and Untreated Sets

One common use of propensity scores is matching.

The idea is:

for each treated unit,
find one or more untreated units with a similar propensity score,
then compare outcomes in the matched sample.

This can create a more interpretable sample where the treatment groups are directly paired or aligned.

A common choice is nearest-neighbor matching (Austin 2011).

required_pkgs <- c("MatchIt")
missing_pkgs <- required_pkgs[
  !vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
]

if (length(missing_pkgs) > 0) {
  stop("Missing packages: ", paste(missing_pkgs, collapse = ", "))
}

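# 1:1 nearest-neighbor matching on a logistic-regression propensity score.
# In current MatchIt versions, distance = "glm" with link = "logit" is the
# documented spelling of the older distance = "logit".
match_fit <- MatchIt::matchit(
  treatment ~ age + severity + comorbidity + prior_utilization,
  data = ps_df,
  method = "nearest",
  distance = "glm",
  link = "logit"
)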

summary(match_fit)

matched_df <- MatchIt::match.data(match_fit)
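To show what nearest-neighbor matching does mechanically, here is a rough base-R sketch. It is not MatchIt's implementation: it matches with replacement (so every treated patient gets a match), and the seed and sample size are assumptions. It then estimates the effect among the treated as the mean outcome difference across matched pairs.

```r
# Assumed seed and sample size; matching WITH replacement is a simplification
set.seed(2025)
n <- 1000
age <- rnorm(n, mean = 62, sd = 12)
severity <- rnorm(n)
comorbidity <- rnorm(n)
prior_utilization <- rnorm(n)
treatment <- rbinom(n, 1, plogis(-0.8 + 0.03 * age + 1.2 * severity +
                                   0.7 * comorbidity + 0.5 * prior_utilization))
y0 <- 50 - 0.2 * age - 4 * severity - 2.5 * comorbidity -
  1.5 * prior_utilization + rnorm(n, 0, 4)
outcome <- ifelse(treatment == 1, y0 + 3, y0)

# Match on the logit of the estimated propensity score
ps_model <- glm(treatment ~ age + severity + comorbidity + prior_utilization,
                family = binomial())
logit_ps <- predict(ps_model, type = "link")

treated  <- which(treatment == 1)
controls <- which(treatment == 0)

# For each treated patient, pick the untreated patient with the closest score
nearest <- vapply(
  treated,
  function(i) controls[which.min(abs(logit_ps[controls] - logit_ps[i]))],
  integer(1)
)

# Mean difference across matched pairs, next to the crude difference
att_matched <- mean(outcome[treated] - outcome[nearest])
crude <- mean(outcome[treated]) - mean(outcome[controls])
round(c(crude = crude, matched = att_matched), 2)
```

In this setup the matched estimate should land much closer to the true effect of 3 than the crude difference does, although reusing a small pool of untreated patients makes it noisy.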

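Matching is often one of the most intuitive propensity score approaches because it tries to create a pseudo-cohort of comparable treated and untreated patients. One caveat applies to this example: about 69% of the simulated patients are treated, so 1:1 matching without replacement (the MatchIt default) runs out of untreated patients and leaves many treated patients unmatched. Dropping those patients changes the population the estimate describes.

## Balance Matters More Than the Propensity Model’s Fit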

A very important practical point is that the goal of the propensity model is not predictive excellence.

It is not about maximizing classification accuracy for treatment assignment.

The goal is covariate balance.

A propensity score model that predicts treatment extremely well may actually be a warning sign if it implies poor overlap. What matters most is whether, after using the scores, the treated and untreated groups become more similar in the covariates.

That is one reason propensity score work is different from ordinary predictive modeling.

The evaluation target is balance, not treatment prediction accuracy.

## Standardized Mean Differences Are the Core Balance Diagnostic

One of the most common balance diagnostics is the standardized mean difference (SMD) (Austin 2009, 2011).

For a continuous covariate, this summarizes the difference in group means relative to pooled variability.
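In symbols, with \(\bar{x}_1, \bar{x}_0\) the group means and \(s_1^2, s_0^2\) the group variances of a covariate (notation assumed here), the version that pools by averaging the two variances is:

```latex
\[
d = \frac{\bar{x}_1 - \bar{x}_0}{\sqrt{\left(s_1^2 + s_0^2\right) / 2}}
\]
```

For example, group means of 4 and 3 with both variances equal to 4 give \(d = 1 / \sqrt{4} = 0.5\).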

In practice, analysts often compare SMDs:

before adjustment,
and after matching or weighting.

Below is a simple helper to compute SMDs.

smd_cont <- function(x, treat) {
  x1 <- x[treat == 1]
  x0 <- x[treat == 0]
  (mean(x1) - mean(x0)) / sqrt((var(x1) + var(x0)) / 2)
}

balance_before <- tibble::tibble(
  variable = c("age", "severity", "comorbidity", "prior_utilization"),
  smd_before = c(
    smd_cont(ps_df$age, ps_df$treatment),
    smd_cont(ps_df$severity, ps_df$treatment),
    smd_cont(ps_df$comorbidity, ps_df$treatment),
    smd_cont(ps_df$prior_utilization, ps_df$treatment)
  )
)

balance_before

# A tibble: 4 × 2
  variable          smd_before
  <chr>                  <dbl>
1 age                    0.294
2 severity               0.913
3 comorbidity            0.610
4 prior_utilization      0.268

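This gives a baseline picture of pre-adjustment imbalance. All four covariates exceed the commonly used 0.1 threshold, and severity (0.913) is the most imbalanced.

## Weighting Uses the Propensity Score to Reweight the Sample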

A second major approach is propensity score weighting.

The most common version for estimating the average treatment effect (ATE) uses inverse probability of treatment weighting (IPTW) (Rosenbaum and Rubin 1983; Austin 2011).

Weights are defined as:

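\[ w_i = \frac{T_i}{e(X_i)} + \frac{1 - T_i}{1 - e(X_i)} \]

so treated patients receive weight \(1 / e(X_i)\) and untreated patients receive weight \(1 / (1 - e(X_i))\).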

This reweights the sample so that the treated and untreated groups become more comparable with respect to measured covariates.

In effect, weighting creates a pseudo-population where treatment assignment is less confounded by the observed covariates.
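A tiny worked example may help with the pseudo-population idea. Assume a hypothetical stratum of 8 patients who all share a propensity score of 0.75, of whom 6 are treated.

```r
# Hypothetical stratum: 8 patients with the same propensity score
ps        <- rep(0.75, 8)
treatment <- c(1, 1, 1, 1, 1, 1, 0, 0)   # 6 of 8 treated, consistent with 0.75

# ATE weights: 1 / ps for treated, 1 / (1 - ps) for untreated
iptw <- ifelse(treatment == 1, 1 / ps, 1 / (1 - ps))

# Total weight carried by each group
weight_totals <- tapply(iptw, treatment, sum)
weight_totals
```

Each treated patient counts for 1/0.75 and each untreated patient for 1/0.25, so both groups are weighted up to represent all 8 patients in the stratum. That is the sense in which the reweighted groups describe the same population.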

ps_df <- ps_df |>
  dplyr::mutate(
    iptw = if_else(treatment == 1, 1 / ps, 1 / (1 - ps))
  )

ps_df |>
  dplyr::summarise(
    min_weight = min(iptw),
    mean_weight = mean(iptw),
    max_weight = max(iptw)
  )

# A tibble: 1 × 3
  min_weight mean_weight max_weight
       <dbl>       <dbl>      <dbl>
1       1.00        2.03       68.5

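Large weights can signal limited overlap and may destabilize estimation. Here the largest weight (68.5) is more than thirty times the mean (2.03), so a handful of patients can dominate the weighted estimate.

## Weighted Balance Should Also Be Checked, Not Assumed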

Just as matching should be evaluated by post-match balance, weighting should be evaluated by post-weighting balance.

A simple weighted mean helper can make this visible.

weighted_mean <- function(x, w) {
  sum(x * w) / sum(w)
}

weighted_var <- function(x, w) {
  mu <- weighted_mean(x, w)
  sum(w * (x - mu)^2) / sum(w)
}

weighted_smd <- function(x, treat, w) {
  x1 <- x[treat == 1]
  x0 <- x[treat == 0]
  w1 <- w[treat == 1]
  w0 <- w[treat == 0]
  
  m1 <- weighted_mean(x1, w1)
  m0 <- weighted_mean(x0, w0)
  v1 <- weighted_var(x1, w1)
  v0 <- weighted_var(x0, w0)
  
  (m1 - m0) / sqrt((v1 + v0) / 2)
}

balance_after_weight <- tibble::tibble(
  variable = c("age", "severity", "comorbidity", "prior_utilization"),
  smd_after_weighting = c(
    weighted_smd(ps_df$age, ps_df$treatment, ps_df$iptw),
    weighted_smd(ps_df$severity, ps_df$treatment, ps_df$iptw),
    weighted_smd(ps_df$comorbidity, ps_df$treatment, ps_df$iptw),
    weighted_smd(ps_df$prior_utilization, ps_df$treatment, ps_df$iptw)
  )
)

balance_after_weight

# A tibble: 4 × 2
  variable          smd_after_weighting
  <chr>                           <dbl>
1 age                           0.0712 
2 severity                     -0.0313 
3 comorbidity                   0.00916
4 prior_utilization            -0.116

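This lets us compare balance before and after weighting. Every SMD shrinks substantially: severity falls from 0.913 to -0.031, and only prior_utilization (-0.116) remains slightly outside ±0.1.

## A Love Plot Style Comparison Helps Communicate Balance Improvement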

It is often helpful to put pre- and post-adjustment balance side by side.

balance_plot_df <- balance_before |>
  dplyr::left_join(balance_after_weight, by = "variable") |>
  tidyr::pivot_longer(
    cols = c(smd_before, smd_after_weighting),
    names_to = "stage",
    values_to = "smd"
  )

ggplot2::ggplot(balance_plot_df, ggplot2::aes(x = smd, y = variable, color = stage)) +
  ggplot2::geom_point(size = 3) +
  ggplot2::geom_vline(xintercept = c(-0.1, 0.1), linetype = 2) +
  ggplot2::labs(
    title = "Covariate Balance Before and After Weighting",
    x = "Standardized Mean Difference",
    y = "Covariate",
    color = "Stage"
  ) +
  ggplot2::theme_minimal()

This kind of plot often communicates balance improvement much better than a table alone.

## Stratification Uses the Propensity Score to Form Comparable Subclasses

A third common approach is stratification, or subclassification.

The idea is to divide the sample into strata based on the propensity score, such as quintiles, and then compare treated and untreated outcomes within those strata.

If the propensity score is doing its job, units within a stratum should be more comparable in their observed covariates than the full raw sample.

ps_df <- ps_df |>
  dplyr::mutate(
    ps_quintile = dplyr::ntile(ps, 5)
  )

ps_df |>
  dplyr::count(ps_quintile, treatment)

# A tibble: 10 × 3
   ps_quintile treatment     n
         <int>     <int> <int>
 1           1         0   116
 2           1         1    44
 3           2         0    69
 4           2         1    91
 5           3         0    31
 6           3         1   129
 7           4         0    22
 8           4         1   138
 9           5         0     7
10           5         1   153
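The post stops at the stratum counts. One common way to turn strata into an effect estimate is to take the treated-minus-untreated difference within each stratum and average those differences by stratum size. The sketch below does this in base R on re-simulated data. The seed, the sample size, and the use of `cut()` on quantiles instead of `dplyr::ntile()` are assumptions.

```r
# Assumed seed and sample size; results will differ from the post's figures
set.seed(2025)
n <- 2000
age <- rnorm(n, mean = 62, sd = 12)
severity <- rnorm(n)
comorbidity <- rnorm(n)
prior_utilization <- rnorm(n)
treatment <- rbinom(n, 1, plogis(-0.8 + 0.03 * age + 1.2 * severity +
                                   0.7 * comorbidity + 0.5 * prior_utilization))
y0 <- 50 - 0.2 * age - 4 * severity - 2.5 * comorbidity -
  1.5 * prior_utilization + rnorm(n, 0, 4)
outcome <- ifelse(treatment == 1, y0 + 3, y0)

ps <- fitted(glm(treatment ~ age + severity + comorbidity + prior_utilization,
                 family = binomial()))

# Propensity score quintiles, coded 1 to 5
stratum <- cut(ps, breaks = quantile(ps, probs = 0:5 / 5),
               include.lowest = TRUE, labels = FALSE)

# Treated-minus-untreated mean outcome within each quintile
diff_by_stratum <- sapply(1:5, function(s) {
  in_s <- stratum == s
  mean(outcome[in_s & treatment == 1]) - mean(outcome[in_s & treatment == 0])
})

# Average the within-stratum differences, weighting by stratum size
stratum_share <- as.numeric(table(stratum)) / n
ate_strat <- sum(diff_by_stratum * stratum_share)

crude <- mean(outcome[treatment == 1]) - mean(outcome[treatment == 0])
round(c(crude = crude, stratified = ate_strat), 2)
```

Five strata usually remove most but not all of the confounding, because patients within a quintile still differ somewhat in their scores.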

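Note that the fifth quintile in the counts above holds only 7 untreated patients against 153 treated, so its within-stratum comparison rests on very little information. With that caveat, stratification is often a useful middle ground: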

more structured than crude adjustment,
less sample-pruning than matching,
and often easy to explain.

## Estimating the Effect After Weighting Is Straightforward

Once weights are available, the treatment effect can be estimated in a weighted outcome model.

fit_weighted <- lm(outcome ~ treatment, data = ps_df, weights = iptw)

summary(fit_weighted)$coefficients

             Estimate Std. Error    t value     Pr(>|t|)
(Intercept) 37.096521  0.3088440 120.114124 0.000000e+00
treatment    2.898237  0.4436969   6.532021 1.154953e-10

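This is one simple ATE-style estimate under IPTW: about 2.90, close to the true effect of 3 built into the simulation and far from the crude difference of -2.92.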
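The treatment coefficient from a weighted regression on a single binary indicator is just the difference between the two weighted group means. The sketch below checks that equivalence on re-simulated data and adds a simple nonparametric bootstrap that re-estimates the propensity model in every resample. The bootstrap is a swapped-in illustration of one way to obtain a standard error, not the post's method; the helper names `iptw_weights` and `weighted_diff`, the seed, the sample size, and the 200 resamples are all assumptions.

```r
# Assumed seed and sample size; results will differ from the post's figures
set.seed(2025)
n <- 1000
dat <- data.frame(
  age = rnorm(n, mean = 62, sd = 12),
  severity = rnorm(n),
  comorbidity = rnorm(n),
  prior_utilization = rnorm(n)
)
dat$treatment <- rbinom(n, 1, with(dat, plogis(-0.8 + 0.03 * age + 1.2 * severity +
                                                 0.7 * comorbidity + 0.5 * prior_utilization)))
dat$outcome <- with(dat, 50 - 0.2 * age - 4 * severity - 2.5 * comorbidity -
                      1.5 * prior_utilization + rnorm(n, 0, 4) + 3 * treatment)

# Hypothetical helper: fit the propensity model and return ATE weights
iptw_weights <- function(d) {
  ps <- fitted(glm(treatment ~ age + severity + comorbidity + prior_utilization,
                   data = d, family = binomial()))
  ifelse(d$treatment == 1, 1 / ps, 1 / (1 - ps))
}

# Hypothetical helper: difference in weighted mean outcomes
weighted_diff <- function(d, w) {
  t1 <- d$treatment == 1
  weighted.mean(d$outcome[t1], w[t1]) - weighted.mean(d$outcome[!t1], w[!t1])
}

dat$w <- iptw_weights(dat)
weighted_means_diff <- weighted_diff(dat, dat$w)
lm_coef <- coef(lm(outcome ~ treatment, data = dat, weights = w))[["treatment"]]

# Bootstrap: resample patients, refit the propensity model, re-estimate
boot_est <- replicate(200, {
  d <- dat[sample.int(n, replace = TRUE), ]
  weighted_diff(d, iptw_weights(d))
})
boot_se <- sd(boot_est)

round(c(weighted_means = weighted_means_diff, lm = lm_coef, boot_se = boot_se), 3)
```

Refitting the propensity model inside each resample is the design choice that matters here: it lets the standard error reflect the fact that the weights were estimated rather than known.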

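The standard errors printed by lm() treat the weights as fixed and known, and ignore that the propensity scores were estimated. In real applied work, analysts therefore usually prefer robust (sandwich) standard errors or more specialized weighted estimators, but the essential idea is the same: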

first balance the groups,
then estimate the effect in the reweighted sample.

## Matching, Weighting, and Stratification Target Slightly Different Analytic Goals

These three methods are closely related, but they are not identical.

Matching

Often targets a matched comparison that may be closer to the average treatment effect in the treated (ATT), depending on design.

Weighting

Often used to target the ATE or ATT, depending on the weights used.

Stratification

Provides a subclass-based approximation to adjusted comparison.
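For weighting, the estimand is set by the weight formula. A minimal sketch follows, using the standard ATE weights and the standard ATT weights (treated patients get weight 1, untreated patients get the odds of treatment). The helper name `ps_weights` and the four toy values are hypothetical.

```r
# Hypothetical helper: propensity score weights for a chosen estimand
ps_weights <- function(treatment, ps, estimand = c("ATE", "ATT")) {
  estimand <- match.arg(estimand)
  if (estimand == "ATE") {
    # Both groups are reweighted to the full sample
    ifelse(treatment == 1, 1 / ps, 1 / (1 - ps))
  } else {
    # Untreated patients are reweighted to resemble the treated group
    ifelse(treatment == 1, 1, ps / (1 - ps))
  }
}

# Toy values (hypothetical)
treatment <- c(1, 1, 0, 0)
ps        <- c(0.8, 0.5, 0.5, 0.2)

w_ate <- ps_weights(treatment, ps, "ATE")   # 1.25 2.00 2.00 1.25
w_att <- ps_weights(treatment, ps, "ATT")   # 1.00 1.00 1.00 0.25
rbind(w_ate, w_att)
```

Under ATT weights, an untreated patient with a low propensity score counts for little, because few treated patients look like that patient.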

So the choice is not merely technical. It depends on:

the target estimand,
overlap quality,
sample size,
and interpretability needs.

That is why it is important to define the causal estimand clearly before choosing the method.

## Overlap Problems Can Make Propensity Methods Fragile

A major vulnerability in propensity score analysis is poor overlap.

If some patients are almost certain to be treated and others are almost certain not to be treated, then causal comparison in those regions becomes weak.

This can produce:

unstable weights,
poor matches,
and unreliable extrapolation.

That is why analysts sometimes:

trim extreme propensity scores,
restrict to regions of common support,
or report that the causal question is only answerable in the overlapping population.
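The first two options can be sketched in a few lines. The example below re-simulates the covariates (assumed seed and sample size), summarizes how much information the weights leave using the Kish effective sample size (a diagnostic not used in the post), and then trims to an assumed score window of 0.05 to 0.95. The window is an arbitrary illustration.

```r
# Assumed seed and sample size; results will differ from the post's figures
set.seed(2025)
n <- 2000
age <- rnorm(n, mean = 62, sd = 12)
severity <- rnorm(n)
comorbidity <- rnorm(n)
prior_utilization <- rnorm(n)
treatment <- rbinom(n, 1, plogis(-0.8 + 0.03 * age + 1.2 * severity +
                                   0.7 * comorbidity + 0.5 * prior_utilization))

ps <- fitted(glm(treatment ~ age + severity + comorbidity + prior_utilization,
                 family = binomial()))
iptw <- ifelse(treatment == 1, 1 / ps, 1 / (1 - ps))

# Kish effective sample size: the number of equally weighted patients
# that would carry the same information as the weighted group
ess <- function(w) sum(w)^2 / sum(w^2)
ess_by_group <- tapply(iptw, treatment, ess)
n_by_group <- table(treatment)
rbind(n = as.numeric(n_by_group), ess = as.numeric(ess_by_group))

# Trim to an assumed window; the largest possible weight is then 1 / 0.05 = 20
keep <- ps >= 0.05 & ps <= 0.95
c(max_weight_all = max(iptw),
  max_weight_trimmed = max(iptw[keep]),
  n_dropped = sum(!keep))
```

After trimming, the analysis describes only the retained patients, and in practice the propensity model is often refit on that subset.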

This is not a minor detail. It is one of the most important limitations in observational causal work.

## Propensity Scores Only Balance Measured Confounders

One of the most important cautions is that propensity score methods do not solve unmeasured confounding.

They only adjust for variables that were:

measured,
included,
and modeled adequately.

That means a beautifully balanced matched or weighted sample can still be biased if an important confounder was omitted or poorly captured.
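A small simulation can make this concrete. In the sketch below, everything is hypothetical: `x` is a measured confounder, `u` is an unmeasured one, the true effect is 3, and the propensity model can only use `x`. The standardized gap divides by the unweighted standard deviation, which is a simplification.

```r
# Hypothetical data-generating process with one unmeasured confounder
set.seed(2025)
n <- 5000
x <- rnorm(n)   # measured confounder
u <- rnorm(n)   # unmeasured confounder
treatment <- rbinom(n, 1, plogis(0.8 * x + 0.8 * u))
outcome <- 3 * treatment - 2 * x - 2 * u + rnorm(n)

# The propensity model sees only x
ps <- fitted(glm(treatment ~ x, family = binomial()))
w <- ifelse(treatment == 1, 1 / ps, 1 / (1 - ps))
t1 <- treatment == 1

# Weighted difference in means, scaled by the unweighted SD (simplified SMD)
weighted_gap <- function(v) {
  (weighted.mean(v[t1], w[t1]) - weighted.mean(v[!t1], w[!t1])) / sd(v)
}

smd_x <- weighted_gap(x)   # measured: expected to be close to zero
smd_u <- weighted_gap(u)   # unmeasured: expected to remain imbalanced

# IPTW effect estimate
est <- weighted.mean(outcome[t1], w[t1]) - weighted.mean(outcome[!t1], w[!t1])

round(c(smd_x = smd_x, smd_u = smd_u, estimate = est, truth = 3), 2)
```

The balance table an analyst could actually compute would contain only `x` and would look fine, while the estimate still falls short of the true effect because `u` remains imbalanced.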

This is why causal claims from propensity score methods should always be paired with careful discussion of:

design,
covariate measurement quality,
and residual confounding risk.

Propensity scores are powerful tools, but they are not magic.

## Propensity Scores Now Integrate Naturally with ML Workflows

Modern causal ML often extends classical propensity score methods using flexible learners to estimate the treatment probability.

For example, instead of logistic regression, the propensity model might use:

random forests,
gradient boosting,
super learner,
or other ensemble approaches.

This can improve fit when treatment assignment is complex and nonlinear.

But the same principle remains:

the goal is balance, not prediction for its own sake.

This is why even when ML is used to estimate propensity scores, classical balance diagnostics still matter.

That is one of the strongest bridges between traditional causal inference and modern AI methods.

Trauma Registry Application: Propensity Scores for Comparative Effectiveness

Trauma registries are a natural setting for propensity score methods because treatment decisions are driven by patient severity, site capacity, and clinician judgment — not randomization.

Classic examples where propensity scores are used in trauma research:

Transfer decisions: patients transferred to Level I centers versus treated locally differ systematically in injury severity, distance, and resource availability.
Massive transfusion protocol (MTP): patients who receive MTP are already more severely injured — naïve comparisons confound treatment with severity.
Surgical timing: early versus delayed operative intervention is shaped by injury pattern, hemodynamic stability, and resource availability.

In each case, the propensity score compresses the confounding structure into a single balancing score (Rosenbaum and Rubin 1983), allowing matched or weighted comparisons that are more credible than crude group differences.

Balance diagnostics — especially SMDs before and after adjustment — are the standard of evidence that your comparison is defensible (Austin 2009).

A Practical Checklist for Applied Work

Before reporting a propensity score analysis, ask:

What is the causal estimand: ATE, ATT, or something else?
Which variables plausibly confound treatment and outcome?
Is the propensity model specified using only pre-treatment covariates?
Is there reasonable overlap between treatment groups?
Did matching, weighting, or stratification actually improve balance?
Are extreme weights or poor matches creating instability?
Could unmeasured confounding still materially bias the result?

These questions often matter more than whether the propensity score model itself “looks good.”

Where This Shows Up in AI/ML

MAVEN platform analyses comparing treatment protocols across military treatment facilities (massive transfusion ratios, tourniquet-to-OR time, use of resuscitative endovascular balloon occlusion of the aorta, or REBOA) require propensity adjustment because patient selection into treatment reflects injury severity, not random assignment. An unadjusted comparison will show that the most aggressive interventions are associated with the worst outcomes, simply because they are applied to the most injured patients. Propensity score weighting constructs a pseudo-population in which treatment assignment is independent of the measured severity covariates, making the comparison interpretable. The common failure mode is skipping the balance check: analysts who build a propensity model but never verify covariate balance after weighting have no evidence the adjustment worked.

Closing: Propensity Scores Help Rebuild Comparability in Observational Data

Propensity score methods remain central in causal inference because they offer a principled way to address one of the core problems of observational data:

treated and untreated groups are rarely comparable at baseline.

Matching builds more similar comparison groups. Weighting creates a pseudo-population with improved covariate balance. Stratification organizes comparison within propensity-defined subclasses.

All three methods aim at the same broader goal:

to make the observed comparison more closely resemble the counterfactual comparison we wish we had.

Propensity scores matter because causal inference in observational data is not only about modeling outcomes, but about restoring fairness to the treatment comparison before the outcome model is even trusted.

📚 Go Deeper: Causal Inference Toolkit

This post is part of the Causal Inference Toolkit — a companion reference with propensity score estimation templates, balance diagnostic code, IPTW scaffolds, and Love plot workflows for observational comparative effectiveness analyses.

References

Austin, Peter C. 2009. “Balance Diagnostics for Comparing the Distribution of Baseline Covariates Between Treatment Groups in Propensity-Score Matched Samples.” Statistics in Medicine 28 (25): 3083–107. https://doi.org/10.1002/sim.3697.

Austin, Peter C. 2011. “An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies.” Multivariate Behavioral Research 46 (3): 399–424. https://doi.org/10.1080/00273171.2011.568786.

Rosenbaum, Paul R., and Donald B. Rubin. 1983. “The Central Role of the Propensity Score in Observational Studies for Causal Effects.” Biometrika 70 (1): 41–55. https://doi.org/10.1093/biomet/70.1.41.