Taming Confounders: Bias-Free Insights from Real-World Data

Advanced Statistics

A practical introduction to confounding, causal diagrams, standardization, and sensitivity analysis in real-world evidence.

Published: January 15, 2026

Modified: June 9, 2026

Executive Summary

Real-world evidence can be powerful because it reflects how care, behavior, and outcomes unfold outside tightly controlled experiments.

But it also comes with a major vulnerability:

treated and untreated groups (or exposed and unexposed groups) often differ before treatment ever begins.

Those differences can generate confounding and other forms of bias (Greenland et al. 1999; Hernán et al. 2002).

If they are not handled carefully, an observational analysis can mistake:

selection for effect,
baseline risk for treatment benefit,
or healthcare utilization patterns for causal signal.

This is why confounding adjustment matters so much in real-world evidence.

This post introduces:

what confounding is,
how directed acyclic graphs (DAGs) help identify adjustment sets (Greenland et al. 1999),
regression adjustment,
standardization,
and E-values as a way to assess sensitivity to unmeasured confounding (VanderWeele and Ding 2017).

The broader goal is not only statistical correction. It is causal discipline.

Confounding matters because observational data rarely compare like with like, and without adjustment, the estimated effect may reflect who got treated rather than what treatment did.

Real-World Evidence Is Valuable Because It Is Real — and Vulnerable Because It Is Real

Real-world evidence often uses data from:

clinical practice,
claims,
registries,
electronic health records,
and other nonrandomized settings.

This is valuable because it reflects:

heterogeneous patients,
pragmatic care patterns,
and outcomes in less controlled environments.

But the same realism creates vulnerability.

Patients are not assigned treatments randomly. They enter data systems unevenly. They differ in severity, access, adherence, surveillance intensity, and follow-up.

That means bias adjustment is not optional. It is one of the main conditions for making observational evidence interpretable.

Confounding Happens When a Third Variable Distorts the Exposure–Outcome Relationship

A confounder is a variable that affects both (Hernán et al. 2002):

the exposure or treatment,
and the outcome.

If it is not properly accounted for, the estimated exposure–outcome association may be biased.

For example, in a pharmacoepidemiology setting:

sicker patients may be more likely to receive a stronger therapy,
and disease severity may also increase the risk of the adverse outcome.

If we compare treated and untreated patients without accounting for baseline severity, we may attribute severity-driven risk to the treatment itself.

That is confounding.
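
A minimal simulation can make this concrete. The sketch below is an illustration under stated assumptions, not part of the original analysis: the variable names, coefficients, sample size, and seed are all hypothetical, and the treatment is given no effect at all, so any crude association is produced by severity alone.

```r
# Hypothetical illustration: treatment has NO effect on outcome here.
set.seed(2026)   # arbitrary seed, used only for reproducibility
n <- 20000       # assumed sample size

severity <- rnorm(n)                                     # baseline severity
treated  <- rbinom(n, 1, plogis(1.0 * severity))         # sicker patients are treated more often
outcome  <- rbinom(n, 1, plogis(-1.5 + 1.0 * severity))  # outcome depends on severity only

# Crude risk difference: treated minus untreated
crude_rd <- mean(outcome[treated == 1]) - mean(outcome[treated == 0])

# Adjusted log-odds ratio for treatment, conditioning on severity
adj_fit    <- glm(outcome ~ treated + severity, family = binomial())
adj_log_or <- coef(adj_fit)[["treated"]]

round(c(crude_rd = crude_rd, adj_log_or = adj_log_or), 3)
```

Under these assumptions the crude risk difference should be clearly positive while the adjusted log-odds ratio should sit close to zero, which is the signature of confounding by severity.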

Confounding Is Not Just a Regression Problem — It Is a Design Problem

A common mistake is to think of confounding as something that can always be fixed automatically by “adding covariates to the model.”

That is too simplistic.

Confounding starts earlier than the model. It is a feature of how treatment or exposure was assigned in the data-generating process.

That is why good confounding adjustment begins with a conceptual question:

what variables create backdoor paths between exposure and outcome?

This is one reason DAGs are so useful. They help the analyst think about causal structure before touching the regression formula.

DAGs Help Make Confounding Structure Explicit

A directed acyclic graph (DAG) is a graphical representation of causal assumptions.

In a DAG:

arrows represent assumed causal influence,
nodes represent variables,
and the graph helps identify which paths create confounding.

For a simple observational treatment example, a DAG might look like:

baseline severity affects treatment
baseline severity affects outcome
treatment affects outcome

That creates a backdoor path from treatment to outcome through severity.

Blocking that path is the goal of adjustment.

This is why DAGs are not just diagrams. They are tools for deciding what to adjust for — and what not to adjust for.

A Simple RWE-Style Example Makes the Problem Concrete

To illustrate, we will simulate an observational healthcare dataset in which treatment assignment depends on baseline severity and comorbidity.

This is exactly the kind of setting where naïve analyses can mistake confounding for effect.

library(dplyr)
library(tibble)
library(ggplot2)

n <- 1000

rwe_df <- tibble::tibble(
  age = rnorm(n, mean = 64, sd = 11),
  severity = rnorm(n, mean = 0, sd = 1),
  comorbidity = rnorm(n, mean = 0, sd = 1),
  utilization = rnorm(n, mean = 0, sd = 1)
) |>
  dplyr::mutate(
    treat_prob = plogis(-1.0 + 0.03 * age + 1.1 * severity + 0.8 * comorbidity + 0.6 * utilization),
    treatment = rbinom(n, size = 1, prob = treat_prob),
    outcome_prob = plogis(-2.0 + 0.5 * treatment + 1.2 * severity + 0.9 * comorbidity + 0.5 * utilization),
    outcome = rbinom(n, size = 1, prob = outcome_prob)
  )

rwe_df |>
  dplyr::summarise(
    treatment_rate = mean(treatment),
    outcome_rate = mean(outcome)
  )

# A tibble: 1 × 2
  treatment_rate outcome_rate
           <dbl>        <dbl>
1          0.674        0.259

This creates a setting with confounding by indication and utilization intensity.
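
One common way to see this imbalance before any modeling is the standardized mean difference (SMD) of each covariate between treatment groups. The sketch below is self-contained and hypothetical: it re-simulates two covariates with the same treatment-model coefficients used above, and the helper `smd()`, the seed, and the sample size are assumptions rather than part of the original analysis.

```r
# Hypothetical balance check using standardized mean differences (SMD).
set.seed(101)   # arbitrary seed
n <- 5000       # assumed sample size

severity    <- rnorm(n)
utilization <- rnorm(n)
treatment   <- rbinom(n, 1, plogis(1.1 * severity + 0.6 * utilization))

# SMD: difference in group means divided by the pooled standard deviation
smd <- function(x, group) {
  m1 <- mean(x[group == 1])
  m0 <- mean(x[group == 0])
  pooled_sd <- sqrt((var(x[group == 1]) + var(x[group == 0])) / 2)
  (m1 - m0) / pooled_sd
}

smd_severity    <- smd(severity, treatment)
smd_utilization <- smd(utilization, treatment)

round(c(severity = smd_severity, utilization = smd_utilization), 2)
```

Absolute SMDs above roughly 0.1 are a commonly cited rule of thumb for meaningful imbalance. Here both covariates should exceed it, with severity showing the larger gap because it has the larger coefficient in the treatment model.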

The Crude Association Mixes Treatment Effect and Confounding

Before adjusting, let us look at the naïve treatment–outcome association.

crude_tbl <- rwe_df |>
  dplyr::group_by(treatment) |>
  dplyr::summarise(
    outcome_risk = mean(outcome),
    .groups = "drop"
  ) |>
  tidyr::pivot_wider(
    names_from = treatment,
    values_from = outcome_risk
  ) |>
  dplyr::mutate(
    crude_risk_difference = `1` - `0`
  )

crude_tbl

# A tibble: 1 × 3
     `0`   `1` crude_risk_difference
   <dbl> <dbl>                 <dbl>
1 0.0859 0.343                 0.257

This crude comparison does not isolate the treatment effect.

It reflects both:

the treatment,
and the fact that higher-risk patients were more likely to be treated.

That is the central confounding problem.

DAG Thinking Helps Decide What to Adjust For

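In this example, the variables that drive treatment assignment are:

age,
severity,
comorbidity,
utilization.

Severity, comorbidity, and utilization also drive outcome in the simulation, so they are confounders. Age affects only treatment assignment in the simulated data, but an analyst who cannot see the data-generating process would reasonably assume it also affects outcome, so the DAG below includes it in the adjustment set.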

A simplified DAG structure might be (Greenland et al. 1999):

age → treatment
age → outcome
severity → treatment
severity → outcome
comorbidity → treatment
comorbidity → outcome
utilization → treatment
utilization → outcome
treatment → outcome

This kind of diagram clarifies the adjustment target: block the backdoor paths from treatment to outcome.

In practice, tools like dagitty are especially useful for this.

# Check that the DAG packages are installed before using them
required_pkgs <- c("dagitty", "ggdag")
missing_pkgs <- required_pkgs[
  !vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
]

if (length(missing_pkgs) > 0) {
  stop("Missing packages: ", paste(missing_pkgs, collapse = ", "))
}

dag <- dagitty::dagitty("
dag {
  age -> treatment
  age -> outcome
  severity -> treatment
  severity -> outcome
  comorbidity -> treatment
  comorbidity -> outcome
  utilization -> treatment
  utilization -> outcome
  treatment -> outcome
}
")

dagitty::adjustmentSets(dag, exposure = "treatment", outcome = "outcome")
ggdag::ggdag(dag)

This is a very practical way to support adjustment decisions visually.

Regression Adjustment Is One of the Most Familiar Strategies

A common approach is to adjust for confounders in a regression model.

For a binary outcome, logistic regression is often used.

fit_adj <- glm(
  outcome ~ treatment + age + severity + comorbidity + utilization,
  data = rwe_df,
  family = binomial()
)

summary(fit_adj)$coefficients

                Estimate  Std. Error   z value     Pr(>|z|)
(Intercept) -1.532162547 0.529829752 -2.891802 3.830397e-03
treatment    0.682437980 0.252954982  2.697863 6.978608e-03
age         -0.009045539 0.008115531 -1.114596 2.650236e-01
severity     1.144755054 0.110757058 10.335730 4.856994e-25
comorbidity  0.924762200 0.101876995  9.077243 1.113602e-19
utilization  0.355843055 0.092091933  3.863998 1.115461e-04

The treatment coefficient (about 0.68 on the log-odds scale, compared with the value of 0.5 used in the simulation) now reflects an association adjusted for the measured confounders.

This is often a useful start, but it is important to remember:

regression adjustment is only as good as the measured covariates and model form.

It does not remove unmeasured confounding.
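
To report the adjusted treatment effect on the odds-ratio scale, the coefficient can be exponentiated together with a Wald interval. The sketch below hard-codes the estimate and standard error printed in the coefficient table above rather than refitting the model; the 95% level and the object names are assumptions.

```r
# Values copied from the treatment row of the coefficient table above
est <- 0.682437980   # log-odds ratio for treatment
se  <- 0.252954982   # its standard error

z <- qnorm(0.975)    # critical value for an assumed 95% Wald interval

or_ci <- exp(c(
  odds_ratio = est,
  lower_95   = est - z * se,
  upper_95   = est + z * se
))

round(or_ci, 2)

# Conditional odds ratio implied by the simulation's treatment coefficient of 0.5
true_or <- exp(0.5)
```

This gives an odds ratio of about 1.98 with an interval of roughly 1.21 to 3.25, which covers the simulated conditional odds ratio of exp(0.5), about 1.65.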

Standardization Is a More Explicit Way to Build a Fair Comparison

Another useful adjustment approach is standardization.

The idea is:

model the outcome as a function of treatment and confounders
predict each person’s outcome under treatment
predict each person’s outcome under no treatment
average across the sample

This produces a marginal contrast that is easier to interpret causally than a raw coefficient alone.

pred_treat <- rwe_df |>
  dplyr::mutate(treatment = 1)

pred_control <- rwe_df |>
  dplyr::mutate(treatment = 0)

risk_if_treated <- mean(predict(fit_adj, newdata = pred_treat, type = "response"))
risk_if_control <- mean(predict(fit_adj, newdata = pred_control, type = "response"))

tibble::tibble(
  standardized_risk_treated = risk_if_treated,
  standardized_risk_control = risk_if_control,
  standardized_risk_difference = risk_if_treated - risk_if_control
)

# A tibble: 1 × 3
  standardized_risk_treated standardized_risk_control standardized_risk_differ…¹
                      <dbl>                     <dbl>                      <dbl>
1                     0.276                     0.188                     0.0880
# ℹ abbreviated name: ¹standardized_risk_difference

This is often one of the clearest ways to connect regression adjustment to a causal estimand.
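
A standardized risk difference is a function of the whole fitted model, so its uncertainty is usually obtained by resampling rather than read from a coefficient table. The sketch below is one possible approach, a nonparametric percentile bootstrap, shown on a small hypothetical dataset with a single confounder; the data, the helper `standardized_rd()`, the seed, and the 200 replicates are all assumptions, not the original analysis.

```r
# Hypothetical data with one confounder, simulated for illustration only
set.seed(42)   # arbitrary seed
n <- 4000      # assumed sample size
sim <- data.frame(severity = rnorm(n))
sim$treatment <- rbinom(n, 1, plogis(0.8 * sim$severity))
sim$outcome   <- rbinom(n, 1, plogis(-1.5 + 0.5 * sim$treatment + 1.0 * sim$severity))

# Fit the outcome model, predict everyone under both treatment values, average
standardized_rd <- function(d) {
  fit <- glm(outcome ~ treatment + severity, data = d, family = binomial())
  p1 <- predict(fit, newdata = transform(d, treatment = 1), type = "response")
  p0 <- predict(fit, newdata = transform(d, treatment = 0), type = "response")
  mean(p1) - mean(p0)
}

point_est <- standardized_rd(sim)

# Percentile bootstrap: resample rows with replacement and refit each time
boot_rd <- replicate(200, {
  idx <- sample(nrow(sim), replace = TRUE)
  standardized_rd(sim[idx, ])
})
boot_ci <- quantile(boot_rd, c(0.025, 0.975))

round(c(estimate = point_est, boot_ci), 3)
```

Refitting the model inside every replicate matters: reusing the original fit would ignore the uncertainty in the model coefficients. In applied work, more replicates than the 200 assumed here are typically used.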

Standardization Answers a Population-Level Counterfactual Question

The key strength of standardization is that it maps directly onto the causal question:

what would the average outcome risk be if everyone were treated versus if no one were treated, holding the covariate distribution fixed?

That is a counterfactual population question.

This is why standardization is often more intuitively causal than simply reading a treatment coefficient from a regression table.

It makes the intervention contrast explicit.
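
In symbols, and as a sketch under assumed notation that the post itself does not use (Y for outcome, A for treatment, L_i for the measured covariates of person i, n for the sample size), the standardized risk difference computed above is:

\[ \widehat{RD}_{\text{std}} = \frac{1}{n}\sum_{i=1}^{n} \hat{P}(Y = 1 \mid A = 1, L_i) - \frac{1}{n}\sum_{i=1}^{n} \hat{P}(Y = 1 \mid A = 0, L_i) \]

Each sum averages model-based predictions over the same n people, which is what holding the covariate distribution fixed means in practice. With the estimates reported earlier, this is 0.276 - 0.188 = 0.088.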

Confounding Adjustment Depends on Measuring the Right Variables

One of the hardest parts of real-world evidence is not statistical implementation. It is deciding which variables actually need adjustment.

A good adjustment set should generally include:

pre-treatment confounders,
variables that affect both treatment and outcome,
and variables that help reduce residual confounding.

But not every variable should be adjusted for.

Analysts should be cautious about adjusting for:

mediators,
colliders,
or variables affected by treatment.

That is another reason DAGs matter. They help distinguish confounders from variables that could create new bias if adjusted inappropriately.

Overadjustment and Collider Adjustment Are Real Risks

Not every covariate belongs in the model.

For example:

a mediator lies on the causal pathway from treatment to outcome
a collider is a common effect of two variables on a path, for example a variable influenced by both treatment and outcome, or by treatment and another cause of outcome

Adjusting for these can distort the causal question.

This is why the phrase “adjust for everything available” is often poor causal advice.

Confounding adjustment is not about maximizing variable count. It is about choosing the right variables based on causal structure.

That is a major conceptual difference between causal modeling and pure prediction.
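
The collider risk is easy to demonstrate by simulation. In the hypothetical sketch below, treatment and outcome are independent by construction, and a third variable is caused by both; all names, coefficients, and the seed are assumptions chosen for illustration.

```r
# Hypothetical illustration: adjusting for a collider creates an association
set.seed(7)   # arbitrary seed
n <- 20000    # assumed sample size

treatment <- rbinom(n, 1, 0.5)
outcome   <- rnorm(n)                        # independent of treatment by construction
collider  <- treatment + outcome + rnorm(n)  # common effect of treatment and outcome

# Without the collider: the treatment coefficient should be near zero
coef_unadjusted <- coef(lm(outcome ~ treatment))[["treatment"]]

# With the collider in the model: a spurious negative coefficient appears
coef_collider_adjusted <- coef(lm(outcome ~ treatment + collider))[["treatment"]]

round(c(unadjusted = coef_unadjusted, collider_adjusted = coef_collider_adjusted), 3)
```

With this setup the collider-adjusted coefficient should land near -0.5 even though the true effect is zero, so the extra covariate creates bias that was not there before.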

E-Values Help Quantify Sensitivity to Unmeasured Confounding

Even after careful adjustment, an important question remains:

how strong would an unmeasured confounder need to be to explain away the observed association?

One tool for this is the E-value.

The E-value is a sensitivity metric that quantifies the minimum strength of association, on the risk-ratio scale, that an unmeasured confounder would need to have with both:

the treatment,
and the outcome,

to fully explain away the observed effect estimate, conditional on the measured covariates.

This does not solve unmeasured confounding. But it does help calibrate how fragile or robust the adjusted result may be.

E-Values Are Most Commonly Framed for Ratio Measures

E-values are most naturally computed for ratio-based measures such as:

risk ratios,
odds ratios,
or hazard ratios.

A simple example is easiest with a ratio estimate.

Below, we reuse the adjusted logistic model fitted earlier and extract the treatment odds ratio.

or_treatment <- exp(coef(fit_adj)[["treatment"]])

tibble::tibble(
  adjusted_odds_ratio = or_treatment
)

# A tibble: 1 × 1
  adjusted_odds_ratio
                <dbl>
1                1.98

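For a risk ratio above 1, the E-value formula introduced by VanderWeele and Ding (2017) is:

\[ E = RR + \sqrt{RR(RR - 1)} \]

The formula is defined on the risk-ratio scale. An odds ratio approximates a risk ratio only when the outcome is rare, and here the outcome occurs in about 26% of patients, so plugging the odds ratio in directly overstates the E-value. For common outcomes, VanderWeele and Ding suggest first approximating the risk ratio by the square root of the odds ratio. The direct calculation below is kept because it illustrates the mechanics.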

evalue_fn <- function(rr) {
  rr + sqrt(rr * (rr - 1))
}

if (or_treatment > 1) {
  e_val <- evalue_fn(or_treatment)
} else {
  e_val <- evalue_fn(1 / or_treatment)
}

tibble::tibble(
  e_value = e_val
)

# A tibble: 1 × 1
  e_value
    <dbl>
1    3.37

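Taken at face value, an E-value of 3.37 means that an unmeasured confounder would need to be associated with both treatment and outcome by a risk ratio of about 3.37 each, beyond the measured covariates, to fully explain away the treatment association.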
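
Because the outcome in this example is common (about 26%), the odds ratio is further from 1 than the corresponding risk ratio. VanderWeele and Ding (2017) suggest approximating the risk ratio by the square root of the odds ratio in that situation. The sketch below compares the two calculations using the rounded odds ratio reported above; the helper name `evalue_rr()` is an assumption.

```r
# E-value on the risk-ratio scale; ratios below 1 are inverted first
evalue_rr <- function(rr) {
  if (rr < 1) rr <- 1 / rr
  rr + sqrt(rr * (rr - 1))
}

or_hat <- 1.98             # adjusted odds ratio reported above (rounded)

# Direct plug-in of the odds ratio (reasonable only for rare outcomes)
e_direct <- evalue_rr(or_hat)

# Square-root approximation of the risk ratio for common outcomes
rr_approx <- sqrt(or_hat)
e_sqrt    <- evalue_rr(rr_approx)

round(c(direct = e_direct, sqrt_approx = e_sqrt), 2)
```

The direct calculation gives about 3.37, while the square-root approximation gives about 2.16. The more conservative figure is the better guide here, since a moderately strong unmeasured confounder could plausibly reach it.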

E-Values Do Not Replace Design, but They Improve Transparency

It is important not to overstate what E-values do.

They do not prove that unmeasured confounding is absent.

They do not identify the hidden confounder.

They do help answer a useful question:

would it take a weak, moderate, or very strong unmeasured confounder to overturn the finding?

That makes them useful for communicating sensitivity in real-world evidence settings where perfect covariate capture is unrealistic.

They are best viewed as a complement to thoughtful design, not a substitute for it.

Confounding Adjustment Also Matters for Predictive AI

Even when the primary goal is prediction, confounding still matters.

Why?

Because predictive models trained on observational data may learn:

treatment patterns,
surveillance patterns,
prescribing habits,
or healthcare access differences,

rather than the substantive signal analysts think they are modeling.

This can create:

misleading feature importance,
brittle deployment,
unfair subgroup behavior,
and spurious policy recommendations.

This is especially important in pharmacovigilance, comparative effectiveness, and observational healthcare AI.

Good predictive performance does not automatically mean the model learned the right structure.

Fair AI in Observational Settings Requires Bias Awareness

In observational healthcare AI, fairness is not only about subgroup error rates.

It is also about whether the model reproduces biased treatment or documentation patterns that arose from confounded data.

For example:

a model may appear to “predict” adverse events,
but may partly be predicting who gets monitored more closely,
who gets treated earlier,
or who has more complete records.

This is why confounding adjustment and bias awareness matter even in ML settings that are not framed explicitly as causal.

Observational data embed human and system processes. Models can learn those processes unless the analyst thinks carefully about them.

A DAG-First Workflow Often Improves Real-World Evidence Analysis

A strong RWE workflow often looks like this:

define the treatment and outcome clearly
draw a DAG representing the assumed causal structure
identify a valid adjustment set
check variable availability and timing
perform regression or standardization adjustment
assess sensitivity to unmeasured confounding

This is a much stronger workflow than:

“throw all available covariates into the model and hope for the best.”

DAG-based thinking forces the analyst to make causal assumptions explicit before adjustment begins.

That is usually a major improvement in transparency.

A Practical Checklist for Applied Work

Before claiming a bias-adjusted effect from RWE, ask:

What are the likely confounders?
Have I represented the assumed structure with a DAG?
Am I adjusting only for appropriate pre-treatment variables?
Would regression adjustment alone be sufficient, or should standardization also be used?
Could collider or mediator adjustment create bias?
How sensitive is the result to possible unmeasured confounding?
Would an E-value or other sensitivity metric clarify robustness?

These questions often determine whether the analysis is truly causal or only cosmetically adjusted.

Where This Shows Up in AI/ML

Collider bias is one of the least-recognized failure modes in trauma AI. Training a mortality model on Department of Defense Trauma Registry (DoDTR) records conditions the sample on a collider: injury severe enough to enter the registry. That conditioning induces spurious associations between predictors, and those associations vanish when the model is applied to a broader casualty population. A model trained only on patients who reached a Role 2 or Role 3 facility will learn correlations between, say, mechanism of injury and hemorrhagic shock that reflect the selection process, not the underlying biology. Epic’s Sepsis Prediction Model has been criticized for similar issues: training on hospitalized patients whose admission was itself a downstream consequence of the predictors. When collider-selected training data is used to build a model intended for deployment across a wider population, apparent validation metrics can overstate real-world performance.
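
The selection mechanism described above can be sketched with two predictors that are independent in the full population but both raise the chance of entering a registry. Everything in this block is hypothetical: the scores, the entry threshold, the seed, and the variable names are illustrative assumptions, not registry data.

```r
# Hypothetical illustration of selection on a collider (registry entry)
set.seed(11)   # arbitrary seed
n <- 50000     # assumed population size

mechanism_score <- rnorm(n)   # stand-in for mechanism-of-injury severity
shock_score     <- rnorm(n)   # stand-in for shock severity, independent of mechanism

# Registry entry depends on both predictors plus noise
in_registry <- (mechanism_score + shock_score + rnorm(n)) > 1.5

cor_population <- cor(mechanism_score, shock_score)
cor_registry   <- cor(mechanism_score[in_registry], shock_score[in_registry])

round(c(population = cor_population, registry_only = cor_registry), 3)
```

In the full population the correlation should be near zero, while among registry entrants it should be clearly negative. A model trained only on the selected rows would learn that induced relationship.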

Closing: Confounding Adjustment Is the Price of Interpretability in RWE

Real-world evidence is attractive because it reflects actual care and actual populations. But that realism comes with imbalance, selection, and bias.

Confounding adjustment is one of the main tools that makes observational evidence more interpretable.

DAGs help identify what should be adjusted for. Regression and standardization help construct fairer comparisons. E-values help communicate how vulnerable the result may still be to hidden confounding.

Confounding adjustment matters because in real-world data, the observed association often reflects who got treated as much as what treatment did.

📚 Go Deeper: Causal Inference Toolkit

This post is part of the Causal Inference Toolkit — a companion reference with DAG-based adjustment set templates, standardization code, E-value calculations, and collider/mediator caution guides for real-world evidence analyses.

Series Callout

This post is part of a broader Advanced Topics in Applied Statistics for AI and Clinical Decision-Making Series:

Missing data methods
Imputation techniques
Sensitivity analysis for missing data
Causal inference methods
Propensity score methods
Instrumental variables
Confounding and bias adjustment in RWE
Target trial emulation
Meta-analysis and evidence synthesis
External validity and generalizability in RWE

References

Greenland, Sander, Judea Pearl, and James M. Robins. 1999. “Causal Diagrams for Epidemiologic Research.” Epidemiology 10 (1): 37–48. https://doi.org/10.1097/00001648-199901000-00008.

Hernán, Miguel A., Sonia Hernández-Díaz, Martha M. Werler, and Allen A. Mitchell. 2002. “Causal Knowledge as a Prerequisite for Confounding Evaluation: An Application to Birth Defects Epidemiology.” American Journal of Epidemiology 155 (2): 176–84. https://doi.org/10.1093/aje/155.2.176.

VanderWeele, Tyler J., and Peng Ding. 2017. “Sensitivity Analysis in Observational Research: Introducing the e-Value.” Annals of Internal Medicine 167 (4): 268–74. https://doi.org/10.7326/M16-2607.