Study Design Foundations: RCTs, Observational & Cross-Sectional

Design of Experiments — Lecture 1 of 4

Jonathan D. Stallings, PhD, MS

Data InDeed | dataindeed.org

2026-01-01

Analysis cannot rescue a bad design. Which questions a study can answer is settled before the first data point is collected.

What You’ll Learn Today

Post 01 Randomized Controlled Trials

  • Why randomization works
  • Parallel vs. crossover designs
  • Allocation ratios
  • Estimating the treatment effect

Post 02 Observational Study Design

  • Cohort vs. case-control
  • Prospective vs. retrospective
  • Crude vs. adjusted comparisons
  • Bias and confounding by design

Post 03 Cross-Sectional Design

  • Snapshot vs. movie
  • Prevalence estimation
  • Association ≠ causation
  • When cross-sectional is right

Part 1

Randomized Controlled Trials

The design that breaks the assignment–outcome link

Why Randomization Works

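library(tidyverse)   # tibble(), bind_rows(), mutate(), ggplot()
set.seed(1)          # arbitrary seed so the draws are reproducible
n <- 400
# Unmeasured confounder U (e.g., severity) affects both outcome and treatment choice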
U <- rnorm(n)

# Observational: sicker patients more likely to get treatment
trt_obs  <- rbinom(n, 1, plogis(-0.5 + 1.8*U))
y_obs    <- 5 - 2*trt_obs + 2*U + rnorm(n, 0, 1.5)

# RCT: treatment fully independent of U
trt_rct  <- rbinom(n, 1, 0.5)
y_rct    <- 5 - 2*trt_rct + 2*U + rnorm(n, 0, 1.5)

bind_rows(
  tibble(Design="Observational", Effect=coef(lm(y_obs ~ trt_obs))["trt_obs"]),
  tibble(Design="RCT",           Effect=coef(lm(y_rct ~ trt_rct))["trt_rct"]),
  tibble(Design="Truth",         Effect=-2)
) |> mutate(Design=factor(Design, levels=c("Observational","RCT","Truth"))) |>
  ggplot(aes(Design, Effect, fill=Design)) +
  geom_col(width=0.5, alpha=0.85) +
  geom_hline(yintercept=-2, linetype=2, color="#e63946") +
  scale_fill_manual(values=c("#e63946","#0891b2","#253554")) +
  labs(title="Randomization recovers the true effect; observational data is biased by U",
       y="Estimated treatment effect") +
  theme_di() + theme(legend.position="none")   # theme_di() is a custom theme not defined in this excerpt

Under 1:1 randomization, \(P(A=1 \mid U) = 0.5\) regardless of U, where A is the treatment indicator. The groups become comparable on everything, measured and unmeasured, in expectation.
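A quick balance check makes this concrete. The sketch below uses its own seed and sample size (both illustrative assumptions, not the settings behind the figure) and compares the mean of U between arms under severity-driven assignment and under a coin flip.

```r
# Illustrative balance check; seed and n are arbitrary choices for this sketch.
set.seed(2026)
n <- 10000
U <- rnorm(n)                                    # unmeasured confounder

trt_obs <- rbinom(n, 1, plogis(-0.5 + 1.8 * U))  # assignment depends on U
trt_rct <- rbinom(n, 1, 0.5)                     # assignment ignores U

# Difference in mean U between treated and untreated patients
balance_gap <- function(trt, u) mean(u[trt == 1]) - mean(u[trt == 0])

gap_obs <- balance_gap(trt_obs, U)
gap_rct <- balance_gap(trt_rct, U)
print(c(observational = gap_obs, rct = gap_rct))
```

Because the outcome model adds 2*U, any imbalance in U feeds directly into the naive treatment comparison. Under randomization the gap shrinks toward zero as n grows.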

RCT Designs: Parallel vs. Crossover

Parallel-group (standard)

  • Each patient receives one treatment
  • Groups compared between patients
  • Simple, unambiguous causal target
  • Requires larger n

Best for:

  • Treatments with lasting effects
  • Surgical interventions
  • When carryover is a concern

Crossover

  • Each patient receives both treatments (sequenced)
  • Patient is their own control
  • Fewer patients needed
  • Washout period required

Best for:

  • Chronic stable conditions
  • Pharmacokinetic studies
  • When carryover can be designed away

Trauma caveat: crossover is almost never appropriate in trauma research because injuries are acute, interventions are not reversible, and conditions are not stable.
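The efficiency argument for crossover designs can be sketched with a small simulation. The setup is an assumption made for illustration only: a stable between-patient effect, no carryover, no period effect, and a hypothetical true effect of -2.

```r
# Sketch: same number of patients, parallel vs. crossover analysis.
# Assumes no carryover or period effect; all parameter values are illustrative.
set.seed(11)
n_pat   <- 40
patient <- rnorm(n_pat, 0, 3)          # stable between-patient variation
effect  <- -2                          # hypothetical true treatment effect

# Parallel: each patient contributes one measurement under one treatment
arm    <- rep(c(1, 0), each = n_pat / 2)
y_par  <- 10 + patient + effect * arm + rnorm(n_pat, 0, 1)
se_par <- summary(lm(y_par ~ arm))$coefficients["arm", "Std. Error"]

# Crossover: each patient is measured under both treatments
y_ctrl   <- 10 + patient + rnorm(n_pat, 0, 1)
y_trt    <- 10 + patient + effect + rnorm(n_pat, 0, 1)
d        <- y_trt - y_ctrl             # patient effect cancels in the difference
se_cross <- sd(d) / sqrt(n_pat)

print(c(parallel_se = se_par, crossover_se = se_cross))
```

The within-patient difference removes between-patient variation, which is why the crossover standard error comes out smaller here. That advantage disappears if carryover or an unstable condition violates the assumptions.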

Simulating an RCT: Allocation and Estimation

n <- 300
df_rct <- tibble(
  id        = 1:n,
  treated   = rep(c(1,0), each=n/2)[sample(n)],   # 1:1 allocation
  baseline  = rnorm(n, 28, 10),                     # ISS at enrollment
  outcome   = 12 - 2.5*treated + 0.2*baseline + rnorm(n, 0, 3)
)

# Unadjusted estimate (valid because randomized)
fit_unadj <- lm(outcome ~ treated, data=df_rct)
# Covariate-adjusted (more precise, not more valid)
fit_adj   <- lm(outcome ~ treated + baseline, data=df_rct)

broom::tidy(fit_unadj, conf.int=TRUE) |>
  bind_rows(broom::tidy(fit_adj, conf.int=TRUE)) |>
  filter(term=="treated") |>
  mutate(Model = c("Unadjusted", "Adjusted for baseline ISS"),
         across(where(is.numeric), \(x) round(x, 3))) |>   # lambda form; passing args via ... is deprecated
  select(Model, estimate, std.error, conf.low, conf.high)
# A tibble: 2 × 5
  Model                     estimate std.error conf.low conf.high
  <chr>                        <dbl>     <dbl>    <dbl>     <dbl>
1 Unadjusted                   -1.85     0.425    -2.68     -1.01
2 Adjusted for baseline ISS    -2.31     0.343    -2.98     -1.63

In an RCT, adjustment is optional — it improves precision, not validity. In an observational study, adjustment is essential — and may still be insufficient.
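On allocation ratios: for a fixed total sample size and equal outcome variance in both arms, the variance of a difference in means is proportional to 1/n1 + 1/n0, which is smallest at 1:1. The helper below (`rel_efficiency`, a name introduced here for illustration) expresses how much precision a k:1 ratio gives up relative to the 1:1 allocation used in the simulation.

```r
# Relative efficiency of k:1 allocation versus 1:1, holding total n fixed.
# Assumes equal outcome variance in both arms; the function name is hypothetical.
rel_efficiency <- function(k) {
  # Var(diff in means) is proportional to 1/n1 + 1/n0 = (k + 1)^2 / (N * k),
  # which equals 4 / N at k = 1, so the ratio is 4k / (k + 1)^2
  4 * k / (k + 1)^2
}

ratios <- c(1, 2, 3, 4)
eff <- data.frame(allocation = paste0(ratios, ":1"),
                  efficiency = rel_efficiency(ratios))
print(eff)
```

Under these assumptions a 2:1 ratio keeps about 89% of the efficiency of 1:1 and a 3:1 ratio keeps 75%, so unequal allocation costs precision that has to be bought back with a larger total n.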

Part 2

Observational Study Design

Structure and discipline when randomization is unavailable

The Observational Design Spectrum

The best design is the one that can actually answer the question — within real-world constraints of time, cost, and ethics.

Cohort vs. Case-Control Logic

Cohort study

  • Start with exposure status
  • Follow forward → observe outcome
  • Estimates: RR, incidence rates, hazard ratios
  • Efficient when outcome is common
  • Prospective: stronger control over measurement; retrospective: faster, using existing records

Exposed    →  Outcome?
Unexposed  →  Outcome?

Case-control study

  • Start with outcome status (cases vs. controls)
  • Look backward → exposure history
  • Estimates: OR (approximates RR when the outcome is rare)
  • Efficient when outcome is rare
  • Control selection is the critical design decision

Cases    →  Were they exposed?
Controls →  Were they exposed?

DoDTR (Department of Defense Trauma Registry) application: Studying rare complications (e.g., post-traumatic pulmonary embolism) calls for a case-control design. Studying treatment trajectories for all penetrating abdominal trauma calls for a retrospective cohort.
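The rare-outcome approximation noted for case-control studies can be checked with simple arithmetic. The risks below are made-up values chosen only to contrast a rare and a common outcome that share the same risk ratio.

```r
# Compare OR and RR for a given pair of risks (illustrative values only).
or_vs_rr <- function(risk_exposed, risk_unexposed) {
  rr <- risk_exposed / risk_unexposed
  or <- (risk_exposed / (1 - risk_exposed)) /
        (risk_unexposed / (1 - risk_unexposed))
  c(RR = rr, OR = or)
}

rare   <- or_vs_rr(0.02, 0.01)   # rare outcome: risks of 2% vs. 1%
common <- or_vs_rr(0.60, 0.30)   # common outcome: risks of 60% vs. 30%
print(round(rbind(rare, common), 2))
```

Both rows have RR = 2, but the OR is about 2.02 when the outcome is rare and 3.5 when it is common.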

The Bias That Design Controls

n <- 500
df_cohort <- tibble(
  exposed   = rbinom(n, 1, 0.45),
  confounder = rnorm(n),   # e.g., injury severity (drawn independently of exposed in this simulation)
  outcome   = rbinom(n, 1, plogis(-2 + 0.8*exposed + 1.2*confounder))
)

# Crude association
crude <- coef(glm(outcome ~ exposed, family=binomial, data=df_cohort))["exposed"]
# Adjusted
adj   <- coef(glm(outcome ~ exposed + confounder, family=binomial, data=df_cohort))["exposed"]

tibble(
  Estimator = c("Crude OR (exp)", "Adjusted OR (exp)", "True OR (exp)"),
  Value = round(c(exp(crude), exp(adj), exp(0.8)), 3)   # exp(0.8) is the true conditional OR
)
# A tibble: 3 × 2
  Estimator         Value
  <chr>             <dbl>
1 Crude OR (exp)     2.52
2 Adjusted OR (exp)  3.09
3 True OR (exp)      2.23

Design controls selection bias (who enters the study) and can limit confounding through restriction or matching. Analysis adjusts for measured confounders (what else explains the association). Both must be addressed, because design errors cannot be fixed in analysis.
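To see confounding in the strict sense, the exposure itself has to depend on the confounder. In the cohort simulation above, `exposed` is drawn independently of `confounder`; the sketch below adds that dependence. The 0.8 coefficient linking confounder to exposure, the seed, and the larger n are assumptions made for this illustration.

```r
# Sketch: exposure depends on the confounder, so the crude estimate is confounded.
# Seed, n, and the 0.8 exposure coefficient are illustrative assumptions.
set.seed(42)
n <- 20000
confounder <- rnorm(n)                                  # e.g., injury severity
exposed    <- rbinom(n, 1, plogis(0.8 * confounder))    # sicker -> more often exposed
outcome    <- rbinom(n, 1, plogis(-2 + 0.8 * exposed + 1.2 * confounder))

crude <- coef(glm(outcome ~ exposed, family = binomial))["exposed"]
adj   <- coef(glm(outcome ~ exposed + confounder, family = binomial))["exposed"]

# Log-odds ratios; the conditional truth used to generate the data is 0.8
print(round(c(crude = unname(crude), adjusted = unname(adj), truth = 0.8), 2))
```

The crude log-odds ratio should land well above 0.8 because exposed patients are also sicker, while the adjusted estimate should sit close to the conditional truth. This only works because the confounder was measured.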

Part 3

Cross-Sectional Design

Fast, practical — and easy to misinterpret

What a Cross-Sectional Study Can and Cannot Tell You

# Prevalence estimation from a cross-sectional survey
n <- 600
df_cs <- tibble(
  age_group = sample(c("18-30","31-45","46-60","61+"), n, replace=TRUE,   # "61+" avoids overlap at age 60
                     prob=c(0.3, 0.3, 0.25, 0.15)),
  site      = sample(c("Role 2","Role 3","Role 4"), n, replace=TRUE),
  cpg_compliant = rbinom(n, 1,
    ifelse(age_group=="18-30", 0.82,
    ifelse(age_group=="31-45", 0.76,
    ifelse(age_group=="46-60", 0.70, 0.65))))
)

df_cs |>
  group_by(age_group, site) |>
  summarise(prevalence = mean(cpg_compliant), n=n(), .groups="drop") |>
  ggplot(aes(age_group, prevalence, fill=site)) +
  geom_col(position="dodge", alpha=0.85) +
  geom_hline(yintercept=0.80, linetype=2, color="#e63946") +
  scale_fill_manual(values=c("#2563eb","#0891b2","#8b5cf6")) +
  scale_y_continuous(labels=scales::percent_format()) +
  labs(title="CPG compliance prevalence by age group and care level — cross-sectional snapshot",
       x="Age group", y="Compliance rate", fill="Care level") +
  theme_di()

What this tells us: Prevalence at a point in time, stratified by observable characteristics.

What this cannot tell us: Whether compliance caused outcomes, or which direction the association runs.
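A prevalence estimate is more useful with an interval around it. As a minimal sketch in base R, with hypothetical counts (492 compliant records out of 600, not taken from the simulation above), an exact binomial interval can be obtained as follows.

```r
# Point prevalence with an exact (Clopper-Pearson) 95% confidence interval.
# The counts are hypothetical and serve only to illustrate the calculation.
n_surveyed  <- 600
n_compliant <- 492

prev <- n_compliant / n_surveyed
ci   <- binom.test(n_compliant, n_surveyed)$conf.int

cat(sprintf("Prevalence: %.1f%% (95%% CI %.1f%% to %.1f%%)\n",
            100 * prev, 100 * ci[1], 100 * ci[2]))
```

The interval describes sampling uncertainty about the snapshot only. It says nothing about how compliance changes over time or what drives it.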

Reverse Causation — The Cross-Sectional Trap

The temporality problem:

Cross-sectional studies measure exposure and outcome simultaneously. We cannot determine which came first.

Classic example: A cross-sectional survey finds that hospitalized patients have higher medication use. Does medication cause hospitalization — or does hospitalization cause more medications to be prescribed?

In registry analytics: A cross-sectional look at documentation completeness and patient severity may show that high-severity patients have less complete records. This more plausibly reflects reduced documentation capacity under pressure than any effect of documentation on severity, but the snapshot alone cannot distinguish the two.
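A small simulation shows why a snapshot cannot settle direction. In the sketch below the variable names and effect sizes are invented for illustration: hospitalization drives medication count, with no effect in the other direction, yet a model fitted either way finds a strong association.

```r
# Sketch of reverse causation: hospitalization -> medication count, not the reverse.
# Variable names, seed, and effect sizes are illustrative assumptions.
set.seed(7)
n <- 2000
hospitalized <- rbinom(n, 1, 0.3)
medications  <- rpois(n, lambda = 2 + 3 * hospitalized)  # caused by hospitalization

# A snapshot analysis can be run in either direction, and both find an association
fit_med_to_hosp <- glm(hospitalized ~ medications, family = binomial)
fit_hosp_to_med <- lm(medications ~ hospitalized)

p_med_to_hosp <- summary(fit_med_to_hosp)$coefficients["medications", "Pr(>|z|)"]
p_hosp_to_med <- summary(fit_hosp_to_med)$coefficients["hospitalized", "Pr(>|t|)"]
print(c(med_to_hosp = p_med_to_hosp, hosp_to_med = p_hosp_to_med))
```

Neither p-value carries information about which variable came first. Only the data-generating code, which a real analyst never sees, reveals the direction.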

When cross-sectional is the right design:

  • Estimating prevalence of a condition
  • Screening and needs assessment
  • Baseline characterization before a longitudinal study
  • Rapid situational awareness (e.g., theater health surveys)
  • Hypothesis generation for causal studies

Lecture 1 — Key Takeaways

RCTs

  • Randomization breaks the confounder–treatment link
  • Parallel-group: standard; crossover: rare in acute settings
  • Covariate adjustment improves precision, not validity
  • Internal validity is highest — external validity requires thought

Observational Design

  • Cohort: starts with exposure, estimates RR
  • Case-control: starts with outcome, efficient for rare events
  • Confounding = design + analysis problem
  • Selection bias = design problem only

Cross-Sectional

  • Snapshot: estimates prevalence efficiently
  • Cannot establish temporality
  • Reverse causation is always a threat
  • Best for hypothesis generation, not causal confirmation

The meta-lesson: Design determines what questions can be answered. No amount of analytical sophistication converts a cross-sectional association into a causal estimate. Choose the design that matches the question.

Coming Up: Lecture 2

Longitudinal Design, Sample Size & Randomization Strategy

Posts 04, 05 & 06:

  • Longitudinal studies — trajectories, attrition, mixed-effects thinking
  • Sample size & power — effect sizes, simulation-based planning, sensitivity
  • Randomization & stratification — block randomization, minimization, balance diagnostics