---
title: "Design of Experiments — Master Speaker Notes"
subtitle: "Instructor Teaching Guide · 4-Lecture Series"
author: "Jonathan D. Stallings, PhD, MS"
date: "Summer 2026"
format:
  html:
    toc: true
    toc-depth: 3
    toc-title: "Lecture Navigator"
    number-sections: true
    theme: cosmo
---

> **How to use this guide.** Instructor-facing notes for the 4-lecture Design of Experiments series. Audience: clinicians and analysts who design or evaluate clinical research, including registry-based studies and pragmatic trials.

---

# Lecture 1 — Study Design Foundations

**Posts covered:** 01 (RCT Design), 02 (Observational Designs), 03 (Cross-Sectional Design)

## Teaching strategy

The central message of this lecture is that design determines what questions you can answer — before a single data point is collected. Clinicians who read literature and order tests are implicitly using study design logic every day. This lecture makes that logic explicit.

Start with the design hierarchy: RCT at the top, observational at the bottom. Then immediately complicate it: "Does the design hierarchy always map to the quality-of-evidence hierarchy? Not when the RCT has a surrogate outcome, poor external validity, or an industry funder with a financial interest in the result." Design and execution quality are both necessary.

## Key talking points

**Slide: Randomization — Why It Works**
Randomization doesn't guarantee the groups are equal. It guarantees that the groups are equal *in expectation* — and that any remaining imbalance is due to chance, not systematic differences. This is why we can make causal claims from RCTs without propensity models.

**Slide: The Randomization Simulation**
Watch the simulation carefully with the audience: how quickly do the groups balance? With n=20, some trials will have meaningful imbalance. With n=300, imbalance is rare and small. This directly motivates stratified randomization and minimization for smaller trials.
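
If the slide simulation is not available, a minimal stand-in can be sketched as follows. It assumes a single standard-normal baseline covariate and simple 1:1 randomization; the function name `imbalance_rate` and the 0.5 SD threshold for "meaningful imbalance" are illustrative choices, not part of the series materials.

```python
import random
import statistics

def imbalance_rate(n, threshold=0.5, n_trials=2000, seed=1):
    """Share of simulated trials whose two arms differ by more than
    `threshold` SDs on a standard-normal baseline covariate."""
    rng = random.Random(seed)
    flagged = 0
    for _ in range(n_trials):
        covariate = [rng.gauss(0, 1) for _ in range(n)]
        # Simple 1:1 randomization: shuffle patients, then split in half.
        rng.shuffle(covariate)
        arm_a, arm_b = covariate[: n // 2], covariate[n // 2:]
        diff = abs(statistics.fmean(arm_a) - statistics.fmean(arm_b))
        if diff > threshold:
            flagged += 1
    return flagged / n_trials

# Compare a small trial with a moderately large one.
for n in (20, 300):
    print(n, imbalance_rate(n))
```

Ask the audience to predict both rates before running it; the gap between n=20 and n=300 is the argument for stratification or minimization in small trials.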

**Slide: Observational Study Design Spectrum**
Map each design to a clinical question type:
- RCT: Does this treatment cause this outcome?
- Cohort: What happens to patients with this exposure over time?
- Case-control: What were patients with this rare outcome exposed to?
- Cross-sectional: What is the current prevalence of this condition?

**Slide: Cross-Sectional Temporal Limitation**
Cross-sectional studies cannot establish temporal order — which is required for causal inference. "We found X and Y are correlated in this cross-sectional sample" is not "X causes Y." This is one of the most common mistakes in clinical literature interpretation.

## Timing
- RCT design: 20 min
- Observational design spectrum: 20 min
- Cross-sectional design and CPG compliance example: 15 min

## Discussion prompt
"A cross-sectional survey finds that surgeons with more than 20 years of experience have lower complication rates. What three alternative explanations (other than 'experience reduces complications') could explain this finding? Which study design would you need to distinguish between them?"

---

# Lecture 2 — Longitudinal Design, Power & Randomization

**Posts covered:** 04 (Longitudinal Design), 05 (Attrition and MNAR), 06 (Power and Sample Size)

## Teaching strategy

Longitudinal data is the gold standard for many clinical questions — it tracks the same patients over time, allowing you to observe trajectories and control for stable patient-level characteristics. The challenge is attrition: patients drop out, and the dropout is almost never random.

Power calculation is the part of research planning that most clinicians encounter but rarely understand deeply. The goal of this lecture is to demystify it: a power calculation is a model, and like all models, its output depends critically on its inputs. Most power calculations are optimistic because the assumed effect size is optimistic.

## Key talking points

**Slide: Longitudinal Trajectory Plots**
Show the spaghetti plots. Each line is a patient's trajectory over time. The key question: does the treatment group's trajectory differ from the control group's? Linear mixed-effects (LME) models, and the more restrictive repeated-measures ANOVA, partition the variance into two parts: within-patient variance (how much do a patient's values change over time?) and between-patient variance (how different are patients from each other?).

**Slide: MNAR Attrition**
The attrition simulation demonstrates the most dangerous pattern: patients who deteriorate are more likely to drop out. The observed data (among completers) show a better trajectory than the true trajectory. This is survivorship bias in longitudinal form. Intention-to-treat analysis is the principled starting point: analyze all randomized patients regardless of dropout, using imputation for missing outcomes. Because standard imputation assumes outcomes are missing at random, add sensitivity analyses that allow dropouts to do worse than comparable completers.
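
The completer bias can be reproduced in a few lines. This is a sketch under stated assumptions, not the series simulation: a single final outcome score (higher is better, mean 50, SD 10) and hypothetical dropout probabilities of 0.6 for patients scoring below 45 and 0.1 otherwise.

```python
import random
import statistics

def simulate_mnar(n=5000, seed=2):
    """Return (true mean, completer mean) when dropout depends on the outcome."""
    rng = random.Random(seed)
    all_scores, completer_scores = [], []
    for _ in range(n):
        score = rng.gauss(50, 10)  # final outcome; higher = better
        # Dropout probability depends on the outcome itself: this is MNAR.
        p_drop = 0.6 if score < 45 else 0.1
        all_scores.append(score)
        if rng.random() > p_drop:
            completer_scores.append(score)
    return statistics.fmean(all_scores), statistics.fmean(completer_scores)

true_mean, observed_mean = simulate_mnar()
print(round(true_mean, 1), round(observed_mean, 1))
```

The completer mean sits above the true mean even though no patient's outcome changed; only who remained observable changed.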

**Slide: Power Curves — Effect Size Sensitivity**
The most important slide in this section. Show the power curve across a range of effect sizes. The question to ask: "How confident are you in your assumed effect size?" Most registry-based estimates of expected effects are from observational studies — and observational estimates are typically larger than RCT estimates due to confounding. Build in conservatism.

**Slide: The Power Sensitivity Surface**
The 3D surface (effect size × standard deviation × required n) shows that power calculations are jointly sensitive to multiple assumptions. A study "powered for 80%" based on optimistic effect size and favorable SD assumptions may actually have 40–50% power when deployed.
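
One slice of that surface can be checked by hand. The sketch below uses the normal-approximation formula for a two-sided, two-sample comparison of means; the planning values (difference of 5, SD of 10) and the "reality" values (difference of 4, SD of 12) are hypothetical numbers chosen for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()

def n_per_arm(delta, sd, alpha=0.05, power=0.80):
    """Normal-approximation sample size per arm, two-sided test of two means."""
    z_a = Z.inv_cdf(1 - alpha / 2)
    z_b = Z.inv_cdf(power)
    return ceil(2 * ((z_a + z_b) * sd / delta) ** 2)

def achieved_power(n, delta, sd, alpha=0.05):
    """Approximate power with n per arm when the true values are delta and sd."""
    z_a = Z.inv_cdf(1 - alpha / 2)
    return Z.cdf(abs(delta) / (sd * sqrt(2 / n)) - z_a)

planned_n = n_per_arm(delta=5, sd=10)  # hypothetical planning assumptions
print(planned_n)
print(round(achieved_power(planned_n, delta=5, sd=10), 2))  # as planned
print(round(achieved_power(planned_n, delta=4, sd=12), 2))  # modestly worse reality
```

A modest miss on both inputs at once is enough to drop a nominally 80%-powered study into the 40–50% range described above.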

## Timing
- Longitudinal design and trajectories: 20 min
- Attrition and MNAR: 15 min
- Power and sample size: 20 min
- Power sensitivity demonstration: 5 min

## Common questions
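- *"What alpha should I use?"* 0.05 two-sided is conventional (equivalent to 0.025 one-sided). With multiple primary outcomes, split alpha across them, for example Bonferroni 0.05/k, which gives 0.025 for two outcomes. Pre-specify and justify.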
- *"What power should I target?"* 80% is conventional but arbitrary. For high-stakes clinical decisions, 90% is often appropriate. Higher power = larger sample.
- *"Can I calculate power after the study?"* Post-hoc ("observed") power is statistically problematic: it is a direct function of the p-value, so it adds no new information and the reasoning is circular. Report the effect size and CI instead.

---

# Lecture 3 — Trial Integrity: Blinding, Placebo, and Group Sequential Design

**Posts covered:** 07 (Placebo and Blinding), 08 (Trial Integrity), 09 (Pragmatic Trial Design)

## Teaching strategy

Trial integrity is about protecting the RCT's causal validity. The design can be perfect, but if execution is sloppy — unblinded assessors, protocol deviations, outcome measurement inconsistency — the causal inference collapses.

The expectation bias simulation is the lecture's most memorable moment: show how unblinded raters systematically inflate treatment effects. The effect can be large enough to flip a null finding to positive. This is not hypothetical — expectation bias has been documented in clinical trials of surgical procedures, behavioral interventions, and open-label drug studies.

## Key talking points

**Slide: Expectation Bias — The Simulation**
Show the simulation to the audience before explaining what it models. Ask them to explain the pattern they see. Then reveal: the bias comes from the assessor knowing the assignment. The lesson: blinding the assessor protects the measurement, even when you can't blind the patient or the clinician.
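
A bare-bones version of the mechanism, for instructors who want to show it live. It assumes a true null effect and models the unblinded assessor as adding a fixed upward shift (`rater_bias`, a hypothetical 0.3 SD) to every treated patient's score; real expectation bias is unlikely to be this tidy.

```python
import random
import statistics

def rated_effect(n_per_arm=200, true_effect=0.0, rater_bias=0.0, seed=3):
    """Estimated treatment effect (difference in mean rated scores)."""
    rng = random.Random(seed)
    control = [rng.gauss(0, 1) for _ in range(n_per_arm)]
    # An unblinded assessor shifts scores upward only for patients
    # known to be in the treatment arm.
    treated = [rng.gauss(true_effect, 1) + rater_bias for _ in range(n_per_arm)]
    return statistics.fmean(treated) - statistics.fmean(control)

blinded = rated_effect(rater_bias=0.0)
unblinded = rated_effect(rater_bias=0.3)
print(round(blinded, 2), round(unblinded, 2))
```

Because both calls use the same seed, the patients are identical and the entire difference between the two estimates is the assessor's shift.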

**Slide: Group-Sequential Design — O'Brien-Fleming Boundaries**
The group-sequential design addresses the multiple comparison problem in adaptive trials. Each interim analysis is a test; doing multiple tests inflates the Type I error rate unless you correct for it. O'Brien-Fleming boundaries are conservative early (requiring very strong evidence to stop early for benefit) and become more permissive late. This is intentional: early stopping for benefit tends to produce inflated effect estimates.
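
The Type I error inflation, and the correction, can be demonstrated by simulating trials under the null. The sketch assumes five equally spaced looks and a two-sided overall alpha of 0.05; the O'Brien-Fleming constant 2.040 for that configuration is taken from standard published tables and should be treated as an assumption to verify against your design software.

```python
import random
from math import sqrt

def type_one_error(boundaries, n_trials=20000, seed=4):
    """Share of null trials that cross a two-sided boundary at any look."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        cumulative = 0.0
        for look, bound in enumerate(boundaries, start=1):
            # Each look adds one equal-sized block of data; under the null
            # the standardized block sum is N(0, 1).
            cumulative += rng.gauss(0, 1)
            z = cumulative / sqrt(look)
            if abs(z) > bound:
                rejections += 1
                break
    return rejections / n_trials

K = 5
naive = [1.96] * K                                    # test at 0.05 every look
obf = [2.040 * sqrt(K / k) for k in range(1, K + 1)]  # assumed tabled constant

naive_rate = type_one_error(naive)
obf_rate = type_one_error(obf)
print([round(b, 2) for b in obf])
print(naive_rate, obf_rate)
```

Print the boundary values for the audience: the first look demands an extreme z-statistic, while the final look is only slightly stricter than 1.96.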

**Slide: Cluster Randomization — DEFF and Required n**
In cluster trials, patients within a cluster are correlated — they share a provider, a facility, a protocol. The design effect (DEFF = 1 + (m-1) × ICC) shows how much your required sample size must increase to account for this correlation. With ICC = 0.10 and cluster size m = 20, DEFF = 2.9. You need nearly 3x as many patients as an individually randomized trial.
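
The arithmetic is worth showing explicitly. In this sketch the individually randomized requirement of 400 patients is a hypothetical starting point; the function names are illustrative.

```python
from math import ceil

def design_effect(cluster_size, icc):
    """DEFF = 1 + (m - 1) * ICC for equal cluster sizes."""
    return 1 + (cluster_size - 1) * icc

def cluster_trial_n(n_individual, cluster_size, icc):
    """Total patients needed once clustering is accounted for."""
    inflated = n_individual * design_effect(cluster_size, icc)
    # Round away floating-point noise before taking the ceiling.
    return ceil(round(inflated, 6))

deff = design_effect(cluster_size=20, icc=0.10)
total_n = cluster_trial_n(n_individual=400, cluster_size=20, icc=0.10)
print(round(deff, 2), total_n, ceil(total_n / 20))  # DEFF, patients, clusters
```

Rerun it with ICC = 0.01 and ICC = 0.05 to show how strongly the required n depends on an input that is rarely known precisely at the planning stage.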

**Slide: Pragmatic vs. Explanatory Trials**
The PRECIS-2 wheel maps trial design on a continuum from explanatory (tight controls, restricted population, ideal conditions) to pragmatic (broad eligibility, routine care conditions, real-world delivery). Most clinical practice questions require pragmatic evidence — but most landmark trials produce explanatory evidence. This gap is why RCT results often don't replicate in practice.

## Timing
- Blinding and expectation bias: 20 min
- Group sequential design: 15 min
- Cluster trials and ICC: 15 min
- Pragmatic design: 10 min

---

# Lecture 4 — Quasi-Experimental Designs & Evidence Synthesis

**Posts covered:** 10 (Quasi-Experimental), Synthesis

## Teaching strategy

Quasi-experimental designs are the tools for when randomization is impossible — policy changes, natural experiments, before-after comparisons. Each design exploits a different type of variation that is "as good as random" under specific assumptions.

The synthesis at the end of this lecture is the payoff for the entire series: the design space scatter plot that maps all design types across the dimensions of confounding control and feasibility. The question for every research problem: where in this space does your question live, and what's the strongest design you can execute?

## Key talking points

**Slide: Interrupted Time Series**
ITS is the workhorse of policy evaluation: did this intervention change the level or slope of the outcome time series? The key threat to validity is confounding by concurrent events — something else changed at the same time. A comparison group that doesn't receive the intervention (but shares concurrent events) is the solution.
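
Level and slope change can be made concrete with a small sketch. It fits separate least-squares lines before and after the intervention, which gives the same fitted values as the usual segmented regression; the data are hypothetical and noise-free, and the sketch ignores autocorrelation and seasonality, which a real ITS analysis must address.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one segment."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def its_effects(times, outcomes, t0):
    """Level change at t0 and slope change after t0."""
    pre = [(t, y) for t, y in zip(times, outcomes) if t < t0]
    post = [(t, y) for t, y in zip(times, outcomes) if t >= t0]
    pre_slope, pre_int = fit_line(*zip(*pre))
    post_slope, post_int = fit_line(*zip(*post))
    # Compare the post-period line with the extrapolated pre-period line at t0.
    level_change = (post_int + post_slope * t0) - (pre_int + pre_slope * t0)
    return level_change, post_slope - pre_slope

# Hypothetical monthly series: intervention at month 12 drops the level by 3
# and steepens the slope by 0.2.
t0 = 12
times = list(range(24))
outcomes = [10 + 0.5 * t + (-3 + 0.2 * (t - t0) if t >= t0 else 0) for t in times]
level_change, slope_change = its_effects(times, outcomes, t0)
print(round(level_change, 2), round(slope_change, 2))
```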

**Slide: Difference-in-Differences**
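DiD assumes parallel trends: in the absence of the intervention, treated and control groups would have followed parallel trajectories. Test this visually in the pre-intervention period. If the pre-intervention trends diverge, the parallel trends assumption is likely violated. Parallel pre-trends support the assumption but cannot prove it, because the assumption concerns the unobserved post-intervention counterfactual.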
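
The estimator itself is a single subtraction, which is worth showing before any regression output. All numbers below are hypothetical proportions invented for illustration; the function names are not from any library.

```python
def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """2x2 difference-in-differences on group means."""
    return (treated_post - treated_pre) - (control_post - control_pre)

def pre_period_gaps(treated_series, control_series):
    """Treated-minus-control gap in each pre-intervention period.
    Roughly constant gaps are consistent with parallel pre-trends."""
    return [t - c for t, c in zip(treated_series, control_series)]

# Hypothetical outcome proportions before and after the intervention.
effect = did_estimate(treated_pre=0.40, treated_post=0.65,
                      control_pre=0.42, control_post=0.50)
gaps = pre_period_gaps([0.36, 0.38, 0.40], [0.38, 0.40, 0.42])
print(round(effect, 2), [round(g, 2) for g in gaps])
```

The control group's change stands in for what would have happened to the treated group anyway; the estimate is only as credible as that substitution.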

**Slide: Regression Discontinuity**
The most credible quasi-experimental design when it applies: units just above and below a threshold are comparable, and treatment assignment is determined by the threshold. The estimate is a local average treatment effect at the threshold, so it may not generalize to units far from it. Key threat: manipulation of the running variable near the threshold.

**Slide: The Design Space**
When selecting a design, ask: What is the strongest design this question can support? What are the feasibility constraints? What assumptions are required, and how plausible are they? Every design has a threat profile — the question is whether those threats can be argued against convincingly in your specific context.

## Timing
- ITS and DiD: 25 min
- Regression discontinuity: 15 min
- Design synthesis and space: 15 min
- Q&A: 5 min

## Series-Level Discussion Questions

1. You want to evaluate the effect of a new TCCC protocol on limb salvage. You cannot randomize individual patients. Rank the following designs and justify: retrospective cohort, ITS with a comparison site, a cluster-randomized trial at Role 2 facilities.

2. A power calculation for a trial of resuscitation timing requires assuming 15% mortality in the control arm and an 8% absolute reduction in the treatment arm. Where did these numbers come from, and how sensitive is the required n to each assumption?

3. A pragmatic trial enrolled patients "as treated" without blinding. Outcomes were assessed by the treating team. What types of bias are introduced? How would you quantify the likely direction and magnitude?

4. A difference-in-differences analysis compares tourniquet use at bases that received TCCC training vs. those that did not. The pre-intervention period spans 2 years; the post-intervention period spans 1 year. What evidence would you need to believe the parallel trends assumption holds?

5. A cross-sectional study finds that 23% of combat casualties meet CPG criteria for early resuscitation. Can you calculate an incidence rate from this number? What additional information would you need?
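
For question 2, instructors may want the arithmetic on hand. The sketch below uses the unpooled normal-approximation formula for two proportions with two-sided alpha of 0.05 and 80% power; those two settings and the alternative reductions of 6% and 4% are assumptions added here, not part of the question.

```python
from math import ceil
from statistics import NormalDist

def n_per_arm_proportions(p_control, p_treated, alpha=0.05, power=0.80):
    """Normal-approximation sample size per arm for comparing two proportions."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return ceil(z_sum ** 2 * variance / (p_control - p_treated) ** 2)

# Control-arm mortality fixed at 15%; vary the assumed absolute reduction.
for reduction in (0.08, 0.06, 0.04):
    print(reduction, n_per_arm_proportions(0.15, 0.15 - reduction))
```

The required n grows much faster than the assumed reduction shrinks, which is the point to draw out when discussing where the 8% figure came from.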