Real-Life Testing: Pragmatic Trials for Practical AI

Design of Experiments

A practical introduction to pragmatic trials, explanatory versus effectiveness thinking, cluster randomization, and real-world evaluation of healthcare AI.

Published

May 1, 2026

Modified

June 9, 2026

Executive Summary

A trial can be internally rigorous and still tell us too little about what happens in routine care.

That is the core motivation for pragmatic trials.

Traditional explanatory trials are often designed to answer:

can this intervention work under ideal conditions?

Pragmatic trials shift the question toward:

does this intervention work in real practice, with real patients, real clinicians, and real systems?

That difference matters.

Pragmatic trials typically aim to:

embed research in ordinary care,
reduce unnecessary exclusions,
use clinically meaningful outcomes,
and reflect the messy implementation conditions that shape real-world effectiveness.

This explanatory-versus-pragmatic contrast is central to modern trial design frameworks and is commonly discussed alongside tools such as PRECIS-2 for making design intent explicit (Ford and Norrie 2016; Loudon et al. 2015).

This matters in both biostatistics and AI/ML.

In biostatistics, pragmatic trials help bridge the gap between efficacy and effectiveness. In AI/ML, they are especially important for testing whether a model, alerting system, workflow tool, or decision aid remains useful once deployed in operational care rather than in a curated development environment.

This post introduces:

what pragmatic trials are,
how they differ from explanatory trials,
why routine-care embedding matters,
when cluster randomization becomes useful,
and why pragmatic evaluation is essential for scalable healthcare AI.

Pragmatic trials matter because many interventions look promising in controlled settings, but only real-life testing shows whether they remain useful once the messiness of actual care is allowed back in.

1. Pragmatic Trials Begin with a Different Question

The key question in a pragmatic trial is not simply:

can this intervention work under tightly controlled conditions?

It is:

does this intervention improve outcomes when implemented in ordinary practice?

That shift sounds small, but it changes the entire design philosophy.

A pragmatic trial is usually less focused on maximizing internal validity at all costs and more focused on preserving the operational realities of routine care.

That means the design is often more tolerant of:

heterogeneous patients,
variable adherence,
clinician discretion,
site-to-site practice variation,
and real implementation constraints.

This is not weaker science. It is science answering a different question (Ford and Norrie 2016; Loudon et al. 2015).

3. Routine-Care Embedding Is One of the Main Design Features

A pragmatic trial often tries to embed the intervention directly into normal care delivery.

That may mean:

using existing EHR workflows,
enrolling broad patient populations,
minimizing extra study visits,
collecting outcomes from routine records,
and allowing normal clinicians rather than research specialists to deliver the intervention.

The idea is that the trial should resemble real implementation as closely as possible.

This is especially relevant in AI/ML because many digital or decision-support interventions interact directly with routine clinical systems.

If the study environment is too artificial, the result may not tell us much about real deployment.

4. Fewer Exclusions Usually Improve Real-World Relevance

A major feature of pragmatic trials is that they tend to minimize unnecessary exclusions.

Traditional explanatory trials may exclude participants because of:

comorbidity,
age,
polypharmacy,
anticipated poor adherence,
or operational complexity.

That can improve internal control, but it may also produce a trial population that looks very different from real practice.

Pragmatic trials often accept more heterogeneity because they are asking whether the intervention works in the population that would actually receive it.

This improves applicability, even if it sometimes increases noise.

5. A Broad Eligibility Example Makes the Pragmatic Logic Concrete

Suppose a healthcare system wants to evaluate an AI-guided medication adherence intervention.

An explanatory version might include only:

adults aged 40–65
with one condition
on one medication
at one academic site
with strong digital engagement

A pragmatic version might instead include:

all adults meeting a basic clinical eligibility rule
across multiple clinics
with ordinary variation in adherence, literacy, comorbidity, and follow-up patterns

The second design is messier. But it is often much closer to the real deployment population.

That is the pragmatic tradeoff.
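To make the contrast concrete, here is a minimal base-R sketch that applies both eligibility rules to the same candidate population. The population, the variable names, and every threshold are hypothetical, chosen only to mirror the example above; they are not taken from any real study.

```r
# Hypothetical candidate population; all variables and thresholds are illustrative.
set.seed(1)  # arbitrary seed so the sketch is repeatable
n <- 5000
pop <- data.frame(
  age          = round(runif(n, min = 18, max = 95)),
  n_conditions = rpois(n, lambda = 1.5) + 1,  # every patient has at least one condition
  n_meds       = rpois(n, lambda = 3),        # some patients are on no medication
  academic     = rbinom(n, 1, 0.2) == 1,      # seen at the academic site
  digital      = rbinom(n, 1, 0.4) == 1       # strong digital engagement
)

# Explanatory-style rule: narrow and tightly controlled.
explanatory <- with(
  pop,
  age >= 40 & age <= 65 & n_conditions == 1 & n_meds == 1 & academic & digital
)

# Pragmatic-style rule: one basic clinical criterion (adult on at least one medication).
pragmatic <- with(pop, age >= 18 & n_meds >= 1)

# Share of the candidate population each rule would admit.
eligibility <- c(
  explanatory = mean(explanatory),
  pragmatic   = mean(pragmatic)
)
round(eligibility, 3)
```

Under these assumed distributions the narrow rule admits only a small fraction of the patients that the broad rule does, which is the applicability gap the pragmatic design is trying to close.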

6. Pragmatic Trials Often Use Outcomes That Matter Operationally

Pragmatic trials tend to prioritize outcomes that matter to patients, clinicians, and systems in routine care.

Examples include:

hospitalization
emergency visits
treatment discontinuation
symptom burden
workflow burden
clinician uptake
time to action
or all-cause utilization

These may differ from tightly controlled surrogate endpoints used in explanatory trials.

The key principle is that a pragmatic outcome should reflect whether the intervention improves care in a way that would matter outside the trial.

This is especially important in AI studies, where a model can improve a technical metric without improving actual clinical or operational outcomes.

7. A Simple Pragmatic Trial Scenario Helps Frame the Analysis

To illustrate, we will simulate a healthcare-system style pragmatic trial comparing:

usual care
versus an AI-assisted outreach intervention

The outcome will be a binary event such as hospitalization within follow-up.

The key idea is that the intervention is deployed in routine settings with multiple clinics and varying baseline risk.

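library(dplyr)
library(tibble)
library(ggplot2)

# Trial dimensions: 12 clinics (clusters) with 80 patients each.
n_clusters <- 12
patients_per_cluster <- 80

# One row per patient. No seed has been set at this point, so the covariate
# draws (and the exact rates printed below) will vary from run to run.
prag_df <- expand.grid(
  clinic = paste0("Clinic_", 1:n_clusters),
  patient = 1:patients_per_cluster
) |>
  tibble::as_tibble() |>
  dplyr::mutate(
    age = rnorm(dplyr::n(), mean = 63, sd = 12),        # years
    severity = rnorm(dplyr::n(), mean = 0, sd = 1),     # standardized score
    comorbidity = rnorm(dplyr::n(), mean = 0, sd = 1)   # standardized score
  )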

Now assign intervention at the clinic level, as might happen in a system-wide workflow rollout.

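set.seed(20260316)

# Assign each clinic to an arm. With replace = TRUE every clinic is an
# independent coin flip, so the two arms need not end up the same size.
intervention_map <- tibble::tibble(
  clinic = unique(prag_df$clinic),
  intervention = sample(c(0, 1), size = n_clusters, replace = TRUE)
)

prag_df <- prag_df |>
  dplyr::left_join(intervention_map, by = "clinic") |>
  dplyr::mutate(
    # Fixed clinic-level shift on the log-odds scale (0.1 for Clinic_1 up to 1.2 for Clinic_12).
    clinic_effect = as.numeric(as.factor(clinic)) / 10,
    # Event probability from a logistic model of arm, patient covariates, and clinic.
    event_prob = plogis(
      -2.4 + 0.5 * intervention + 0.9 * severity +
        0.8 * comorbidity + 0.02 * age + clinic_effect
    ),
    outcome = rbinom(dplyr::n(), size = 1, prob = event_prob)
  )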

prag_df |>
  dplyr::summarise(
    n = dplyr::n(),
    event_rate = mean(outcome)
  )

# A tibble: 1 × 2
      n event_rate
  <int>      <dbl>
1   960      0.448

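This gives a pragmatic-style clustered implementation setting: 960 patients in 12 clinics, with an overall event rate of about 45%. Note that the intervention coefficient in the simulation is positive (0.5 on the log-odds scale), so the simulated intervention raises the probability of the event. Use a negative coefficient to simulate a protective intervention.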

8. Pragmatic Trials Often Need Cluster Randomization

In many real-world settings, the intervention is not naturally assigned at the individual level.

For example:

one clinic gets a decision-support tool
one ward uses a new workflow
one practice adopts a digital alert
one hospital unit gets a staffing or implementation intervention

In those cases, cluster randomization is often more realistic than individual randomization (Donner and Klar 2004; Ford and Norrie 2016).

Cluster randomization assigns groups rather than individuals.

This is especially common in pragmatic trials because routine-care interventions are often delivered at the level of:

clinic
unit
hospital
physician
or practice

That makes the design operationally feasible, even though it introduces correlation within clusters.
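The simulation in Section 7 assigns each clinic with `sample(c(0, 1), ..., replace = TRUE)`, which is why the printed cluster table shows 5 intervention and 7 control clinics. When only a handful of clusters are available, a fixed balanced allocation is a common alternative. The sketch below shows one way to do that in base R; the seed and object names are arbitrary choices for illustration, not part of the original analysis.

```r
# Sketch: balanced allocation of clinics to arms (exactly half per arm).
set.seed(42)  # arbitrary seed, used only to make this sketch repeatable
n_clusters <- 12
clinics <- paste0("Clinic_", seq_len(n_clusters))

# Build a vector with n_clusters / 2 zeros and ones, then shuffle it.
# sample(x) with no size argument returns a random permutation of x.
balanced_map <- data.frame(
  clinic = clinics,
  intervention = sample(rep(c(0, 1), each = n_clusters / 2))
)

# Confirm the arms are the same size.
table(balanced_map$intervention)
```

Balanced allocation does not remove the need for cluster-aware analysis; it only prevents chance imbalance in the number of clusters per arm.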

9. Cluster Randomization Improves Feasibility, but Changes the Analysis

Once a trial is randomized by clinic or site, outcomes within the same cluster are no longer independent.

Patients within a clinic may resemble one another because of:

shared staff,
shared workflows,
shared implementation quality,
or shared local population structure.

This means the analysis must account for clustering.

Ignoring the cluster structure can lead to:

underestimated standard errors,
overly optimistic significance,
and misleading inference.

That is why pragmatic trials often need cluster-aware methods.
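One standard way to quantify this loss of information is the design effect for equal cluster sizes, 1 + (m - 1) × ICC, where m is the cluster size and ICC is the intracluster correlation. The sketch below applies it to the dimensions of the simulated trial. The ICC values are assumptions chosen for illustration, not estimates from the data.

```r
# Design effect for equal cluster sizes: DEFF = 1 + (m - 1) * ICC
design_effect <- function(m, icc) 1 + (m - 1) * icc

m <- 80          # patients per clinic, as in the simulation
n_total <- 960   # 12 clinics x 80 patients
icc <- c(0.01, 0.05, 0.10)  # assumed ICC values, purely illustrative

deff <- design_effect(m, icc)

# Effective sample size: how many independent patients the trial is "worth".
data.frame(
  icc = icc,
  design_effect = deff,
  effective_n = round(n_total / deff)
)
```

With an assumed ICC of 0.05, the design effect is 4.95, so the 960 clustered patients carry roughly the information of 194 independent ones. Standard errors that ignore this will be too small.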

10. A Simple Cluster-Level Summary Is Often a Good First Check

Before fitting models, it is often useful to summarize outcomes at the cluster level.

cluster_tbl <- prag_df |>
  dplyr::group_by(clinic, intervention) |>
  dplyr::summarise(
    n = dplyr::n(),
    event_rate = mean(outcome),
    .groups = "drop"
  )

cluster_tbl

# A tibble: 12 × 4
   clinic    intervention     n event_rate
   <fct>            <dbl> <int>      <dbl>
 1 Clinic_1             1    80      0.338
 2 Clinic_2             0    80      0.438
 3 Clinic_3             0    80      0.362
 4 Clinic_4             0    80      0.338
 5 Clinic_5             1    80      0.488
 6 Clinic_6             0    80      0.45 
 7 Clinic_7             1    80      0.5  
 8 Clinic_8             0    80      0.375
 9 Clinic_9             1    80      0.55 
10 Clinic_10            0    80      0.475
11 Clinic_11            0    80      0.438
12 Clinic_12            1    80      0.625

This helps show whether outcome rates differ across clinics and whether the intervention contrast appears consistent or highly variable by site.

That variability is a major part of pragmatic evidence.
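A simple cluster-aware analysis can be run directly on this table by treating each clinic as one observation. The sketch below copies the printed clinic rates into a data frame and compares the arms with a Welch t-test on cluster-level rates. It is one illustrative option under the assumption of equal cluster sizes, not the analysis used elsewhere in this post.

```r
# Cluster-level rates copied from the printed table above.
cluster_rates <- data.frame(
  clinic = paste0("Clinic_", 1:12),
  intervention = c(1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1),
  event_rate = c(0.338, 0.438, 0.362, 0.338, 0.488, 0.45,
                 0.5, 0.375, 0.55, 0.475, 0.438, 0.625)
)

# Mean of the clinic-level rates within each arm.
arm_means <- tapply(cluster_rates$event_rate, cluster_rates$intervention, mean)
rate_diff <- as.numeric(arm_means["1"]) - as.numeric(arm_means["0"])

# Welch two-sample t-test with the clinic as the unit of analysis.
tt <- t.test(event_rate ~ intervention, data = cluster_rates)

round(arm_means, 3)
round(rate_diff, 3)
tt$conf.int  # note: t.test reports arm 0 minus arm 1
```

The unit of inference here is the clinic, so the test has only 12 observations to work with. That is the honest sample size for a between-clinic comparison.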

11. Pragmatic Trials Accept Real-World Noise Rather Than Designing It Away

One of the defining features of pragmatic trials is that they allow much of the natural variation of real care to remain.

That means the study may include:

nonadherence,
treatment delays,
site heterogeneity,
incomplete uptake,
and co-interventions.

Traditional explanatory designs often try to suppress these features.

Pragmatic designs often leave them in because they are part of the real implementation question.

This is one reason pragmatic trials may look noisier. But that noise is often exactly what makes the result more relevant.

12. Intention-to-Treat Thinking Is Especially Important Here

Because pragmatic trials preserve real-world variation, intention-to-treat reasoning becomes especially important.

The intervention is often evaluated according to assignment, even when:

implementation is imperfect,
uptake is incomplete,
clinicians override the tool,
or patients do not fully adhere.

This is not a flaw. It reflects the pragmatic estimand:

what happens when this strategy is implemented in routine care, not only when everyone follows it perfectly?

That is usually the decision-relevant question for healthcare systems.
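The difference between analyzing by assignment and analyzing by what was actually received can be shown with a small worked example. The data below are hypothetical and hand-built: 10 patients per arm, with only 6 of the 10 assigned to the intervention actually receiving it, and the non-receivers having worse outcomes.

```r
# Hypothetical toy data, constructed by hand for illustration.
toy <- data.frame(
  assigned = rep(c(1, 0), each = 10),
  received = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, rep(0, 10)),
  outcome  = c(0, 0, 0, 0, 1, 0, 1, 1, 0, 1,   # assigned to intervention
               1, 0, 1, 0, 1, 0, 0, 1, 0, 1)   # assigned to usual care
)

# Intention-to-treat: compare by assigned arm, regardless of uptake.
itt_rd <- with(toy, mean(outcome[assigned == 1]) - mean(outcome[assigned == 0]))

# As-treated: compare by what was received (breaks the randomization).
at_rd <- with(toy, mean(outcome[received == 1]) - mean(outcome[received == 0]))

round(c(itt = itt_rd, as_treated = at_rd), 3)
```

In this toy example the intention-to-treat risk difference is -0.10, while the as-treated contrast is about -0.40. The as-treated figure looks more impressive only because the patients who did not take up the intervention were at higher risk, which is exactly the selection problem that analysis by assignment avoids.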

13. A Simple Cluster-Aware Model Can Reflect the Design

For a binary outcome, one pragmatic approach is a mixed-effects logistic model with a clinic-level random intercept.

required_pkgs <- c("lme4")
missing_pkgs <- required_pkgs[
  !vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
]

if (length(missing_pkgs) > 0) {
  stop("Missing packages: ", paste(missing_pkgs, collapse = ", "))
}

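# Random-intercept logistic model: fixed effects for arm and patient
# covariates, plus a clinic-level intercept for within-clinic correlation.
fit_prag <- lme4::glmer(
  outcome ~ intervention + age + severity + comorbidity + (1 | clinic),
  data = prag_df,
  family = binomial()
)

# Fixed-effect estimates are on the log-odds scale.
summary(fit_prag)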

This is not the only valid approach (generalized estimating equations and cluster-level summary analyses are common alternatives), but it makes the cluster structure explicit.
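Two quantities are often read off a model like this: the intervention odds ratio and the intracluster correlation. For a random-intercept logistic model, a common latent-scale approximation of the ICC is the clinic variance divided by the clinic variance plus π²/3. The sketch below defines that helper in base R and, only if `fit_prag` and `lme4` are available in the session, extracts both quantities. The helper name and the example variances are assumptions for illustration.

```r
# Latent-scale ICC for a random-intercept logistic model.
# The level-1 residual variance on the logit scale is fixed at pi^2 / 3.
icc_latent <- function(var_clinic) var_clinic / (var_clinic + pi^2 / 3)

# Illustrative clinic variances only, not estimates from the fit above.
round(icc_latent(c(0.05, 0.25, 1.00)), 3)

# If the fitted model from the previous chunk exists, pull the pieces out.
if (exists("fit_prag") && requireNamespace("lme4", quietly = TRUE)) {
  var_clinic <- as.numeric(lme4::VarCorr(fit_prag)$clinic[1, 1])
  or_intervention <- exp(lme4::fixef(fit_prag)[["intervention"]])
  print(c(icc = icc_latent(var_clinic), odds_ratio = or_intervention))
}
```

The odds ratio from this model is conditional on the clinic, so it compares patients within the same clinic rather than averaging across the population of clinics.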

14. Pragmatic Trials Are Especially Relevant for AI Interventions

Pragmatic trial design is extremely important in healthcare AI because many AI interventions are not drugs. They are:

alerts,
triage systems,
decision-support tools,
prioritization models,
documentation aids,
or workflow-integrated recommendations.

These interventions often behave differently in deployment than in sandbox testing.

For example, a model may show excellent discrimination in retrospective validation but have weak real-world impact because:

clinicians ignore it,
alert fatigue develops,
the workflow is disrupted,
or the population differs from the training data.

Pragmatic trials are often the right design for evaluating whether the AI system actually improves care when embedded in practice.

15. The Difference Between Efficacy and Implementation Is Central for AI

A model can be technically accurate and still operationally ineffective.

That is one of the key reasons pragmatic trials matter for AI.

The real question is often not:

can the model classify well in a held-out dataset?

It is:

does this system improve decisions, outcomes, or workflow when used under routine conditions?

That is a pragmatic trial question.

This is one reason AI evaluation increasingly needs study-design language, not only model-performance language.

16. The Salford Lung Study Is a Useful Mental Model for Pragmatic Thinking

A helpful way to frame pragmatic trials is to think of them as studies that preserve routine care instead of replacing it with a highly curated research environment.

A classic example often discussed in this context is the Salford Lung Study, which emphasized real-world care conditions, broad inclusion, and routine practice integration.

The important lesson is not to memorize one specific case. It is to see the design principle:

when the goal is real-world applicability, the trial should resemble the real world closely enough for the result to travel there.

That is the pragmatic design mindset.

17. External Validity Is One of the Major Strengths of Pragmatic Trials

Traditional efficacy trials often maximize internal validity at the expense of real-world applicability.

Pragmatic trials shift that balance.

Because they include:

broader patients,
ordinary clinicians,
routine systems,
and more realistic adherence patterns,

their results often have stronger external relevance for real deployment.

That is especially important for AI/ML interventions intended for healthcare systems, where transportability across messy environments matters at least as much as ideal-condition performance.

18. Pragmatic Trials Are Not Looser Science: They Are Different Science

A common misunderstanding is that pragmatic trials are simply less rigorous because they are less controlled.

That is not the right view (Ford and Norrie 2016; Loudon et al. 2015).

Pragmatic trials answer different questions. They prioritize:

applicability,
implementation reality,
system-level usefulness,
and ordinary-care effectiveness.

That may reduce some forms of internal control, but it can increase decision relevance substantially.

The right comparison is not:

strict versus sloppy.

It is:

efficacy-focused versus effectiveness-focused.

That is a much better way to teach the design tradeoff.

19. A Practical Checklist for Applied Work

Before designing or interpreting a pragmatic trial, ask:

Is the study trying to estimate efficacy or effectiveness?
Are the eligibility criteria broad enough to reflect the real target population?
Is the intervention embedded in ordinary care rather than an artificial research environment?
Should randomization occur at the patient level or the cluster level?
Are the outcomes meaningful in routine practice?
Is the analysis aligned with intention-to-treat logic?
Will the result still matter once the intervention is deployed at scale?

These questions usually matter more than whether the trial looks neat on paper.
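Several of these questions correspond to PRECIS-2 domains, which are each scored from 1 (very explanatory) to 5 (very pragmatic) (Loudon et al. 2015). A lightweight way to make design intent explicit is to record the nine domain scores for a planned trial. The scores below are hypothetical values for an imagined clinic-level AI outreach trial, included only to show the bookkeeping.

```r
# PRECIS-2 domains, scored 1 (very explanatory) to 5 (very pragmatic).
# The scores are hypothetical, for an imagined trial design.
precis2 <- c(
  eligibility           = 5,
  recruitment           = 4,
  setting               = 5,
  organisation          = 4,
  flexibility_delivery  = 4,
  flexibility_adherence = 5,
  follow_up             = 5,
  primary_outcome       = 4,
  primary_analysis      = 3
)

# Basic validity check on the scoring scale.
stopifnot(all(precis2 %in% 1:5))

# Flag domains where the design leans explanatory or sits in the middle.
needs_review <- names(precis2)[precis2 <= 3]
needs_review
```

PRECIS-2 is meant to be read domain by domain, so listing the domains that need review is usually more informative than collapsing the scores into a single number.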

Where This Shows Up in AI/ML

The PRECIS-2 framework, which rates how pragmatic a trial is across nine design domains, maps directly onto the gap between clinical AI validation studies and deployment reality. Most published clinical AI validations sit at the explanatory end of nearly every PRECIS-2 domain: controlled patient populations, single academic sites, curated and complete data, and supervised implementation with expert oversight. None of those conditions hold at a forward operating base or at a rural military treatment facility (MTF) running MHS GENESIS, the Military Health System's electronic health record.

How explanatory the original validation study was is therefore a reasonable first-order guide to how much performance may degrade at deployment: the more explanatory the validation, the larger the expected gap. Sepsis prediction tools embedded in Epic and validated at academic centers have reportedly shown drops in positive predictive value on the order of 30–50% when deployed at community hospitals, and the same dynamic plausibly applies to any model derived from the Department of Defense Trauma Registry (DoDTR) and fielded outside the large trauma centers that dominate the registry.
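One mechanism that can produce a drop in positive predictive value, even when the model itself is unchanged, is a lower event prevalence at the deployment site. The sketch below works through the arithmetic with Bayes' rule. The sensitivity, specificity, and prevalence values are assumptions for illustration and are not taken from any published validation.

```r
# PPV from sensitivity, specificity, and prevalence (Bayes' rule).
ppv <- function(sens, spec, prev) {
  (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))
}

# Assumed operating characteristics, held fixed across sites.
sens <- 0.80
spec <- 0.90

ppv_dev <- ppv(sens, spec, prev = 0.10)  # hypothetical development site
ppv_dep <- ppv(sens, spec, prev = 0.05)  # hypothetical deployment site

round(c(
  development   = ppv_dev,
  deployment    = ppv_dep,
  relative_drop = 1 - ppv_dep / ppv_dev
), 3)
```

Under these assumptions, halving the prevalence cuts PPV from about 0.47 to about 0.30, a relative drop of roughly 37%, with no change in discrimination. Real deployments add further sources of degradation, such as data quality and workflow differences, which only a pragmatic evaluation can capture.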

Closing: Pragmatic Trials Test Whether an Intervention Survives Contact with Reality

Pragmatic trials matter because many interventions look better in controlled settings than they do in the real world.

By embedding the study in routine care, minimizing unnecessary exclusions, and using outcomes that matter operationally, pragmatic trials test whether an intervention remains valuable once real patients, real clinicians, and real systems are allowed back into the picture.

That makes them especially important for modern healthcare AI, where deployment conditions often determine success as much as algorithm quality does.

Pragmatic trials matter because the most useful evidence is not only about whether an intervention can work, but whether it still works when routine care stops protecting it from reality.

📚 Go Deeper: Real-World Evidence Toolkit

This post is part of the Real-World Evidence Toolkit — a companion reference with pragmatic trial reporting templates, cluster randomization analysis scaffolds, PRECIS-2 checklist guidance, and effectiveness evaluation workflows.


Series Callout

This post is part of the series on Design of Experiments for Biostats and AI/ML:

Randomized controlled trials
Observational study designs
Cross-sectional study design
Longitudinal study design
Sample size and power analysis
Stratification and randomization techniques
Blinding and placebo controls
Adaptive study designs
Pragmatic trials
Quasi-experimental designs


References

Donner, Allan, and Neil Klar. 2004. “Pitfalls of and Controversies in Cluster Randomization Trials.” American Journal of Public Health 94 (3): 416–22. https://doi.org/10.2105/AJPH.94.3.416.

Ford, Ian, and John Norrie. 2016. “Pragmatic Trials.” The New England Journal of Medicine 375 (5): 454–63. https://doi.org/10.1056/NEJMra1510059.

Loudon, Kirsty, Shaun Treweek, Frank Sullivan, Peter Donnan, Kate E. Thorpe, and Merrick Zwarenstein. 2015. “The PRECIS-2 Tool: Designing Trials That Are Fit for Purpose.” BMJ 350: h2147. https://doi.org/10.1136/bmj.h2147.
