Why Statistical Significance Is a Terrible Stopping Rule
Executive Summary
“Statistically significant” is not a conclusion.
It is a checkpoint—and a weak one.
Yet in clinical research, operational analytics, and applied machine learning, statistical significance is often treated as:
- the final verdict,
- the justification for action,
- or proof that a model “works.”
This post explains why that mindset fails—and what to replace it with if your goal is better decisions, not just publishable results.
What Statistical Significance Actually Answers
A p-value answers this question:
If the null hypothesis were true, how surprising would these data be?
That’s it.
The question is much narrower than most readers assume, and the answer does not by itself measure effect importance, clinical relevance, or decision value (Wasserstein and Lazar 2016; Amrhein et al. 2019).
It does not answer:
- Is the effect important?
- Is it meaningful?
- Is it actionable?
- Is it stable?
- Is it worth acting on?
Yet those are the questions people think they’re answering.
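To make that narrow question concrete, here is a minimal sketch that approximates a p-value by permutation. It shuffles group labels to mimic a world where the null is true, then counts how often the shuffled difference is at least as extreme as the observed one. The simulated data, the group sizes, and the helper name `perm_p_value` are illustrative assumptions, not part of any analysis in this post.

```r
# Hypothetical two-group data: a simulated outcome with a small true shift.
set.seed(42)
treated <- rnorm(40, mean = 0.3)
control <- rnorm(40, mean = 0.0)

# Two-sided permutation p-value for a difference in means.
perm_p_value <- function(a, b, n_perm = 5000) {
  observed <- mean(a) - mean(b)
  pooled <- c(a, b)
  n_a <- length(a)
  null_diffs <- replicate(n_perm, {
    shuffled <- sample(pooled)  # labels are exchangeable under the null
    mean(shuffled[seq_len(n_a)]) - mean(shuffled[-seq_len(n_a)])
  })
  # Share of null-world differences at least as extreme as the observed one.
  mean(abs(null_diffs) >= abs(observed))
}

p_perm <- perm_p_value(treated, control)
p_perm
```

Nothing in that calculation refers to how large a difference would matter clinically. It only describes how surprising the data would be under the null.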
Why Significance Feels So Comforting
Statistical significance persists because it offers:
- a binary decision,
- a familiar threshold,
- a sense of objectivity,
- and an illusion of certainty.
In messy, high-stakes environments, that certainty is emotionally appealing.
Unfortunately, it’s often false.
Sample Size Turns Trivial Effects Into “Discoveries”
One of the most dangerous properties of p-values:
With enough data, almost any nonzero effect, however trivial, becomes statistically significant.
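set.seed(123)  # make the simulation reproducible
# Return the p-value for the slope in a simple linear regression.
simulate_p <- function(n, effect) {
  x <- rnorm(n)
  y <- x * effect + rnorm(n)
  summary(lm(y ~ x))$coefficients[2, 4]
}
# Same true effect (0.05), increasing sample size.
sapply(c(50, 500, 5000), simulate_p, effect = 0.05)
The effect didn’t change. The story did.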
Large N converts trivially small effects into confident-sounding findings (Cohen 1994).
A Narrow Confidence Interval Can Still Be Useless
You’ve probably seen results like this:
- p < 0.001
- 95% CI: (1.01, 1.03)
Technically impressive. Clinically irrelevant.
Statistical precision does not equal practical importance.
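One way to keep precision and importance separate is to compare the interval against a minimal important effect rather than against the null alone. The sketch below assumes the interval above is on a ratio scale (for example an odds ratio, which the post does not specify) and uses a hypothetical minimal important ratio of 1.10. The function name `classify_interval` and both thresholds are illustrative assumptions.

```r
# Classify a confidence interval against a minimal important effect.
# `ci` is c(lower, upper); `null_value` and `min_important` share its scale.
# Assumes the effect of interest lies above the null value.
classify_interval <- function(ci, null_value, min_important) {
  lower <- ci[1]
  upper <- ci[2]
  if (lower > null_value && upper < min_important) {
    "nonzero but below the threshold that matters"
  } else if (lower >= min_important) {
    "clearly large enough to matter"
  } else if (lower <= null_value && upper < min_important) {
    "compatible with no effect and too small to matter"
  } else {
    "inconclusive: interval spans the threshold"
  }
}

# Interval from the example above, treated as a ratio (assumption),
# with a hypothetical minimal important ratio of 1.10.
verdict <- classify_interval(c(1.01, 1.03), null_value = 1, min_important = 1.10)
verdict
```

The threshold has to come from the clinical context. No amount of data can supply it.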
R², Significance, and the Illusion of Explanation
A model can be:
- statistically significant,
- precisely estimated,
- and explain almost none of the outcome.
This is not rare. It’s common.
summary(lm(outcome ~ predictor, data = data))$r.squared
A low R² doesn’t mean the model is “wrong.” But significance alone does not mean it’s useful.
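A small simulation shows how the two can come apart. The sketch below uses simulated data with a true slope of 0.05 and unit-variance noise, both of which are assumptions chosen for illustration. The predictor is highly significant, yet it explains well under 1% of the variance.

```r
set.seed(1)
n <- 20000
predictor <- rnorm(n)
outcome <- 0.05 * predictor + rnorm(n)  # true effect is small relative to noise

fit <- summary(lm(outcome ~ predictor))
p_value <- fit$coefficients["predictor", "Pr(>|t|)"]
r_squared <- fit$r.squared

# A tiny p-value alongside a tiny share of explained variance.
c(p_value = p_value, r_squared = r_squared)
```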
Binary Thinking Breaks Clinical Decisions
Clinical decisions are rarely yes/no.
They are:
- risk-based,
- threshold-driven,
- context-dependent,
- asymmetric in harm.
Statistical significance enforces a binary worldview onto a probabilistic reality.
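“Asymmetric in harm” can be made concrete with an expected-cost rule. The sketch below assumes hypothetical relative costs (a missed case is ten times as costly as an unnecessary intervention) and derives the risk threshold above which acting is cheaper in expectation. The costs, the risks, and the function name `expected_cost` are illustrative assumptions.

```r
# Hypothetical relative costs of the two kinds of error.
cost_false_negative <- 10  # missing a patient who needed intervention
cost_false_positive <- 1   # intervening on a patient who did not need it

# Expected cost of an action for a patient with predicted risk `risk`.
expected_cost <- function(risk, act) {
  if (act) (1 - risk) * cost_false_positive else risk * cost_false_negative
}

# Acting is cheaper in expectation once risk exceeds this threshold.
risk_threshold <- cost_false_positive / (cost_false_positive + cost_false_negative)

risks <- c(0.02, 0.05, 0.10, 0.30)
decisions <- data.frame(
  risk = risks,
  cost_if_act = sapply(risks, expected_cost, act = TRUE),
  cost_if_wait = sapply(risks, expected_cost, act = FALSE)
)
decisions$act <- decisions$risk > risk_threshold
decisions
```

With these costs the threshold is 1/11, about 9% risk. That number comes from the harms, not from any significance level.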
What Decision-Makers Actually Want to Know
Replace:
“Is this significant?”
With:
- How big is the effect?
- How uncertain is it?
- How often would it matter?
- Who benefits?
- What’s the cost of being wrong?
Significance answers none of these.
Better Questions (and Better Metrics)
8.1 Effect Sizes and Uncertainty
confint(lm(outcome ~ predictor, data = data))
Magnitude beats stars.
8.2 Probabilities of Clinically Meaningful Effects
mean(effect_samples > clinically_relevant_threshold)
This answers:
“How likely is this to matter?”
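The post does not say where `effect_samples` comes from. As a minimal sketch, the block below builds approximate posterior draws from a normal distribution centred on a regression estimate. That amounts to a flat prior and a large-sample approximation, which is an assumption and not a full Bayesian fit. The simulated data and the threshold of 0.2 are also hypothetical.

```r
set.seed(7)
n <- 400
predictor <- rnorm(n)
outcome <- 0.25 * predictor + rnorm(n)  # simulated data, for illustration only

fit <- summary(lm(outcome ~ predictor))
estimate <- fit$coefficients["predictor", "Estimate"]
std_error <- fit$coefficients["predictor", "Std. Error"]

# Approximate posterior draws: normal centred on the estimate
# (flat prior, large-sample approximation; an assumption).
effect_samples <- rnorm(10000, mean = estimate, sd = std_error)

clinically_relevant_threshold <- 0.2  # hypothetical threshold
prob_meaningful <- mean(effect_samples > clinically_relevant_threshold)
prob_meaningful
```

Draws from a fitted Bayesian model would slot into the same last two lines. Only the source of `effect_samples` changes.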
8.3 Calibration and Error Tradeoffs
In prediction:
- false negatives often dominate harm,
- thresholds matter more than p-values.
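Discrimination metrics such as AUC do not measure calibration, so it helps to check both. The sketch below simulates a deliberately overconfident model, summarizes calibration with a logistic recalibration intercept and slope, and counts errors at a few thresholds. The simulated data, the thresholds, and the helper name `confusion_at` are illustrative assumptions.

```r
set.seed(11)
n <- 5000
true_logit <- rnorm(n, mean = -2, sd = 1)
truth <- rbinom(n, size = 1, prob = plogis(true_logit))

# Hypothetical overconfident model: predicted logits are stretched.
pred <- plogis(2 * true_logit + 2)

# Logistic recalibration: slope near 1 and intercept near 0 suggest
# good calibration; a slope below 1 suggests overconfident predictions.
recal <- glm(truth ~ qlogis(pred), family = binomial())
calibration <- setNames(coef(recal), c("intercept", "slope"))

# Error counts at a given decision threshold.
confusion_at <- function(threshold) {
  flagged <- pred >= threshold
  c(threshold = threshold,
    false_negatives = sum(!flagged & truth == 1),
    false_positives = sum(flagged & truth == 0))
}
tradeoffs <- t(sapply(c(0.05, 0.20, 0.50), confusion_at))

calibration
tradeoffs
```

Raising the threshold trades false positives for false negatives. Which trade is acceptable depends on the harms, not on a p-value.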
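library(yardstick)
# `preds` is assumed to be a data frame with a factor column `truth`
# and a predicted-probability column `.pred`.
roc_auc(preds, truth, .pred)
pr_auc(preds, truth, .pred)
When Significance Actively Causes Harm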
Statistical significance becomes dangerous when it:
- justifies premature deployment,
- silences uncertainty,
- masks bias,
- discourages sensitivity analysis,
- or shuts down further questioning.
At that point, it’s no longer a statistical tool—it’s rhetoric.
Sensitivity Analysis Beats Significance Every Time
A result that:
- survives reasonable perturbations,
- holds across definitions,
- and degrades gracefully,
…is more trustworthy than one with a tiny p-value.
Robustness > rejection.
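A basic sensitivity analysis can be as simple as refitting the same question under several reasonable specifications and comparing the estimates side by side. The sketch below uses simulated data in which severity confounds the exposure effect. The variable names, the specifications, and the trimming rule are hypothetical choices for illustration.

```r
set.seed(3)
n <- 1000
registry <- data.frame(
  age = rnorm(n, mean = 40, sd = 15),
  severity = rnorm(n)
)
registry$exposure <- rbinom(n, 1, plogis(0.5 * registry$severity))
registry$outcome <- 0.4 * registry$exposure + 0.6 * registry$severity +
  0.01 * registry$age + rnorm(n)

# Reasonable alternative specifications of the same question.
specifications <- list(
  unadjusted = outcome ~ exposure,
  adjust_severity = outcome ~ exposure + severity,
  adjust_all = outcome ~ exposure + severity + age
)

# Estimate and 95% interval for the exposure coefficient.
fit_exposure <- function(formula, data) {
  fit <- lm(formula, data = data)
  ci <- confint(fit)["exposure", ]
  c(estimate = unname(coef(fit)["exposure"]), lower = ci[[1]], upper = ci[[2]])
}

# Also refit on a trimmed sample that drops extreme severity values.
trimmed <- registry[abs(registry$severity) < 2, ]
sensitivity <- rbind(
  t(sapply(specifications, fit_exposure, data = registry)),
  trimmed_adjust_all = fit_exposure(specifications$adjust_all, trimmed)
)
round(sensitivity, 2)
```

If the estimate moves a lot between rows, that movement is the finding worth reporting.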
What to Say Instead (Reviewer-Safe Language)
Instead of:
“The result was statistically significant.”
Try:
- “The estimated effect was modest but consistent.”
- “Uncertainty remains substantial.”
- “Results were sensitive to assumptions X and Y.”
- “Clinical relevance depends on context.”
These statements build credibility, not weakness.
A Better Stopping Rule
A defensible analysis stops when:
- the decision context is addressed,
- uncertainty is quantified,
- sensitivity is explored,
- and limitations are explicit.
Not when a threshold is crossed.
AI model comparison studies routinely report that model A is “significantly better” than model B based on a p-value for the AUC difference, when the actual AUC gap is 0.003, a difference with no clinical meaning in any trauma decision support context. In DoDTR-based model development, a 2-point AUC improvement that achieves statistical significance in a 10,000-record registry is clinically meaningless if neither model is well calibrated for the high-acuity presentations (penetrating injury, hemorrhagic shock, blast polytrauma) that drive mortality. A MAVEN alert that fires 3% more accurately on average but remains poorly calibrated for the cases that matter most has not improved care. At best, statistical significance suggests the signal is unlikely to be chance alone; it says nothing about whether the signal is large enough to act on.
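One alternative to a bare p-value is to report the AUC gap with an interval and compare it against a gap that would matter. The sketch below uses simulated scores for two hypothetical models on 10,000 records, a rank-based AUC, and a paired bootstrap. The simulated data, the `min_meaningful_gap` of 0.02, and the helper name `auc` are illustrative assumptions, not DoDTR or MAVEN results.

```r
# Rank-based (Mann-Whitney) AUC; ties receive average ranks.
auc <- function(score, truth) {
  ranks <- rank(score)
  n_pos <- sum(truth == 1)
  n_neg <- sum(truth == 0)
  (sum(ranks[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

set.seed(2024)
n <- 10000
truth <- rbinom(n, 1, 0.1)
signal <- rnorm(n) + 1.5 * truth
score_a <- signal + rnorm(n, sd = 0.50)  # hypothetical model A
score_b <- signal + rnorm(n, sd = 0.55)  # hypothetical model B, slightly noisier

auc_gap <- auc(score_a, truth) - auc(score_b, truth)

# Paired bootstrap: resample records, recompute both AUCs on the same resample.
boot_gaps <- replicate(500, {
  idx <- sample.int(n, replace = TRUE)
  auc(score_a[idx], truth[idx]) - auc(score_b[idx], truth[idx])
})
gap_interval <- quantile(boot_gaps, c(0.025, 0.975))

min_meaningful_gap <- 0.02  # hypothetical; set from the clinical context
list(auc_gap = auc_gap, interval = gap_interval,
     could_matter = unname(gap_interval[2] >= min_meaningful_gap))
```

An overall gap still says nothing about the high-acuity subgroups, so the same comparison would need repeating within them, alongside a calibration check.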
Closing: Significance Is a Tool, Not a Verdict
Statistical significance is not useless.
It’s just insufficient.
When it becomes the stopping rule (Wasserstein et al. 2019; Cumming 2014), it:
- narrows thinking,
- overstates certainty,
- and weakens decisions.
Good applied statistics doesn’t ask:
“Is this significant?”
It asks:
“What does this mean, for whom, and at what cost?”
That’s where rigor actually lives.
Series Callout
This post is part of a broader Trauma Registry and Other Topics Series:
- Why Most Clinical Models Fail in the Real World (and How to Fix Them in R)
- Audit-Ready Applied Statistics: How to Make Your R Analysis Defensible
- Bayesian Models for Clinicians Who Hate Math (But Love Good Decisions)
- Missing Data Is the Real Model: Practical Strategies in R
- From Registry to Knowledge: How to Analyze Messy Trauma Data Without Lying to Yourself
- Why Statistical Significance Is a Terrible Stopping Rule
- Hierarchical Models Are Not Optional in Healthcare (Here’s Why)
- Prediction ≠ Causation: How to Use Each Correctly in Applied Statistics
- How to Evaluate Models When the Outcome Is Rare (and Lives Are at Stake)
- Building Clinical Decision Support That Doesn’t Collapse Under Scrutiny
- Rare Event Modeling in Clinical Prediction: Why 1% Outcomes Break Your Model (And What to Do in R)
- Calibration Under Drift: How Clinical Models Become Confident and Wrong (And How to Monitor It in R)
- Audit-Ready Bayesian Workflows: Why Transparency Is a Process, Not a Model Feature
- Missing Data in Hierarchical Clinical Models: Why Structure Changes the Problem
- MNAR Sensitivity Analysis for Applied Work: What to Do When Missingness Depends on Reality