Why Statistical Significance Is a Terrible Stopping Rule

Trauma Registry and Other Topics

Statistical Inference

Why p-values are a weak stopping rule in clinical, operational, and AI-enabled analyses, and what to report instead.

Published

April 1, 2024

Modified

June 9, 2026

Executive Summary

“Statistically significant” is not a conclusion.
It is a checkpoint—and a weak one.

Yet in clinical research, operational analytics, and applied machine learning, statistical significance is often treated as:

the final verdict,
the justification for action,
or proof that a model “works.”

This post explains why that mindset fails—and what to replace it with if your goal is better decisions, not just publishable results.

What Statistical Significance Actually Answers

A p-value answers this question:

If the null hypothesis and the model’s other assumptions were true, how often would data at least this extreme occur?

That’s it. The question is much narrower than most readers assume, and the answer does not by itself measure effect importance, clinical relevance, or decision value (Wasserstein and Lazar 2016; Amrhein et al. 2019).

It does not answer:

Is the effect important?
Is it meaningful?
Is it actionable?
Is it stable?
Is it worth acting on?

Yet those are the questions people think they’re answering.
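To make the narrow question concrete, here is a minimal sketch of a permutation test in base R. The data are simulated and every name in it (`control`, `treatment`, `null_diffs`) is hypothetical. The p-value is simply the share of label shuffles that produce a difference at least as extreme as the observed one.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Hypothetical two-group data; the true difference in means is 0.3 SD
control   <- rnorm(40, mean = 0)
treatment <- rnorm(40, mean = 0.3)
observed  <- mean(treatment) - mean(control)

# Under the null, group labels are exchangeable, so shuffle them
pooled <- c(control, treatment)
null_diffs <- replicate(5000, {
  shuffled <- sample(pooled)
  mean(shuffled[41:80]) - mean(shuffled[1:40])
})

# Two-sided p-value: share of shuffles at least as extreme as observed
p_value <- mean(abs(null_diffs) >= abs(observed))
p_value
```

Nothing in this calculation refers to how large a difference would matter clinically. That information has to come from outside the test.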

Why Significance Feels So Comforting

Statistical significance persists because it offers:

a binary decision,
a familiar threshold,
a sense of objectivity,
and an illusion of certainty.

In messy, high-stakes environments, that certainty is emotionally appealing.

Unfortunately, it’s often false.

Sample Size Turns Trivial Effects Into “Discoveries”

One of the most dangerous properties of p-values:

With enough data, almost any nonzero effect becomes statistically significant, however small.

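set.seed(123)  # fix the seed so the p-values are reproducible

# Simulate a weak linear effect and return the p-value for the slope
simulate_p <- function(n, effect) {
  x <- rnorm(n)
  y <- x * effect + rnorm(n)
  summary(lm(y ~ x))$coefficients[2, 4]  # row 2 = slope, column 4 = p-value
}

# Same true effect (0.05), three sample sizes
sapply(c(50, 500, 5000), simulate_p, effect = 0.05)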

The effect didn’t change. The story did.

Large N converts trivial effects into confident-sounding findings (Cohen 1994).
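A single simulated p-value per sample size is noisy, so a slightly more careful sketch repeats the simulation and tracks two things: the average estimated slope and the share of runs with p < 0.05. The helper names `simulate_fit` and `summarise_n`, and the choice of 200 repetitions, are assumptions made for illustration.

```r
set.seed(2024)

# Same data-generating process as above, but keep the estimate as well
simulate_fit <- function(n, effect) {
  x <- rnorm(n)
  y <- x * effect + rnorm(n)
  coefs <- summary(lm(y ~ x))$coefficients
  c(estimate = unname(coefs[2, 1]), p_value = unname(coefs[2, 4]))
}

# Repeat the simulation and summarise estimate and significance rate
summarise_n <- function(n, effect = 0.05, reps = 200) {
  sims <- replicate(reps, simulate_fit(n, effect))  # 2 x reps matrix
  c(n = n,
    mean_estimate = mean(sims["estimate", ]),
    share_significant = mean(sims["p_value", ] < 0.05))
}

results <- t(sapply(c(50, 500, 5000), summarise_n))
results
```

The average estimate should sit near 0.05 at every sample size, while the share of significant runs climbs steeply with n. Only the second number depends on how much data you collected.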

A Narrow Confidence Interval Can Still Be Useless

You’ve probably seen results like this:

p < 0.001
95% CI: (1.01, 1.03)

Technically impressive. Clinically irrelevant.

Statistical precision does not equal practical importance.
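One way to make this explicit is to compare the interval against a pre-specified relevance threshold rather than against the null value alone. The sketch below assumes the interval above is on a ratio scale, such as an odds ratio, and uses a hypothetical threshold of 1.20. The real threshold has to come from clinicians, not from the data.

```r
# Hypothetical smallest ratio worth acting on (an assumption)
min_important_ratio <- 1.20

# Interval quoted above, assumed here to be on a ratio scale
ci <- c(lower = 1.01, upper = 1.03)

# Classify an interval relative to the null value and the threshold
classify_interval <- function(ci, threshold, null_value = 1) {
  if (ci[["lower"]] >= threshold) {
    "clinically meaningful effect"
  } else if (ci[["upper"]] < threshold && ci[["lower"]] > null_value) {
    "statistically detectable but below the relevance threshold"
  } else if (ci[["upper"]] < threshold) {
    "compatible with no effect; a meaningful effect is excluded"
  } else {
    "inconclusive: interval spans the relevance threshold"
  }
}

classify_interval(ci, min_important_ratio)
# returns "statistically detectable but below the relevance threshold"
```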

R², Significance, and the Illusion of Explanation

A model can be:

statistically significant,
precisely estimated,
and explain almost none of the outcome.

This is not rare. It’s common.

summary(lm(outcome ~ predictor, data = data))$r.squared

A low R² doesn’t mean the model is “wrong.” But significance alone does not mean it’s useful.
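A small simulation shows how easily the two diverge. The data below are synthetic, with a deliberately weak true slope of 0.05 and a large sample. Both values are assumptions chosen to make the point.

```r
set.seed(7)

# Synthetic data: weak true effect, large sample
n <- 20000
predictor <- rnorm(n)
outcome <- 0.05 * predictor + rnorm(n)

fit_summary <- summary(lm(outcome ~ predictor))

# Slope p-value next to the share of variance explained
fit_stats <- c(
  p_value   = fit_summary$coefficients[2, 4],
  r_squared = fit_summary$r.squared
)
fit_stats
```

Under these settings the slope is detected with a very small p-value, yet the predictor accounts for well under 1% of the variance in the outcome.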

Binary Thinking Breaks Clinical Decisions

Clinical decisions are rarely yes/no.

They are:

risk-based,
threshold-driven,
context-dependent,
asymmetric in harm.

Statistical significance enforces a binary worldview onto a probabilistic reality.
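The asymmetry in harm can be written down directly. If a false positive costs C_FP and a false negative costs C_FN, expected loss is minimised by acting whenever the predicted risk exceeds C_FP / (C_FP + C_FN). The sketch below uses hypothetical costs and hypothetical predicted risks. Eliciting the real costs is the hard part.

```r
# Risk threshold that minimises expected loss for given error costs
decision_threshold <- function(cost_false_positive, cost_false_negative) {
  cost_false_positive / (cost_false_positive + cost_false_negative)
}

# Hypothetical: a missed case is 9 times as harmful as a false alarm
threshold <- decision_threshold(cost_false_positive = 1,
                                cost_false_negative = 9)

# Hypothetical predicted risks for four patients
predicted_risk <- c(0.03, 0.08, 0.15, 0.40)
act <- predicted_risk > threshold

threshold  # 0.1
act        # FALSE FALSE TRUE TRUE
```

No p-value appears anywhere in this rule. The decision depends on the estimated risk and on the relative cost of each kind of error.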

What Decision-Makers Actually Want to Know

Replace:

“Is this significant?”

With:

How big is the effect?
How uncertain is it?
How often would it matter?
Who benefits?
What’s the cost of being wrong?

Significance answers none of these.

Better Questions (and Better Metrics)

Effect Sizes and Uncertainty

confint(lm(outcome ~ predictor, data = data))

Magnitude beats stars.

Probabilities of Clinically Meaningful Effects

mean(effect_samples > clinically_relevant_threshold)

This answers:

“How likely is this to matter?”
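The one-liner above presumes that `effect_samples` already holds draws from a posterior distribution for the effect. As a rough sketch, and not a substitute for a full Bayesian model, those draws can be approximated from an ordinary regression fit by assuming a flat prior and a normal posterior centred on the estimate. The data, the 0.25 threshold, and the variable names below are all hypothetical.

```r
set.seed(11)

# Hypothetical trial-like data with a true treatment effect of 0.4
n <- 300
treated <- rbinom(n, 1, 0.5)
outcome <- 2 + 0.4 * treated + rnorm(n)

fit <- lm(outcome ~ treated)
est <- coef(fit)[["treated"]]
se  <- sqrt(vcov(fit)["treated", "treated"])

# Assumption: flat prior, so the posterior is approximately
# normal with mean = estimate and sd = standard error
effect_samples <- rnorm(10000, mean = est, sd = se)

# Hypothetical smallest effect that would change practice
clinically_relevant_threshold <- 0.25

prob_meaningful <- mean(effect_samples > clinically_relevant_threshold)
prob_meaningful
```

With an informative prior or a hierarchical model, the draws would come from the fitted model instead, but the final line stays the same.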

Calibration and Error Tradeoffs

In prediction:

false negatives often dominate harm,
thresholds matter more than p-values.

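library(yardstick)

# `preds` is assumed to be a data frame with a factor column `truth`
# and a numeric column `.pred` (predicted probability of the event level).
# Both metrics summarise discrimination; calibration needs a separate check.
roc_auc(preds, truth, .pred)
pr_auc(preds, truth, .pred)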
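AUC-type metrics only measure ranking, so a calibration check has to be run separately. A minimal base R sketch, using synthetic data and a deliberately overconfident hypothetical model, fits a logistic recalibration model and reads off the calibration slope. A slope near 1 suggests well-scaled predictions, and a slope well below 1 suggests predictions that are too extreme.

```r
set.seed(3)

# Synthetic truth: outcomes generated from known log-odds
n <- 5000
true_lp <- rnorm(n, mean = -2, sd = 1)
truth <- rbinom(n, 1, plogis(true_lp))

# Hypothetical overconfident model: log-odds stretched by 1.8,
# which leaves the rank order of patients unchanged
pred <- plogis(1.8 * true_lp)

# AUC from the rank-sum (Mann-Whitney) formula
auc <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

auc_overconfident <- auc(pred, truth)
auc_calibrated    <- auc(plogis(true_lp), truth)

# Calibration slope: regress the outcome on the logit of the prediction
cal_fit <- glm(truth ~ qlogis(pred), family = binomial)
calibration_slope <- coef(cal_fit)[[2]]

c(auc_overconfident = auc_overconfident,
  auc_calibrated = auc_calibrated,
  calibration_slope = calibration_slope)
```

The two AUC values match because stretching the log-odds does not reorder patients, while the calibration slope lands well below 1. Discrimination metrics alone would miss this failure.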

When Significance Actively Causes Harm

Statistical significance becomes dangerous when it:

justifies premature deployment,
silences uncertainty,
masks bias,
discourages sensitivity analysis,
or shuts down further questioning.

At that point, it’s no longer a statistical tool—it’s rhetoric.

Sensitivity Analysis Beats Significance Every Time

A result that:

survives reasonable perturbations,
holds across definitions,
and degrades gracefully,

…is more trustworthy than one with a tiny p-value.

Robustness > rejection.
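In practice, a basic sensitivity analysis can be as simple as refitting the model under several defensible specifications and comparing the estimates side by side. The sketch below uses synthetic data in which `severity` confounds the exposure effect. The variable names and the three specifications are hypothetical.

```r
set.seed(5)

# Synthetic data: severity drives both exposure and outcome
n <- 400
dat <- data.frame(age = rnorm(n, 40, 12), severity = rnorm(n))
dat$exposure <- rbinom(n, 1, plogis(0.5 * dat$severity))
dat$outcome  <- 1 + 0.3 * dat$exposure + 0.5 * dat$severity +
  0.01 * dat$age + rnorm(n)

# Alternative, defensible model specifications
specs <- list(
  unadjusted     = outcome ~ exposure,
  plus_severity  = outcome ~ exposure + severity,
  fully_adjusted = outcome ~ exposure + severity + age
)

# Exposure estimate and 95% CI under each specification
sensitivity <- t(sapply(specs, function(f) {
  fit <- lm(f, data = dat)
  ci  <- confint(fit)["exposure", ]
  c(estimate = coef(fit)[["exposure"]], lower = ci[[1]], upper = ci[[2]])
}))

sensitivity
```

If the estimate swings widely between rows, the headline result depends on an analytic choice, and that dependence belongs in the report regardless of what any single p-value says.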

What to Say Instead (Reviewer-Safe Language)

Instead of:

“The result was statistically significant.”

Try:

“The estimated effect was modest but consistent.”
“Uncertainty remains substantial.”
“Results were sensitive to assumptions X and Y.”
“Clinical relevance depends on context.”

These statements signal credibility, not weakness.

A Better Stopping Rule

A defensible analysis stops when:

the decision context is addressed,
uncertainty is quantified,
sensitivity is explored,
and limitations are explicit.

Not when a threshold is crossed.

Where This Shows Up in AI/ML

AI model comparison studies routinely report that model A is “significantly better” than model B based on a p-value for the AUC difference, when the actual AUC gap is 0.003, a difference with no clinical meaning in any trauma decision support context. In model development on the Department of Defense Trauma Registry (DoDTR), a 2-point AUC improvement that reaches statistical significance in a 10,000-record registry is clinically meaningless if neither model is well calibrated for the high-acuity presentations that drive mortality: penetrating injury, hemorrhagic shock, and blast polytrauma. A MAVEN alert that fires 3% more accurately on average but remains poorly calibrated for the cases that matter most has not improved care. At best, statistical significance tells you the difference is unlikely to be sampling noise alone; it says nothing about whether the difference is large enough to act on.
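When comparing two models, reporting the size of the AUC gap with an interval is more informative than a p-value for the gap. The following is a rough sketch using a paired bootstrap on synthetic data. The 10,000 records, the scores `score_a` and `score_b`, and the outcome `died` are all simulated stand-ins, not registry data.

```r
set.seed(9)

# Synthetic registry-sized dataset with two similar risk scores
n <- 10000
lp <- rnorm(n, mean = -2.5, sd = 1.2)
died <- rbinom(n, 1, plogis(lp))
score_a <- lp + rnorm(n, sd = 0.50)  # hypothetical model A
score_b <- lp + rnorm(n, sd = 0.55)  # hypothetical model B, slightly noisier

# AUC from the rank-sum (Mann-Whitney) formula
auc <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

observed_diff <- auc(score_a, died) - auc(score_b, died)

# Paired bootstrap: resample patients, score both models on each resample
boot_diff <- replicate(500, {
  idx <- sample.int(n, replace = TRUE)
  auc(score_a[idx], died[idx]) - auc(score_b[idx], died[idx])
})

c(observed = observed_diff, quantile(boot_diff, c(0.025, 0.975)))
```

The interval shows how small the gap is on the AUC scale, which is the number a reader needs in order to judge whether the difference could change any decision. It still says nothing about calibration in high-acuity subgroups, which needs its own check.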

Closing: Significance Is a Tool, Not a Verdict

Statistical significance is not useless.

It’s just insufficient.

When it becomes the stopping rule (Wasserstein et al. 2019; Cumming 2014), it:

narrows thinking,
overstates certainty,
and weakens decisions.

Good applied statistics doesn’t ask:

“Is this significant?”

It asks:

“What does this mean, for whom, and at what cost?”

That’s where rigor actually lives.

Series Callout

This post is part of a broader Trauma Registry and Other Topics Series:

Why Most Clinical Models Fail in the Real World (and How to Fix Them in R)
Audit-Ready Applied Statistics: How to Make Your R Analysis Defensible
Bayesian Models for Clinicians Who Hate Math (But Love Good Decisions)
Missing Data Is the Real Model: Practical Strategies in R
From Registry to Knowledge: How to Analyze Messy Trauma Data Without Lying to Yourself
Why Statistical Significance Is a Terrible Stopping Rule
Hierarchical Models Are Not Optional in Healthcare (Here’s Why)
Prediction ≠ Causation: How to Use Each Correctly in Applied Statistics
How to Evaluate Models When the Outcome Is Rare (and Lives Are at Stake)
Building Clinical Decision Support That Doesn’t Collapse Under Scrutiny
Rare Event Modeling in Clinical Prediction: Why 1% Outcomes Break Your Model (And What to Do in R)
Calibration Under Drift: How Clinical Models Become Confident and Wrong (And How to Monitor It in R)
Audit-Ready Bayesian Workflows: Why Transparency Is a Process, Not a Model Feature
Missing Data in Hierarchical Clinical Models: Why Structure Changes the Problem
MNAR Sensitivity Analysis for Applied Work: What to Do When Missingness Depends on Reality

References

Amrhein, Valentin, Sander Greenland, and Blake McShane. 2019. “Scientists Rise up Against Statistical Significance.” Nature 567 (7748): 305–7. https://doi.org/10.1038/d41586-019-00857-9.

Cohen, Jacob. 1994. “The Earth Is Round (p < .05).” American Psychologist 49 (12): 997–1003. https://doi.org/10.1037/0003-066X.49.12.997.

Cumming, Geoff. 2014. “The New Statistics: Why and How.” Psychological Science 25 (1): 7–29. https://doi.org/10.1177/0956797613504966.

Wasserstein, Ronald L., and Nicole A. Lazar. 2016. “The ASA’s Statement on p-Values: Context, Process, and Purpose.” The American Statistician 70 (2): 129–33. https://doi.org/10.1080/00031305.2016.1154108.

Wasserstein, Ronald L., Allen L. Schirm, and Nicole A. Lazar. 2019. “Moving to a World Beyond ‘p < 0.05’.” The American Statistician 73 (sup1): 1–19. https://doi.org/10.1080/00031305.2019.1583913.