Why Statistical Significance Is a Terrible Stopping Rule

Trauma Registry and Other Topics
Statistical Inference
Why p-values are a weak stopping rule in clinical, operational, and AI-enabled analyses, and what to report instead.
Published: April 1, 2024

Modified: June 9, 2026

Executive Summary

“Statistically significant” is not a conclusion.
It is a checkpoint—and a weak one.

Yet in clinical research, operational analytics, and applied machine learning, statistical significance is often treated as:

  • the final verdict,
  • the justification for action,
  • or proof that a model “works.”

This post explains why that mindset fails—and what to replace it with if your goal is better decisions, not just publishable results.


What Statistical Significance Actually Answers

A p-value answers this question:

If the null hypothesis were true, how surprising would data at least this extreme be?

That’s it.

It is a much narrower question than most readers assume, and by itself it does not measure effect importance, clinical relevance, or decision value (Wasserstein and Lazar 2016; Amrhein et al. 2019).

It does not answer:

  • Is the effect important?
  • Is it meaningful?
  • Is it actionable?
  • Is it stable?
  • Is it worth acting on?

Yet those are the questions people think they’re answering.
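
To make that narrowness concrete, here is a minimal sketch on simulated data. The group sizes, the 0.3 shift, and every variable name are illustrative assumptions, not values from any registry. It computes a permutation p-value by hand: the share of label-shuffled datasets whose mean difference is at least as extreme as the observed one.

```r
set.seed(42)

# Hypothetical data: two groups of 40, true mean difference of 0.3
group <- rep(c("a", "b"), each = 40)
y     <- rnorm(80, mean = ifelse(group == "b", 0.3, 0))

# Difference in means (b minus a)
mean_diff <- function(y, g) mean(y[g == "b"]) - mean(y[g == "a"])
observed  <- mean_diff(y, group)

# Null distribution: shuffle labels so group cannot matter by construction
null_diffs <- replicate(5000, mean_diff(y, sample(group)))

# Two-sided p-value: how often is a null difference at least as extreme?
p_value <- mean(abs(null_diffs) >= abs(observed))
p_value
```

Nothing in that calculation refers to a clinically relevant difference, to costs, or to who benefits. It only compares the data against a world in which the null is true.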


Why Significance Feels So Comforting

Statistical significance persists because it offers: - a binary decision, - a familiar threshold, - a sense of objectivity, - and an illusion of certainty.

In messy, high-stakes environments, that certainty is emotionally appealing.

Unfortunately, it’s often false.


Sample Size Turns Trivial Effects Into “Discoveries”

One of the most dangerous properties of p-values:

With enough data, almost any nonzero effect, however trivial, becomes statistically significant.

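set.seed(2024)  # make the simulated p-values reproducible

# p-value for the slope when the true effect is fixed and only n varies
simulate_p <- function(n, effect) {
  x <- rnorm(n)
  y <- x * effect + rnorm(n)
  summary(lm(y ~ x))$coefficients[2, 4]  # row 2 = slope, column 4 = Pr(>|t|)
}

# Same tiny slope (0.05), three sample sizes
sapply(c(50, 500, 5000), simulate_p, effect = 0.05)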

The effect didn’t change. The story did.

Large N converts trivial effects into confident-sounding findings (Cohen 1994).
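
A single run per sample size can be lucky or unlucky, so a slightly more careful sketch repeats the simulation and counts how often p < 0.05 occurs. The number of replicates (200) and the extra sample size of 20,000 are arbitrary choices for illustration.

```r
set.seed(1)

# Same data-generating process as simulate_p above, redefined here so the
# block runs on its own
simulate_p <- function(n, effect) {
  x <- rnorm(n)
  y <- x * effect + rnorm(n)
  summary(lm(y ~ x))$coefficients[2, 4]
}

# Share of 200 replicates with p < 0.05 at each sample size
sample_sizes <- c(50, 500, 5000, 20000)
share_significant <- sapply(sample_sizes, function(n) {
  mean(replicate(200, simulate_p(n, effect = 0.05)) < 0.05)
})

data.frame(n = sample_sizes, share_significant)
```

The slope is 0.05 in every row. What changes with n is only how reliably the test flags it, which is a statement about power, not about importance.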


A Narrow Confidence Interval Can Still Be Useless

You’ve probably seen results like this:

  • p < 0.001
  • 95% CI: (1.01, 1.03)

Technically impressive. Clinically irrelevant.

Statistical precision does not equal practical importance.
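
One way to make this check routine is to compare the interval against the smallest effect that would change practice, rather than against the null. The sketch below assumes the effect is an odds ratio and invents a cohort of 200,000 patients with a true odds ratio of 1.03. The 1.10 bar is a placeholder, and in real work it has to come from clinicians, not from the data.

```r
set.seed(7)

# Hypothetical large cohort: binary exposure with a true odds ratio of 1.03
n        <- 200000
exposure <- rbinom(n, 1, 0.5)
outcome  <- rbinom(n, 1, plogis(-1 + log(1.03) * exposure))

fit <- glm(outcome ~ exposure, family = binomial)

# Wald 95% CI on the odds-ratio scale
or_ci <- exp(confint.default(fit)["exposure", ])

# Assumed smallest odds ratio that would change practice (illustrative only)
smallest_or_worth_acting_on <- 1.10

or_ci
# TRUE when even the optimistic end of the interval falls short of that bar
unname(or_ci[2]) < smallest_or_worth_acting_on
```

A narrow interval that sits entirely below the bar is strong evidence that the effect is too small to matter, which is the opposite of how a tiny p-value is usually read.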


R², Significance, and the Illusion of Explanation

A model can be:

  • statistically significant,
  • precisely estimated,
  • and explain almost none of the outcome.

This is not rare. It’s common.

# `data`, `outcome`, and `predictor` are placeholders for your own data frame and columns
summary(lm(outcome ~ predictor, data = data))$r.squared

A low R² doesn’t mean the model is “wrong.” But significance alone does not mean it’s useful.
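
A small simulation shows how easily the two diverge. The slope of 0.1, the noise level, and the sample size of 5,000 are assumptions chosen only to make the point visible.

```r
set.seed(11)

# Hypothetical data: a real but small effect (slope 0.1, noise SD 1)
n         <- 5000
predictor <- rnorm(n)
outcome   <- 0.1 * predictor + rnorm(n)

fit_summary <- summary(lm(outcome ~ predictor))

slope_p   <- fit_summary$coefficients["predictor", "Pr(>|t|)"]
r_squared <- fit_summary$r.squared

c(p_value = slope_p, r_squared = r_squared)
```

By construction the predictor accounts for roughly 1% of the outcome variance, yet with 5,000 rows the slope is estimated precisely enough to be flagged as significant.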


Binary Thinking Breaks Clinical Decisions

Clinical decisions are rarely yes/no.

They are:

  • risk-based,
  • threshold-driven,
  • context-dependent,
  • asymmetric in harm.

Statistical significance enforces a binary worldview onto a probabilistic reality.


What Decision-Makers Actually Want to Know

Replace:

“Is this significant?”

With:

  • How big is the effect?
  • How uncertain is it?
  • How often would it matter?
  • Who benefits?
  • What’s the cost of being wrong?

Significance answers none of these.


Better Questions (and Better Metrics)

8.1 Effect Sizes and Uncertainty

confint(lm(outcome ~ predictor, data = data))

Magnitude beats stars.


8.2 Probabilities of Clinically Meaningful Effects

# Share of posterior (or bootstrap) draws of the effect that exceed the threshold
mean(effect_samples > clinically_relevant_threshold)

This answers:

“How likely is this to matter?”
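
The one-liner above needs draws of the effect from somewhere. As a rough stand-in for posterior samples, the sketch below uses a nonparametric bootstrap of a regression slope on simulated data. The data, the true slope of 0.25, and the threshold of 0.2 are all assumptions, and a bootstrap share only approximates a posterior probability; a fitted Bayesian model would supply the draws directly.

```r
set.seed(21)

# Hypothetical data: true slope of 0.25 per unit of predictor
n    <- 400
data <- data.frame(predictor = rnorm(n))
data$outcome <- 0.25 * data$predictor + rnorm(n)

# Bootstrap the slope: resample rows, refit, keep the coefficient
effect_samples <- replicate(2000, {
  idx <- sample(n, replace = TRUE)
  coef(lm(outcome ~ predictor, data = data[idx, ]))[["predictor"]]
})

# Assumed smallest effect that would matter clinically (illustrative only)
clinically_relevant_threshold <- 0.2

mean(effect_samples > clinically_relevant_threshold)
```

The result is a number between 0 and 1 that a decision-maker can weigh against costs, which a p-value against zero cannot offer.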


8.3 Calibration and Error Tradeoffs

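In prediction:

  • false negatives often dominate harm,
  • thresholds matter more than p-values.

library(yardstick)

# preds: data frame with a factor column `truth` and a probability column `.pred`
# These two metrics summarize discrimination; calibration needs a separate check
roc_auc(preds, truth, .pred)
pr_auc(preds, truth, .pred)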
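
AUC-type metrics depend only on how cases are ranked, so they cannot see miscalibration or the cost of a given threshold. The base-R sketch below, on simulated predictions, adds both checks: a decile calibration table and a total cost curve over thresholds. The 10:1 cost ratio for false negatives versus false positives and the deliberately distorted probabilities are assumptions for illustration.

```r
set.seed(5)

# Hypothetical predictions: true risk, plus a score with the same ranking
# but distorted probabilities (logits stretched by 1.5)
n         <- 20000
true_risk <- plogis(rnorm(n, mean = -2, sd = 1))
truth     <- rbinom(n, 1, true_risk)
pred      <- plogis(1.5 * qlogis(true_risk))

# Calibration: mean predicted vs. observed event rate within deciles of pred
decile <- cut(pred, quantile(pred, 0:10 / 10), include.lowest = TRUE, labels = FALSE)
calibration <- data.frame(
  mean_predicted = as.numeric(tapply(pred, decile, mean)),
  observed_rate  = as.numeric(tapply(truth, decile, mean))
)

# Error tradeoff: assumed cost of 10 per false negative, 1 per false positive
total_cost <- function(threshold, fn_cost = 10, fp_cost = 1) {
  flagged <- pred >= threshold
  fn_cost * sum(!flagged & truth == 1) + fp_cost * sum(flagged & truth == 0)
}
thresholds <- seq(0.05, 0.5, by = 0.05)
costs      <- sapply(thresholds, total_cost)

calibration
thresholds[which.min(costs)]
```

Because `pred` is a monotone transform of the true risk, its AUC is identical to that of a perfectly calibrated model. Only the calibration table and the cost curve reveal the problem.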

When Significance Actively Causes Harm

Statistical significance becomes dangerous when it:

  • justifies premature deployment,
  • silences uncertainty,
  • masks bias,
  • discourages sensitivity analysis,
  • or shuts down further questioning.

At that point, it’s no longer a statistical tool—it’s rhetoric.


Sensitivity Analysis Beats Significance Every Time

A result that:

  • survives reasonable perturbations,
  • holds across definitions,
  • and degrades gracefully,

…is more trustworthy than one with a tiny p-value.

Robustness > rejection.
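
In code, a sensitivity analysis can be as plain as refitting the same model under a handful of defensible alternatives and reading the estimates side by side. The sketch below uses simulated data. The variable names, the four specifications, and the cutoffs (2.5 SD, age 65) are hypothetical examples, not a recommended set.

```r
set.seed(33)

# Hypothetical registry-like data (all names and effects are illustrative)
n    <- 1000
data <- data.frame(age = rnorm(n, 45, 15), predictor = rnorm(n))
data$outcome <- 0.3 * data$predictor + 0.02 * data$age + rnorm(n)

# Each entry is one reasonable perturbation of the primary analysis
specs <- list(
  primary       = list(formula = outcome ~ predictor,       rows = rep(TRUE, n)),
  adjusted      = list(formula = outcome ~ predictor + age, rows = rep(TRUE, n)),
  trim_outliers = list(formula = outcome ~ predictor,
                       rows = abs(as.numeric(scale(data$outcome))) < 2.5),
  under_65_only = list(formula = outcome ~ predictor,       rows = data$age < 65)
)

# Refit under each specification and collect the estimate with its 95% CI
sensitivity <- do.call(rbind, lapply(names(specs), function(name) {
  fit <- lm(specs[[name]]$formula, data = data[specs[[name]]$rows, ])
  ci  <- confint(fit)["predictor", ]
  data.frame(spec = name, estimate = coef(fit)[["predictor"]],
             lower = ci[[1]], upper = ci[[2]])
}))

sensitivity
```

If the estimate and interval move little across rows, the finding is robust to those choices. If one row flips the conclusion, that row is the result worth reporting.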


What to Say Instead (Reviewer-Safe Language)

Instead of:

“The result was statistically significant.”

Try:

  • “The estimated effect was modest but consistent.”
  • “Uncertainty remains substantial.”
  • “Results were sensitive to assumptions X and Y.”
  • “Clinical relevance depends on context.”

These statements signal credibility, not weakness.


A Better Stopping Rule

A defensible analysis stops when:

  • the decision context is addressed,
  • uncertainty is quantified,
  • sensitivity is explored,
  • and limitations are explicit.

Not when a threshold is crossed.


Note: Where This Shows Up in AI/ML

AI model comparison studies routinely report that model A is “significantly better” than model B based on a p-value for the AUC difference, when the actual AUC gap is 0.003, a difference with no clinical meaning in any trauma decision support context. In DoDTR-based model development, a 2-point AUC improvement that reaches statistical significance in a 10,000-record registry is clinically meaningless if neither model is well calibrated for the high-acuity presentations (penetrating injury, hemorrhagic shock, blast polytrauma) that drive mortality. A MAVEN alert that fires 3% more accurately on average but remains poorly calibrated for the cases that matter most has not improved care. At best, statistical significance suggests the difference is unlikely to be chance alone; it says nothing about whether the difference is large enough to act on.
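
For model comparisons, reporting the size of the AUC gap with an interval is usually more informative than a p-value for the difference. The sketch below uses a paired bootstrap on simulated scores. The two "models", the 10,000 records, and the noise levels are invented for illustration and have no connection to DoDTR or MAVEN data.

```r
set.seed(99)

# Rank-based (Mann-Whitney) AUC; assumes truth is coded 0/1
auc <- function(score, truth) {
  n_pos <- sum(truth == 1)
  n_neg <- sum(truth == 0)
  (sum(rank(score)[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical scores from two models on the same 10,000 records
n       <- 10000
signal  <- rnorm(n)
truth   <- rbinom(n, 1, plogis(-2 + signal))
score_a <- signal + rnorm(n, sd = 0.50)
score_b <- signal + rnorm(n, sd = 0.55)  # slightly noisier model

# Paired bootstrap: resample records, recompute both AUCs on the same rows
auc_diff <- replicate(1000, {
  idx <- sample(n, replace = TRUE)
  auc(score_a[idx], truth[idx]) - auc(score_b[idx], truth[idx])
})

observed_diff <- auc(score_a, truth) - auc(score_b, truth)
c(observed_diff, quantile(auc_diff, c(0.025, 0.975)))
```

Whether the interval excludes zero matters less than whether its upper end reaches a gap anyone would act on, and neither question replaces a calibration check in the high-acuity subgroups.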

Closing: Significance Is a Tool, Not a Verdict

Statistical significance is not useless.

It’s just insufficient.

When it becomes the stopping rule (Wasserstein et al. 2019; Cumming 2014), it:

  • narrows thinking,
  • overstates certainty,
  • and weakens decisions.

Good applied statistics doesn’t ask:

“Is this significant?”

It asks:

“What does this mean, for whom, and at what cost?”

That’s where rigor actually lives.



Series Callout

Note

This post is part of a broader Trauma Registry and Other Topics Series:

  • Why Most Clinical Models Fail in the Real World (and How to Fix Them in R)
  • Audit-Ready Applied Statistics: How to Make Your R Analysis Defensible
  • Bayesian Models for Clinicians Who Hate Math (But Love Good Decisions)
  • Missing Data Is the Real Model: Practical Strategies in R
  • From Registry to Knowledge: How to Analyze Messy Trauma Data Without Lying to Yourself
  • Why Statistical Significance Is a Terrible Stopping Rule
  • Hierarchical Models Are Not Optional in Healthcare (Here’s Why)
  • Prediction ≠ Causation: How to Use Each Correctly in Applied Statistics
  • How to Evaluate Models When the Outcome Is Rare (and Lives Are at Stake)
  • Building Clinical Decision Support That Doesn’t Collapse Under Scrutiny
  • Rare Event Modeling in Clinical Prediction: Why 1% Outcomes Break Your Model (And What to Do in R)
  • Calibration Under Drift: How Clinical Models Become Confident and Wrong (And How to Monitor It in R)
  • Audit-Ready Bayesian Workflows: Why Transparency Is a Process, Not a Model Feature
  • Missing Data in Hierarchical Clinical Models: Why Structure Changes the Problem
  • MNAR Sensitivity Analysis for Applied Work: What to Do When Missingness Depends on Reality

References

Amrhein, Valentin, Sander Greenland, and Blake McShane. 2019. “Scientists Rise up Against Statistical Significance.” Nature 567 (7748): 305–7. https://doi.org/10.1038/d41586-019-00857-9.
Cohen, Jacob. 1994. “The Earth Is Round (p < .05).” American Psychologist 49 (12): 997–1003. https://doi.org/10.1037/0003-066X.49.12.997.
Cumming, Geoff. 2014. “The New Statistics: Why and How.” Psychological Science 25 (1): 7–29. https://doi.org/10.1177/0956797613504966.
Wasserstein, Ronald L., and Nicole A. Lazar. 2016. “The ASA’s Statement on p-Values: Context, Process, and Purpose.” The American Statistician 70 (2): 129–33. https://doi.org/10.1080/00031305.2016.1154108.
Wasserstein, Ronald L., Allen L. Schirm, and Nicole A. Lazar. 2019. “Moving to a World Beyond ‘p < 0.05’.” The American Statistician 73 (sup1): 1–19. https://doi.org/10.1080/00031305.2019.1583913.