Beyond p < 0.05: Hypothesis Testing in the Age of AI

Applied Statistics
Hypothesis Testing
A practical guide to null and alternative hypotheses, p-values, error rates, power, and why thresholded significance is too limited for modern analytic work.
Published: September 15, 2023
Modified: June 9, 2026

Executive Summary

Hypothesis testing is one of the most widely taught and most widely misunderstood tools in statistics (Casella and Berger 2002; Wasserstein and Lazar 2016).

At its best, it offers a structured way to ask whether observed data are compatible with a null claim. At its worst, it becomes a ritual:

  • compute a p-value,
  • compare it to 0.05,
  • declare “significant” or “not significant,”
  • and stop thinking.

That approach was always too simplistic. In the age of AI, it is even less adequate.

Modern analytical work requires more than thresholding p-values. It requires understanding:

  • what the null and alternative hypotheses actually mean,
  • how Type I and Type II errors trade off,
  • why p-values do not measure effect size or practical importance,
  • and how power and sample size shape what a study can detect.

This matters in both biostatistics and machine learning.

In biostatistics, hypothesis testing appears in:

  • trials,
  • subgroup analyses,
  • safety monitoring,
  • and baseline comparisons.

In AI/ML, it appears in:

  • A/B testing,
  • model comparisons,
  • feature evaluation,
  • and repeated validation workflows.

A p-value is not a truth detector. It is one piece of evidence, conditional on a model, a null hypothesis, and a set of assumptions.


Hypothesis Testing Is a Framework for Structured Skepticism

Hypothesis testing begins with a question.

Not:

“Is this effect real?”

but rather:

“If a specific null claim were true, how surprising would the observed data be?”

This is a different and more limited question than many people realize.

The hypothesis testing framework is built around two competing statements:

  • the null hypothesis,
  • the alternative hypothesis.

The procedure then evaluates whether the data are sufficiently inconsistent with the null to justify rejecting it (Casella and Berger 2002).

This can be useful. But it only works well when the hypotheses, assumptions, and interpretation are clear.
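One way to make the question "how surprising would the data be under the null?" concrete is an exact permutation test. The sketch below is illustrative only: the ten outcome values are made up, and the permutation approach stands in for the general logic of testing, not for the specific tests used later in this post. If group labels are irrelevant (the null), every way of splitting the pooled values into two groups of five is equally likely, so we can enumerate all of them and ask how often a split produces a difference at least as large as the one observed.

```r
# Hypothetical outcomes for two small groups (made-up values for illustration)
control   <- c(8.1, 9.4, 10.2, 9.9, 11.0)
treatment <- c(10.8, 12.1, 11.5, 13.0, 12.4)

pooled  <- c(control, treatment)
n_treat <- length(treatment)

# Difference in means (treatment minus control) for a given set of
# positions in `pooled` that are labelled "treatment"
mean_diff <- function(treat_idx) {
  mean(pooled[treat_idx]) - mean(pooled[-treat_idx])
}

# Observed labelling: the last five positions are the treatment group
observed_diff <- mean_diff(seq(length(control) + 1, length(pooled)))

# Under the null, every relabelling is equally likely: enumerate all of them
all_splits <- combn(length(pooled), n_treat)
null_diffs <- apply(all_splits, 2, mean_diff)

# Two-sided p-value: share of relabellings at least as extreme as observed
# (the small tolerance guards against floating-point ties)
p_perm <- mean(abs(null_diffs) >= abs(observed_diff) - 1e-12)

observed_diff
p_perm
```

The p-value here is nothing more than a proportion of hypothetical relabellings. It says how unusual the observed split is if labels do not matter, and nothing about how likely the null itself is.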


The Null and Alternative Hypotheses Are Not Just Formalities

The null hypothesis usually represents a status quo or no-effect claim.

For example:

\[ H_0: \mu_1 = \mu_2 \]

The alternative hypothesis represents a competing claim.

For example:

\[ H_A: \mu_1 \neq \mu_2 \]

or perhaps a directional version such as:

\[ H_A: \mu_1 > \mu_2 \]

The choice matters.

A two-sided alternative asks whether there is any difference. A one-sided alternative asks whether the difference goes in a particular direction.

This is not merely cosmetic. It changes the test, the interpretation, and the evidentiary threshold.

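In practice, weakly justified one-sided testing is one of the easiest ways to overstate findings. For a symmetric test statistic such as t, the one-sided p-value is half the two-sided one whenever the effect falls in the hypothesized direction, so a direction chosen after seeing the data can turn a borderline result into a nominally significant one.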
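The effect of the choice of alternative can be seen with a single test statistic. This is a small sketch with assumed values: the t statistic of 1.8 and the 58 degrees of freedom are hypothetical and chosen only because they sit between the one-sided and two-sided critical values.

```r
# Hypothetical test statistic and degrees of freedom (illustrative values)
t_stat <- 1.8
df     <- 58

# Two-sided: extreme values in either direction count against the null
p_two_sided <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# One-sided (H_A: mu_1 > mu_2): only large positive values count
p_one_sided <- pt(t_stat, df = df, lower.tail = FALSE)

c(two_sided = p_two_sided, one_sided = p_one_sided)
```

The same data land on opposite sides of 0.05 depending on which alternative was declared, which is why the direction needs to be justified before the data are seen.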


A p-value Is Often Misinterpreted

The p-value is probably the most abused number in applied statistics (Wasserstein and Lazar 2016).

A p-value is:

the probability of observing data as extreme as, or more extreme than, what we saw, assuming the null hypothesis and the rest of the statistical model are true.

That is all.

A p-value is not:

  • the probability that the null hypothesis is true,
  • the probability that the results happened “by chance,”
  • the probability the alternative is correct,
  • a measure of effect size,
  • a measure of scientific importance.

This is why a very small p-value can still correspond to a trivial effect in a large sample. And a large p-value can still be compatible with a meaningful but imprecisely estimated effect in a small sample.
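A short calculation makes the point without any simulation. The helper below is hypothetical (written for this illustration, not part of any package) and assumes equal group sizes and a common standard deviation; the effect sizes and sample sizes are arbitrary.

```r
# Two-sided p-value for a two-sample t-test with equal group sizes and a
# common standard deviation, computed from summary statistics.
# (Hypothetical helper written for this illustration.)
p_from_summary <- function(mean_diff, sd, n_per_group) {
  se     <- sd * sqrt(2 / n_per_group)
  t_stat <- mean_diff / se
  df     <- 2 * n_per_group - 2
  2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
}

# Trivial effect (0.05 SD) in a very large sample
p_large_n <- p_from_summary(mean_diff = 0.05, sd = 1, n_per_group = 100000)

# Moderate effect (0.5 SD) in a very small sample
p_small_n <- p_from_summary(mean_diff = 0.5, sd = 1, n_per_group = 10)

c(trivial_effect_large_n = p_large_n, moderate_effect_small_n = p_small_n)
```

The smaller effect produces the far smaller p-value, purely because of sample size. The p-value tracks how precisely an effect is estimated, not how large it is.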


Type I and Type II Errors Define the Basic Risks

Every hypothesis test operates under uncertainty, which means it carries error risk.

Type I Error

A Type I error occurs when we reject the null hypothesis even though it is true.

This is a false positive. Its probability, the Type I error rate, is denoted \(\alpha\).

It is typically controlled at:

\[ \alpha = 0.05 \]

though this threshold is conventional, not sacred.

Type II Error

A Type II error occurs when we fail to reject the null hypothesis even though the alternative is true.

This is a false negative.

Its probability, the Type II error rate, is typically denoted:

\[ \beta \]

Power

Power is:

\[ 1 - \beta \]

It is the probability of rejecting the null hypothesis when an effect of the specified size truly exists.

This is one reason p-values alone are insufficient (Wasserstein and Lazar 2016). A non-significant result may reflect absence of effect, inadequate power, or noisy data.
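Both error rates can be estimated by simulation, which is often the most intuitive way to see what they mean. The sketch below assumes the same design used later in this post (60 per group, standard deviation 3, a true difference of 1.2 under the alternative); the seed and the number of simulated studies are arbitrary, and `simulate_p` is a hypothetical helper.

```r
set.seed(2023)  # arbitrary seed so the sketch is reproducible

# Simulate one two-group study and return its p-value.
# `delta` is the true difference in means; delta = 0 means the null is true.
simulate_p <- function(delta, n = 60, sd = 3) {
  control   <- rnorm(n, mean = 10, sd = sd)
  treatment <- rnorm(n, mean = 10 + delta, sd = sd)
  t.test(treatment, control)$p.value
}

n_sims <- 2000
alpha  <- 0.05

# Null true: the share of rejections estimates the Type I error rate
p_null     <- replicate(n_sims, simulate_p(delta = 0))
type1_rate <- mean(p_null < alpha)

# Alternative true (difference of 1.2): share of rejections estimates power
p_alt     <- replicate(n_sims, simulate_p(delta = 1.2))
power_est <- mean(p_alt < alpha)

c(type1_rate = type1_rate, power = power_est, type2_rate = 1 - power_est)
```

Under the null, roughly 5% of simulated studies reject by construction. Under the alternative, the rejection share is the power, and everything else is Type II error.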


A t-test Is One of the Simplest Working Examples

A two-sample t-test compares the means of two groups.

We will simulate a small example with two groups.

library(dplyr)
library(tibble)
library(ggplot2)

ttest_df <- tibble::tibble(
  group = rep(c("Control", "Treatment"), each = 60),
  outcome = c(
    rnorm(60, mean = 10, sd = 3),
    rnorm(60, mean = 11.2, sd = 3)
  )
)

ttest_df |>
  dplyr::group_by(group) |>
  dplyr::summarise(
    n = dplyr::n(),
    mean = mean(outcome),
    sd = sd(outcome),
    .groups = "drop"
  )
# A tibble: 2 × 4
  group         n  mean    sd
  <chr>     <int> <dbl> <dbl>
1 Control      60  9.79  3.49
2 Treatment    60 11.1   2.78

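Now conduct the t-test. By default, t.test() runs Welch's version, which does not assume equal variances in the two groups.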

ttest_res <- t.test(outcome ~ group, data = ttest_df)
ttest_res

    Welch Two Sample t-test

data:  outcome by group
t = -2.3153, df = 112.43, p-value = 0.02241
alternative hypothesis: true difference in means between group Control and group Treatment is not equal to 0
95 percent confidence interval:
 -2.4765335 -0.1925318
sample estimates:
  mean in group Control mean in group Treatment 
               9.794092               11.128624 

This output provides:

  • a test statistic,
  • degrees of freedom,
  • a p-value,
  • a confidence interval,
  • and group mean information.

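That is already a reminder that the p-value should not be interpreted alone. The confidence interval often communicates much more: here the difference (Control minus Treatment) is estimated to lie between about -2.48 and -0.19, a range that excludes zero but spans everything from a small effect to a fairly large one.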
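The pieces of a t-test result can be pulled from the fitted object directly, which makes it easier to report the estimate and interval together instead of the p-value alone. This sketch re-simulates data with an arbitrary seed (the post does not state one), so its numbers will not match the printed output; the variable names are illustrative.

```r
set.seed(42)  # arbitrary seed; the post's own seed is not stated

outcome_control   <- rnorm(60, mean = 10, sd = 3)
outcome_treatment <- rnorm(60, mean = 11.2, sd = 3)

res <- t.test(outcome_control, outcome_treatment)  # Welch test by default

# Estimated difference, in the same direction as the interval
# (first group minus second group: Control minus Treatment)
est_diff <- unname(res$estimate[1] - res$estimate[2])

report <- c(
  difference = est_diff,
  ci_lower   = res$conf.int[1],
  ci_upper   = res$conf.int[2],
  p_value    = res$p.value
)
round(report, 3)
```

Reporting the difference with its interval keeps the direction and the precision of the estimate in view, which a bare p-value does not.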


Visualization Usually Improves Interpretation

Before and after formal testing, it helps to visualize the data.

ggplot2::ggplot(ttest_df, ggplot2::aes(x = group, y = outcome)) +
  ggplot2::geom_boxplot() +
  ggplot2::labs(
    title = "Two-Group Outcome Comparison",
    x = NULL,
    y = "Outcome"
  ) +
  ggplot2::theme_minimal()

A visualization helps clarify:

  • overlap,
  • spread,
  • outliers,
  • and whether the tested difference is practically meaningful.

This is especially important in applied settings where statistical significance can exaggerate the importance of small effects.


Chi-square Tests Extend the Logic to Categorical Data

Hypothesis testing is not limited to continuous outcomes.

For categorical data, one common tool is the chi-square test.

Suppose we want to test whether the proportion of correct predictions differs between two models, each evaluated on a separate set of 120 cases.

chi_df <- tibble::tibble(
  group = rep(c("Model A", "Model B"), times = c(120, 120)),
  correct = c(
    rbinom(120, size = 1, prob = 0.68),
    rbinom(120, size = 1, prob = 0.80)
  )
)

chi_tbl <- table(chi_df$group, chi_df$correct)
chi_tbl
         
           0  1
  Model A 43 77
  Model B 23 97

Now perform the chi-square test.

chisq.test(chi_tbl)

    Pearson's Chi-squared test with Yates' continuity correction

data:  chi_tbl
X-squared = 7.5444, df = 1, p-value = 0.00602

This kind of test is common in:

  • contingency table analysis,
  • response/nonresponse comparisons,
  • feature-outcome association screening,
  • and A/B testing with binary outcomes.

But again, a statistically significant association does not automatically imply practical importance.
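To judge practical importance, the table can be summarized as a difference in proportions and an odds ratio. The sketch below uses the counts from the table above; the variable names are illustrative, and the interval comes from prop.test(), which is one reasonable choice among several.

```r
# Counts taken from the table above
correct <- c(model_a = 77, model_b = 97)
total   <- c(model_a = 120, model_b = 120)

# Two-sample test of proportions; for a 2 x 2 table this reproduces the
# chi-square statistic above (continuity correction is on by default)
prop_res <- prop.test(x = correct, n = total)

accuracy  <- correct / total
risk_diff <- unname(accuracy["model_b"] - accuracy["model_a"])

# prop.test() reports the interval for (first group - second group),
# so flip the sign and order to express it as Model B minus Model A
ci_b_minus_a <- -rev(prop_res$conf.int)

# Odds ratio for a correct prediction, Model B relative to Model A
odds       <- accuracy / (1 - accuracy)
odds_ratio <- unname(odds["model_b"] / odds["model_a"])

round(c(risk_diff = risk_diff, ci_lower = ci_b_minus_a[1],
        ci_upper = ci_b_minus_a[2], odds_ratio = odds_ratio), 3)
```

Whether a difference of this size matters depends on the application, which is a judgment the test statistic cannot make.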


p-values Do Not Replace Effect Sizes

One of the most persistent problems in applied work is the fixation on whether p is below 0.05.

That framing ignores effect size.

For the t-test example, a mean difference can be estimated directly.

ttest_df |>
  dplyr::group_by(group) |>
  dplyr::summarise(mean_outcome = mean(outcome), .groups = "drop")
# A tibble: 2 × 2
  group     mean_outcome
  <chr>            <dbl>
1 Control           9.79
2 Treatment        11.1 

We can also compute a standardized effect size such as Cohen’s \(d\).

control_vals <- ttest_df |>
  dplyr::filter(group == "Control") |>
  dplyr::pull(outcome)

treat_vals <- ttest_df |>
  dplyr::filter(group == "Treatment") |>
  dplyr::pull(outcome)

pooled_sd <- sqrt(
  ((length(control_vals) - 1) * var(control_vals) +
     (length(treat_vals) - 1) * var(treat_vals)) /
    (length(control_vals) + length(treat_vals) - 2)
)

cohens_d <- (mean(treat_vals) - mean(control_vals)) / pooled_sd
cohens_d
[1] 0.4227174

This helps separate:

  • statistical detectability,
  • from practical magnitude.

That distinction matters enormously in both biostats and ML.


Confidence Intervals Often Tell a Better Story Than p-values Alone

Confidence intervals are often more informative than a binary significance label.

They communicate:

  • the estimated direction of effect,
  • the plausible range of values,
  • and the precision of the estimate.

For many readers, a confidence interval answers the practical question better than a p-value (Wasserstein and Lazar 2016).

The t-test output already included an interval for the difference in means. That interval often deserves more interpretive weight than the thresholded p-value.

In model evaluation, the same applies. Intervals around accuracy, AUC, calibration, or error rates often tell a more honest story than a single metric.
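As a minimal sketch of an interval around a model metric, the code below computes two common 95% intervals for an accuracy of 97 correct out of 120 (Model B in the chi-square example). It assumes the test cases are independent; intervals for AUC or calibration need other methods that are not shown here.

```r
# Model B from the chi-square example: 97 correct out of 120
n_correct <- 97
n_total   <- 120

point_estimate <- n_correct / n_total

# Exact (Clopper-Pearson) 95% interval for a binomial proportion
exact_ci <- binom.test(n_correct, n_total)$conf.int

# Interval from prop.test(), which uses a Wilson score interval
# with continuity correction
score_ci <- prop.test(n_correct, n_total)$conf.int

round(rbind(exact = as.numeric(exact_ci), score = as.numeric(score_ci)), 3)
point_estimate
```

With only 120 test cases, the interval is wide enough that a single reported accuracy figure would hide a good deal of uncertainty.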


Power Determines What a Study Can Reasonably Detect

A non-significant result is hard to interpret without thinking about power.

Power depends on:

  • the true effect size,
  • sample size,
  • variability,
  • significance threshold,
  • and test choice.

We can estimate power for a two-sample t-test.

power.t.test(
  n = 60,
  delta = 1.2,
  sd = 3,
  sig.level = 0.05,
  type = "two.sample",
  alternative = "two.sided"
)

     Two-sample t test power calculation 

              n = 60
          delta = 1.2
             sd = 3
      sig.level = 0.05
          power = 0.5843645
    alternative = two.sided

NOTE: n is number in *each* group

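This gives an approximate sense of whether the study is capable of detecting an effect of that size. Here power is about 0.58: if the true difference were 1.2, a study with 60 per group would miss it roughly 42% of the time.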

Low-power studies are not just inconvenient. They distort interpretation by making both positive and negative findings harder to trust.
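It can help to look at power across a range of sample sizes instead of a single design. The sketch below reuses the settings from the calculation above (difference of 1.2, standard deviation 3); the grid of sample sizes is an arbitrary choice for illustration.

```r
# Power for a true difference of 1.2 (sd = 3) at several per-group sample sizes
n_grid <- c(30, 60, 100, 150, 200)

power_by_n <- sapply(n_grid, function(n) {
  power.t.test(n = n, delta = 1.2, sd = 3, sig.level = 0.05,
               type = "two.sample", alternative = "two.sided")$power
})

data.frame(n_per_group = n_grid, power = round(power_by_n, 3))
```

Reading down the table shows how quickly, or how slowly, additional observations buy the ability to detect the effect of interest.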


Sample Size Planning Is an Ethical and Scientific Design Issue

Sample size is often treated as a technical afterthought, but it is really a design decision.

An underpowered study may:

  • miss meaningful effects,
  • waste resources,
  • and generate ambiguous conclusions.

An extremely large study may:

  • detect trivial effects,
  • overemphasize nominal significance,
  • and encourage overclaiming.

This is why sample size planning should be tied to:

  • plausible effect sizes,
  • decision relevance,
  • expected variability,
  • and the cost of Type I versus Type II errors.

In AI/ML experimentation, the same logic applies to online tests and validation studies.
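The planning question can also be run in reverse: fix the target power and solve for the sample size. This is a sketch under the same assumed standard deviation of 3 as the earlier example; `required_n` is a hypothetical helper and the candidate effect sizes are illustrative.

```r
# Solve for the per-group sample size needed to reach a target power,
# for a few candidate effect sizes (sd = 3 as in the earlier example)
required_n <- function(delta, power = 0.80) {
  res <- power.t.test(delta = delta, sd = 3, sig.level = 0.05, power = power,
                      type = "two.sample", alternative = "two.sided")
  ceiling(res$n)  # round up to whole participants per group
}

deltas <- c(0.6, 1.2, 2.4)
data.frame(
  delta       = deltas,
  n_per_group = sapply(deltas, required_n)
)
```

Halving the effect size roughly quadruples the required sample, which is why the choice of a plausible, decision-relevant effect size dominates the plan.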


Hypothesis Testing Still Matters in AI/ML

Some people frame hypothesis testing as an outdated concern compared with predictive modeling.

That is too simplistic.

Hypothesis testing still matters in AI/ML for problems such as:

  • A/B testing two models in production,
  • comparing feature sets,
  • evaluating whether a metric improved beyond noise,
  • testing subgroup differences in performance,
  • and screening whether observed improvements are likely to be sampling variation.

For example, if a new recommendation model appears 1 percent better than the old one, the right question is not only whether the metric increased, but whether the increase is stable, meaningful, and unlikely to reflect chance variation.

That is a testing problem.
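A rough planning calculation shows how demanding a 1 percent improvement can be. The numbers below are assumptions made for illustration, not figures from any real system: a 10% baseline click-through rate and a 1% relative improvement, tested with a standard two-proportion design.

```r
# Assumed rates, for illustration only: a 10% baseline click-through rate
# and a 1% relative improvement (10% -> 10.1%)
baseline_rate <- 0.10
new_rate      <- baseline_rate * 1.01

ab_plan <- power.prop.test(p1 = baseline_rate, p2 = new_rate,
                           sig.level = 0.05, power = 0.80)

# Users needed in each arm to detect the difference with 80% power
ceiling(ab_plan$n)
```

Under these assumptions the requirement runs to more than a million users per arm, so a small observed lift in a modest experiment is usually well within the range of noise.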


Hypothesis Testing Can Also Be Misused in ML

At the same time, hypothesis testing can be badly misapplied in modern ML workflows.

Common mistakes include:

  • testing many features without multiplicity control,
  • declaring importance from univariate tests alone,
  • treating significance as predictive usefulness,
  • ignoring dependence across repeated experiments,
  • confusing validation noise with true improvement.

A feature can be statistically significant but practically useless for prediction. Conversely, a feature may not be individually significant in a small sample yet still contribute meaningfully in a multivariable predictive model.

This is one reason hypothesis testing should not be confused with model building.
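The first mistake on the list is easy to reproduce. The sketch below screens 200 pure-noise features against a random binary outcome, so every "significant" result is a false positive by construction. The seed and dimensions are arbitrary, and Holm and Benjamini-Hochberg are shown as two common adjustments, not as the only options.

```r
set.seed(101)  # arbitrary seed for reproducibility

n_obs      <- 200
n_features <- 200

# Binary outcome and pure-noise features: no feature is truly associated
outcome  <- rbinom(n_obs, size = 1, prob = 0.5)
features <- matrix(rnorm(n_obs * n_features), nrow = n_obs)

# One univariate t-test per feature
p_raw <- apply(features, 2, function(x) t.test(x ~ outcome)$p.value)

# Adjust for multiplicity: Holm controls the family-wise error rate,
# Benjamini-Hochberg controls the false discovery rate
p_holm <- p.adjust(p_raw, method = "holm")
p_bh   <- p.adjust(p_raw, method = "BH")

c(
  raw_significant  = sum(p_raw < 0.05),
  holm_significant = sum(p_holm < 0.05),
  bh_significant   = sum(p_bh < 0.05)
)
```

With 200 null tests at the 0.05 level, about ten unadjusted rejections are expected from noise alone, which is the baseline any feature-screening result should be compared against.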


Beyond p < 0.05 Means Thinking in Errors, Uncertainty, and Decisions

A mature use of hypothesis testing asks more than:

“Was the p-value below 0.05?”

It asks:

  • What was the null hypothesis?
  • Was the alternative scientifically meaningful?
  • What are the practical consequences of Type I and Type II errors?
  • How large is the effect?
  • How precise is the estimate?
  • Was the study adequately powered?
  • Would the result matter in deployment or decision-making?

This is especially important in high-stakes settings.

A small p-value does not decide whether a finding matters. People do.


A Small ML-Style Example: Comparing Model Accuracy

Suppose two models are evaluated on repeated resamples, producing two sets of accuracy estimates.

We can illustrate the temptation and the caution.

model_perf_df <- tibble::tibble(
  model = rep(c("Model A", "Model B"), each = 40),
  accuracy = c(
    rnorm(40, mean = 0.78, sd = 0.03),
    rnorm(40, mean = 0.80, sd = 0.03)
  )
)

model_perf_df |>
  dplyr::group_by(model) |>
  dplyr::summarise(
    mean_accuracy = mean(accuracy),
    sd_accuracy = sd(accuracy),
    .groups = "drop"
  )
# A tibble: 2 × 3
  model   mean_accuracy sd_accuracy
  <chr>           <dbl>       <dbl>
1 Model A         0.783      0.0290
2 Model B         0.796      0.0230

Now run a t-test on the two sets of accuracy estimates.

t.test(accuracy ~ model, data = model_perf_df)

    Welch Two Sample t-test

data:  accuracy by model
t = -2.1781, df = 74.101, p-value = 0.03258
alternative hypothesis: true difference in means between group Model A and group Model B is not equal to 0
95 percent confidence interval:
 -0.02442401 -0.00108666
sample estimates:
mean in group Model A mean in group Model B 
            0.7834150             0.7961704 

This kind of comparison can be useful, but it also raises deeper questions:

  • are the resamples independent,
  • is the metric distribution well behaved,
  • is the effect practically meaningful,
  • and does significance translate into better operational performance?

In other words, testing is necessary, but not sufficient.
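The independence question can be illustrated directly. When both models are scored on the same resamples, their accuracies share resample-level variation, and a test that ignores the pairing answers a different question. The sketch below assumes a simple data-generating setup (a shared resample effect plus small model-specific noise) that is invented for illustration.

```r
set.seed(7)  # arbitrary seed for reproducibility

n_resamples <- 40

# Assumed data-generating setup: each resample has its own difficulty,
# shared by both models, plus small model-specific noise
resample_effect <- rnorm(n_resamples, mean = 0, sd = 0.03)
acc_a <- 0.78 + resample_effect + rnorm(n_resamples, sd = 0.005)
acc_b <- 0.80 + resample_effect + rnorm(n_resamples, sd = 0.005)

# Treats the two sets of accuracies as independent samples
welch_res <- t.test(acc_b, acc_a)

# Compares the models within each resample
paired_res <- t.test(acc_b, acc_a, paired = TRUE)

rbind(
  welch  = c(estimate = mean(acc_b) - mean(acc_a), p_value = welch_res$p.value),
  paired = c(estimate = unname(paired_res$estimate), p_value = paired_res$p.value)
)
```

The estimated difference is identical, but the evidence attached to it changes sharply. Pairing only accounts for the shared resample effect; resamples drawn from the same dataset also overlap with one another, so even the paired p-value can be optimistic.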


A Practical Checklist for Applied Work

Before reporting a hypothesis test, ask:

  • What exactly is the null hypothesis?
  • What result would count as practically meaningful?
  • What are the risks of Type I and Type II errors here?
  • Is the p-value being interpreted correctly?
  • What is the effect size?
  • What does the confidence interval say?
  • Was the analysis adequately powered?
  • Am I using significance as a shortcut for scientific or predictive relevance?

These questions greatly improve the quality of interpretation.

Note

Where This Shows Up in AI/ML

A/B testing for clinical AI deployment, such as comparing model-assisted with standard-of-care decisions in a prospective trial, is formally a hypothesis test. The same problems that plague p-values in research (underpowering, multiple comparisons, stopping as soon as significance is reached) appear in model evaluation. A study powered to detect a 5% reduction in missed sepsis diagnoses will almost never be powered for subgroup effects in the trauma population, yet subgroup failure is the most clinically dangerous failure mode.

AI benchmark overfitting directly mirrors the replication crisis. Models that are fine-tuned and re-evaluated on the same held-out set until they achieve a publishable AUC are subject to implicit multiple comparisons without correction, which inflates reported performance in much the same way that contributed to a decade of irreproducible findings in psychology.
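The benchmark-selection effect can be sketched in a few lines. Every number below is an assumption made for illustration: 50 model variants with the same true accuracy of 0.75, all scored on one held-out set of 500 cases, with the best score reported.

```r
set.seed(11)  # arbitrary seed for reproducibility

true_accuracy <- 0.75   # every variant is equally good (assumed)
n_test        <- 500    # size of the shared held-out set
n_variants    <- 50     # number of model variants tried

# Observed accuracy of each variant on the same held-out set.
# Simplification: variants are simulated as independent draws, although
# real variants scored on one test set would be correlated.
observed <- rbinom(n_variants, size = n_test, prob = true_accuracy) / n_test

c(
  true_accuracy = true_accuracy,
  mean_observed = mean(observed),
  best_observed = max(observed),
  optimism      = max(observed) - true_accuracy
)
```

The average observed accuracy sits near the truth, while the best of the 50 is biased upward simply because it was selected for being the best. No variant is actually better than any other.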


Closing: Hypothesis Testing Still Matters, but Only If We Use It Well

Hypothesis testing remains useful because it gives structure to uncertainty.

It helps analysts ask whether observed differences are compatible with a null model. It formalizes false positive and false negative risk. It supports study planning through power and sample size logic.

But it becomes misleading when reduced to a ritual.

A p-value alone cannot tell us:

  • whether an effect matters,
  • whether a model is useful,
  • whether a feature improves prediction,
  • or whether a result should change practice.

Those judgments require effect sizes, intervals, design thinking, and domain context.

The real lesson of hypothesis testing in modern statistics and AI is not to abandon it, but to stop treating it like a binary oracle.


This post is part of the Trauma Registry Analytics Toolkit — a companion reference with hypothesis testing templates for multi-site registry data, effect size reporting, and power calculation scaffolds.



Series Callout

Note

This post is part of a broader Applied Statistics for AI and Clinical Decision-Making Series:

  • Probability fundamentals for machine learning
  • Random variables and expectation
  • Common probability distributions
  • Central Limit Theorem
  • Law of Large Numbers
  • Sampling methods for Biostats and ML
  • Hypothesis testing in the age of AI
  • Confidence intervals
  • Maximum likelihood estimation
  • Bayesian inference
  • Linear regression
  • Logistic regression
  • Generalized linear models
  • Analysis of variance
  • Principal component analysis
  • Cluster analysis
  • Time series analysis
  • Survival analysis
  • Non-parametric methods
  • Bias-variance tradeoff
  • Regularization
  • Cross-validation
  • Information theory
  • Optimization techniques
  • Linear algebra basics
  • Calculus for ML
  • Monte Carlo methods
  • Dimensionality curse and reduction techniques
  • Model evaluation metrics
  • Ensemble methods

References

Casella, George, and Roger L. Berger. 2002. Statistical Inference. 2nd ed. Duxbury.
Wasserstein, Ronald L., and Nicole A. Lazar. 2016. “The ASA Statement on p-Values: Context, Process, and Purpose.” The American Statistician 70 (2): 129–33. https://doi.org/10.1080/00031305.2016.1154108.