LLN in Action: Building Reliable ML Models from Noisy Data

Applied Statistics

Law of Large Numbers

An applied introduction to the Law of Large Numbers, convergence of sample means, and why larger samples stabilize estimates in statistics and machine learning.

Published

July 15, 2023

Modified

June 9, 2026

Executive Summary

The Law of Large Numbers is one of the simplest and most important ideas in probability (DeGroot and Schervish 2012; Wasserman 2004).

It formalizes a principle that many analysts intuitively trust:

as we observe more data, averages become more stable.

That idea is foundational in statistics, clinical research, and machine learning.

If a process has a true average outcome, then repeated observations from that process should eventually produce a sample average that gets close to the truth.

This is why larger studies tend to produce more reliable estimates. It is why model performance estimated on small samples can be unstable. And it is why empirical summaries computed from more data are usually more credible.

In machine learning, the Law of Large Numbers helps explain why empirical risk can approximate population risk. In biostatistics, it helps explain why trial estimates stabilize as enrollment grows.

This post introduces the Law of Large Numbers through intuition, simulation, and applied examples.

We will focus on:

what the Law of Large Numbers says,
the difference between weak and strong forms,
how convergence looks in code,
and why the idea matters for reliable ML and clinical inference.

The Law of Large Numbers does not eliminate randomness. It makes randomness average out in predictable ways.

The Law of Large Numbers Explains Why Averages Settle Down

Individual observations are noisy.

A single patient outcome may differ from expectation. A single model prediction may be wrong. A single device reading may be erratic. A single trial participant may not reflect the broader population.

But when we average across many observations, the noise begins to cancel.

That is the intuition behind the Law of Large Numbers.

If observations are generated from a stable process with a finite mean, then the sample average tends to approach the true population mean as the sample size increases (DeGroot and Schervish 2012; Casella and Berger 2002).

This is one of the central reasons statistics works at all.

Without some form of averaging stability, inference would be much harder.

What the Law of Large Numbers Actually Says

Let \(X_1, X_2, \dots, X_n\) be independent and identically distributed random variables with common finite mean \(\mu\).

Define the sample mean as:

\[ \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i \]

The Law of Large Numbers says that as \(n\) grows, the sample mean \(\bar{X}_n\) converges to the true mean \(\mu\).

In plain language:

small samples can fluctuate a lot,
but large samples tend to average out those fluctuations.

This does not mean every large sample is perfect. It means the average becomes increasingly reliable in a probabilistic sense.

Weak vs. Strong LLN: Same Destination, Different Kind of Guarantee

There are two commonly discussed forms of the Law of Large Numbers.

Weak Law of Large Numbers

The weak LLN says that the sample mean converges to the true mean in probability.

That means:

for any fixed tolerance, the probability that the sample mean differs from the true mean by more than that tolerance goes to zero as \(n\) increases.

This is a probabilistic guarantee.
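The weak law can be checked numerically. The sketch below is a minimal illustration in base R, assuming fair coin flips and an arbitrary tolerance of 0.05; the names `eps`, `n_reps`, and `sample_sizes` are illustrative choices, not part of the original analysis. It estimates how often the sample mean misses 0.5 by more than the tolerance at several sample sizes.

```r
# Estimate P(|sample mean - 0.5| > eps) for fair coin flips at several n.
set.seed(2023)  # arbitrary seed, used only for reproducibility

eps <- 0.05                          # illustrative tolerance (an assumption)
n_reps <- 2000                       # simulated samples per sample size
sample_sizes <- c(10, 100, 1000, 10000)

exceed_prob <- sapply(sample_sizes, function(n) {
  # Each binomial count divided by n is the mean of n coin flips
  sample_means <- rbinom(n_reps, size = n, prob = 0.5) / n
  # Share of simulated samples whose mean misses 0.5 by more than eps
  mean(abs(sample_means - 0.5) > eps)
})

weak_lln_df <- data.frame(n = sample_sizes, exceed_prob = exceed_prob)
print(weak_lln_df)
```

The estimated exceedance probability should shrink toward zero as the sample size grows, which is the pattern the weak law describes.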

Strong Law of Large Numbers

The strong LLN says that the sample mean converges to the true mean almost surely.

That is a stronger statement.

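It means that with probability 1, the sequence of sample means converges to the true mean: along almost every realized sequence of observations, the running average eventually stays within any chosen tolerance of \(\mu\).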

In practice, both tell a similar story for applied work:

more data makes averages more trustworthy.

But mathematically, the strong law gives a stronger mode of convergence.
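For readers who prefer symbols, the two statements can be written side by side. This assumes independent, identically distributed observations with finite mean \(\mu\); the tolerance \(\varepsilon\) is notation introduced here for illustration.

\[ \text{Weak law:} \quad \lim_{n \to \infty} P\left(|\bar{X}_n - \mu| > \varepsilon\right) = 0 \quad \text{for every } \varepsilon > 0 \]

\[ \text{Strong law:} \quad P\left(\lim_{n \to \infty} \bar{X}_n = \mu\right) = 1 \]

The weak law is a statement about each large \(n\) taken separately. The strong law is a statement about the entire sequence of sample means at once.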

The LLN Is About Convergence of Averages, Not Elimination of Variability

A common misunderstanding is that the Law of Large Numbers says randomness disappears.

It does not.

Randomness remains in the raw observations.

What changes is that the average becomes more stable.

This distinction matters.

A dataset of individual patient outcomes may remain heterogeneous even in a very large sample. A model may still make variable predictions case-by-case.

What stabilizes is the summary quantity: the sample mean, empirical proportion, or average loss.

That is why the LLN is so relevant to estimation and machine learning objectives.
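The distinction between raw variability and a stable average can be seen directly. The following is a small base R sketch, assuming exponential data with true mean 1 and true standard deviation 1; the variable names and checkpoints are illustrative. It compares the spread of the raw observations with the error of the running mean at three sample sizes.

```r
# Raw spread stays roughly constant; the running mean settles near 1.
set.seed(42)  # arbitrary seed for reproducibility

n_obs <- 10000
x <- rexp(n_obs, rate = 1)                 # true mean 1, true sd 1
running_mean <- cumsum(x) / seq_along(x)

checkpoints <- c(100, 1000, 10000)
spread_df <- data.frame(
  n = checkpoints,
  # Standard deviation of the first k raw observations
  sd_raw = sapply(checkpoints, function(k) sd(x[1:k])),
  running_mean = running_mean[checkpoints],
  # Distance between the running mean and the true mean
  abs_error = abs(running_mean[checkpoints] - 1)
)
print(spread_df)
```

The `sd_raw` column should hover near 1 at every checkpoint, while `abs_error` tends to shrink: the observations stay noisy, and only the summary stabilizes.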

A Simple Simulation Makes the LLN Visible

The easiest way to understand the LLN is to simulate repeated observations and track the running average.

We will begin with a simple example using coin flips coded as 1 for heads and 0 for tails.

The true mean is 0.5.

library(dplyr)
library(tibble)
library(tidyr)    # used later for pivot_longer()
library(ggplot2)

n <- 5000

coin_df <- tibble::tibble(
  trial = 1:n,
  x = rbinom(n = n, size = 1, prob = 0.5)
) |>
  dplyr::mutate(
    running_mean = cumsum(x) / trial
  )

coin_df |>
  dplyr::slice_head(n = 10)

# A tibble: 10 × 3
   trial     x running_mean
   <int> <int>        <dbl>
 1     1     1        1    
 2     2     0        0.5  
 3     3     0        0.333
 4     4     0        0.25 
 5     5     1        0.4  
 6     6     1        0.5  
 7     7     0        0.429
 8     8     1        0.5  
 9     9     0        0.444
10    10     0        0.4

Now visualize the running mean.

ggplot2::ggplot(coin_df, ggplot2::aes(x = trial, y = running_mean)) +
  ggplot2::geom_line(linewidth = 0.6) +
  ggplot2::geom_hline(yintercept = 0.5, linetype = 2) +
  ggplot2::labs(
    title = "Law of Large Numbers: Running Mean of Bernoulli Trials",
    x = "Number of Trials",
    y = "Running Mean"
  ) +
  ggplot2::theme_minimal()

Early on, the running mean fluctuates substantially.

Later, it stabilizes around 0.5.

That is the LLN in action.

The LLN Works Beyond Coin Flips

The theorem is not limited to binary data.

It applies much more broadly whenever the required conditions hold.

To make that visible, consider a right-skewed continuous distribution such as the exponential.

Its raw values are noisy and asymmetric, but the running mean still stabilizes.

exp_df <- tibble::tibble(
  trial = 1:n,
  x = rexp(n = n, rate = 1)
) |>
  dplyr::mutate(
    running_mean = cumsum(x) / trial
  )

ggplot2::ggplot(exp_df, ggplot2::aes(x = trial, y = running_mean)) +
  ggplot2::geom_line(linewidth = 0.6) +
  ggplot2::geom_hline(yintercept = 1, linetype = 2) +
  ggplot2::labs(
    title = "Law of Large Numbers: Running Mean of Exponential Data",
    x = "Number of Observations",
    y = "Running Mean"
  ) +
  ggplot2::theme_minimal()

Even though the underlying distribution is highly skewed, the running average still approaches the true mean.

That is one reason the LLN is so powerful. It is not restricted to neat, symmetric data.
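The finite-mean condition still matters, though. As a hedged contrast, the sketch below places exponential data next to standard Cauchy data, a distribution with no finite mean, so the LLN does not apply to it. The Cauchy example and all variable names are additions for illustration and do not come from the original analysis.

```r
# Contrast: finite mean (exponential) versus no finite mean (Cauchy).
set.seed(7)  # arbitrary seed for reproducibility

n_obs <- 5000
exp_x <- rexp(n_obs, rate = 1)   # finite mean of 1
cauchy_x <- rcauchy(n_obs)       # mean is undefined

checkpoints <- c(50, 500, 5000)
heavy_tail_df <- data.frame(
  n = checkpoints,
  exp_running_mean = (cumsum(exp_x) / seq_len(n_obs))[checkpoints],
  cauchy_running_mean = (cumsum(cauchy_x) / seq_len(n_obs))[checkpoints]
)
print(heavy_tail_df)
```

The exponential running mean should sit close to 1 by the last checkpoint. The Cauchy running mean has no target to converge to and can jump at any sample size whenever an extreme draw arrives.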

A Clinical Trial Interpretation: Event Rates Stabilize with Enrollment

The LLN is highly relevant in biostatistics because many clinical questions depend on estimated averages or proportions.

For example:

event rates,
response rates,
mean biomarker changes,
adverse event proportions.

Suppose the true adverse event rate is 20 percent.

At the beginning of a trial, the observed proportion may bounce around considerably. As enrollment grows, the observed event rate becomes more stable.

trial_n <- 1000

trial_df <- tibble::tibble(
  patient = 1:trial_n,
  adverse_event = rbinom(n = trial_n, size = 1, prob = 0.20)
) |>
  dplyr::mutate(
    observed_rate = cumsum(adverse_event) / patient
  )

ggplot2::ggplot(trial_df, ggplot2::aes(x = patient, y = observed_rate)) +
  ggplot2::geom_line(linewidth = 0.6) +
  ggplot2::geom_hline(yintercept = 0.20, linetype = 2) +
  ggplot2::labs(
    title = "Observed Adverse Event Rate as Enrollment Increases",
    x = "Enrolled Patients",
    y = "Observed Event Rate"
  ) +
  ggplot2::theme_minimal()

This is a useful way to explain to non-statistical audiences why early trial results should be interpreted cautiously.

Small samples can mislead. Larger samples stabilize.
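To put rough numbers on that caution, the sketch below tabulates the observed adverse event rate at a few enrollment checkpoints, alongside the approximate standard error of a proportion. It is a base R illustration that assumes the same 20 percent true rate; the checkpoints and variable names are hypothetical.

```r
# Observed event rate and approximate standard error at enrollment checkpoints.
set.seed(11)  # arbitrary seed for reproducibility

true_rate <- 0.20
enrolled <- c(25, 100, 400, 1000)          # illustrative checkpoints
events <- rbinom(max(enrolled), size = 1, prob = true_rate)

checkpoint_df <- data.frame(
  enrolled = enrolled,
  # Cumulative proportion of patients with an event at each checkpoint
  observed_rate = cumsum(events)[enrolled] / enrolled,
  # Standard error of a sample proportion under the true rate
  approx_se = sqrt(true_rate * (1 - true_rate) / enrolled)
)
print(checkpoint_df)
```

Under these assumptions the standard error falls from 0.08 at 25 patients to about 0.013 at 1,000 patients, which is why an early observed rate can sit far from 20 percent without anything being wrong.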

The LLN Underlies Empirical Risk in Machine Learning

In machine learning, a model is often evaluated through an average loss function.

Conceptually, we would like to minimize the true population risk:

\[ R(f) = E[L(Y, f(X))] \]

But we do not observe the full population distribution.

Instead, we estimate risk using the empirical risk:

\[ \hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^{n} L(Y_i, f(X_i)) \]

This is where the Law of Large Numbers matters.

As the sample size increases, the empirical average loss of a fixed model \(f\) tends to approach its expected population loss, assuming the observations are independent draws from the target population and the expected loss is finite.

That is one reason empirical risk minimization is reasonable (Hastie et al. 2009; James et al. 2021).

It uses the sample average as a stand-in for the population target.
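A small simulation makes the link between empirical and population risk concrete. The sketch below assumes a hypothetical data-generating process, Y = 2X + noise with standard normal X and noise, and a fixed, deliberately imperfect model f(x) = 1.5x scored with squared-error loss. None of these choices come from the original post. Under them the population risk can be worked out by hand as 0.25 * Var(X) + Var(noise) = 1.25.

```r
# Empirical risk of a fixed model approaches its population risk as n grows.
set.seed(3)  # arbitrary seed for reproducibility

f <- function(x) 1.5 * x                        # fixed, hypothetical model
squared_loss <- function(y, y_hat) (y - y_hat)^2

# Y - f(X) = 0.5 * X + noise, so E[loss] = 0.25 * 1 + 1 = 1.25
population_risk <- 1.25

sample_sizes <- c(50, 500, 5000, 50000)
empirical_risk <- sapply(sample_sizes, function(n) {
  x <- rnorm(n)
  y <- 2 * x + rnorm(n)
  mean(squared_loss(y, f(x)))                   # average loss on the sample
})

risk_df <- data.frame(
  n = sample_sizes,
  empirical_risk = empirical_risk,
  abs_gap = abs(empirical_risk - population_risk)
)
print(risk_df)
```

The model is held fixed here on purpose. The LLN describes the average loss of one given model; it does not by itself cover a model that was selected using the same data.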

The LLN Is One Foundation for Consistent Estimators

An estimator is called consistent if it converges to the true parameter value as sample size grows (Casella and Berger 2002).

The LLN is one of the main reasons many simple estimators are consistent.

Examples include:

the sample mean for the population mean,
the sample proportion for a Bernoulli probability,
empirical average loss for expected loss.

This matters in supervised learning.

If training data are representative and large enough, averages computed from the sample can begin to approximate the true quantities we care about.

That is not a guarantee of perfect modeling. But it is a critical starting point.

The LLN Does Not Say More Data Fixes Everything

It is tempting to interpret the theorem as “just get a lot of data and the problem is solved.”

That is too simplistic.

The Law of Large Numbers does not solve:

bias,
confounding,
nonrepresentative sampling,
label noise,
model misspecification,
missing data,
dependence structures.

A large biased sample can converge to the wrong answer very precisely.

So the LLN supports reliability of averages under appropriate assumptions. It does not fix flawed data-generating processes.
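The "precisely wrong" case is easy to reproduce. The sketch below assumes a hypothetical selection mechanism in which higher outcomes are more likely to be recorded; the population values, the inclusion rule, and all names are illustrative. The running mean of the recorded values stabilizes, but not at the true mean of 50.

```r
# A biased sample converges, but to the wrong value.
set.seed(5)  # arbitrary seed for reproducibility

n_pop <- 200000
outcome <- rnorm(n_pop, mean = 50, sd = 12)     # true population mean is 50

# Hypothetical selection: inclusion probability rises with the outcome
p_include <- plogis((outcome - 50) / 12)
included <- rbinom(n_pop, size = 1, prob = p_include) == 1
biased <- outcome[included]

running_mean_biased <- cumsum(biased) / seq_along(biased)
checkpoints <- c(100, 1000, 10000, 50000)
bias_df <- data.frame(
  n = checkpoints,
  running_mean = running_mean_biased[checkpoints]
)
print(bias_df)
```

The running mean should settle several units above 50 and stay there. Adding more observations from the same selection process tightens the estimate around the biased value rather than around the truth.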

This is an important lesson in both biostatistics and AI.

Convergence Can Be Slow or Fast Depending on the Problem

Not all averages stabilize at the same rate.

Convergence depends on factors such as:

variance,
tail behavior,
dependence,
and data quality.

Low-noise processes stabilize quickly.

High-variance or heavy-tailed processes may require far more observations before the running mean looks stable.

We can compare two examples.

compare_df <- tibble::tibble(
  trial = 1:n,
  low_var = rnorm(n, mean = 10, sd = 1),
  high_var = rnorm(n, mean = 10, sd = 5)
) |>
  dplyr::mutate(
    low_var_mean = cumsum(low_var) / trial,
    high_var_mean = cumsum(high_var) / trial
  ) |>
  tidyr::pivot_longer(
    cols = c(low_var_mean, high_var_mean),
    names_to = "series",
    values_to = "running_mean"
  )

ggplot2::ggplot(compare_df, ggplot2::aes(x = trial, y = running_mean)) +
  ggplot2::geom_line(linewidth = 0.6) +
  ggplot2::geom_hline(yintercept = 10, linetype = 2) +
  ggplot2::facet_wrap(~ series) +
  ggplot2::labs(
    title = "Convergence Depends on Variability",
    x = "Number of Observations",
    y = "Running Mean"
  ) +
  ggplot2::theme_minimal()

The target is the same in both cases, but the noisier process takes longer to settle.
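The gap between the two panels can be quantified with a short calculation. It assumes independent observations with finite variance \(\sigma^2\); the symbol \(\delta\) for a target standard error is introduced here for illustration.

\[ \operatorname{Var}(\bar{X}_n) = \frac{\sigma^2}{n}, \qquad \operatorname{SD}(\bar{X}_n) = \frac{\sigma}{\sqrt{n}} \]

Solving \(\sigma / \sqrt{n} = \delta\) for the sample size gives:

\[ n = \left(\frac{\sigma}{\delta}\right)^2 \]

With \(\sigma = 1\) versus \(\sigma = 5\), reaching the same \(\delta\) takes \(5^2 = 25\) times as many observations in the noisier series. At \(n = 5000\), the standard deviation of the running mean is about \(1/\sqrt{5000} \approx 0.014\) for the low-variance series and \(5/\sqrt{5000} \approx 0.071\) for the high-variance one.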

LLN in Model Evaluation: Why Averaging Across Folds Helps

In applied ML, performance estimates based on a single split can be unstable.

That is one reason analysts use:

repeated train/test splits,
cross-validation,
repeated cross-validation,
bootstrap summaries.

Averaging across repeated evaluations does not remove all uncertainty, but it often improves stability.

That logic is closely aligned with the LLN.

As more evaluation replications are included, the average performance estimate becomes more dependable than any single split.
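As a rough illustration of that stabilizing effect, the sketch below compares the spread of a single-split RMSE with the spread of an average over ten random splits. It uses base R with a hypothetical simulated dataset and a simple linear model; the helper `split_rmse()` and all settings are assumptions made for the example.

```r
# Single-split RMSE versus RMSE averaged over ten random splits.
set.seed(99)  # arbitrary seed for reproducibility

n_obs <- 200
dat <- data.frame(x = rnorm(n_obs))
dat$y <- 1 + 2 * dat$x + rnorm(n_obs, sd = 2)   # hypothetical data

# Hypothetical helper: fit on a random training split, score on the rest
split_rmse <- function(data, test_frac = 0.3) {
  test_idx <- sample(nrow(data), size = floor(test_frac * nrow(data)))
  fit <- lm(y ~ x, data = data[-test_idx, ])
  pred <- predict(fit, newdata = data[test_idx, ])
  sqrt(mean((data$y[test_idx] - pred)^2))
}

rmse_single <- replicate(200, split_rmse(dat))
rmse_avg10 <- replicate(200, mean(replicate(10, split_rmse(dat))))

stability <- c(sd_single = sd(rmse_single), sd_avg10 = sd(rmse_avg10))
print(stability)
```

The averaged estimate should vary noticeably less from run to run. Because every split reuses the same 200 observations, the replications are not independent draws from the population, so averaging reduces split-to-split noise but not the uncertainty that comes from the dataset itself.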

This is especially important when communicating model performance to decision-makers.

The LLN Is Also a Governance Idea

At a practical level, the Law of Large Numbers supports a governance principle:

do not overinterpret noisy early averages.

This matters in:

interim trial reviews,
early model validation,
pilot deployments,
monitoring dashboards,
and sequential reporting.

Small-sample summaries can appear dramatic. Some of that drama is just variance.

The LLN reminds us that reliability often comes from repeated accumulation, not from the first striking number.

A Simulation of Repeated Study Means

Another useful way to understand the LLN is to compare study-level means across increasing sample sizes.

Below, we simulate many studies and examine how the study means become more concentrated as sample size grows.

simulate_study_means <- function(n, n_reps = 3000) {
  tibble::tibble(
    rep = 1:n_reps,
    study_mean = replicate(n_reps, mean(rnorm(n, mean = 50, sd = 12))),
    n = paste0("n = ", n)
  )
}

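study_means_df <- dplyr::bind_rows(
  simulate_study_means(5),
  simulate_study_means(20),
  simulate_study_means(50),
  simulate_study_means(200)
) |>
  # Order facets by sample size rather than alphabetically
  dplyr::mutate(n = factor(n, levels = paste0("n = ", c(5, 20, 50, 200))))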

ggplot2::ggplot(study_means_df, ggplot2::aes(x = study_mean)) +
  ggplot2::geom_histogram(bins = 35) +
  ggplot2::facet_wrap(~ n, scales = "free_y") +
  ggplot2::labs(
    title = "Study Means Become More Concentrated as Sample Size Increases",
    x = "Study Mean",
    y = "Frequency"
  ) +
  ggplot2::theme_minimal()

This reinforces the same lesson: larger samples produce more stable averages.
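The concentration visible in the histograms can also be summarized numerically. The sketch below repeats the same simulation settings (mean 50, standard deviation 12, 3000 replications) in base R and compares the empirical spread of the study means with the textbook value sigma / sqrt(n). The variable names are illustrative.

```r
# Spread of simulated study means versus the theoretical sigma / sqrt(n).
set.seed(1)  # arbitrary seed for reproducibility

sample_sizes <- c(5, 20, 50, 200)
n_reps <- 3000
sigma <- 12

empirical_sd <- sapply(sample_sizes, function(n) {
  # Standard deviation across n_reps simulated study means of size n
  sd(replicate(n_reps, mean(rnorm(n, mean = 50, sd = sigma))))
})

concentration_df <- data.frame(
  n = sample_sizes,
  empirical_sd = empirical_sd,
  theoretical_sd = sigma / sqrt(sample_sizes)
)
print(concentration_df)
```

Quadrupling the sample size roughly halves the spread of the study means, which matches the narrowing seen from one histogram panel to the next.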

A Practical Checklist for Applied Work

Before trusting a sample average, ask:

Is the sample size large enough for the quantity to stabilize?
Is the data-generating process reasonably representative?
Are observations independent enough for the average to behave well?
Could high variance or heavy tails slow convergence?
Am I averaging a meaningful quantity, such as loss, rate, or response?
Have I confused stability of the average with correctness of the model?

These questions improve both modeling and interpretation.

Where This Shows Up in AI/ML

AUC estimates reported from validation sets of 200 to 300 patients, typical for many published military health AI studies, often have confidence intervals too wide to support clinical decisions. The LLN implies that a sample AUC converges to the true AUC as the validation set grows, but it says nothing about how close the estimate is at a given sample size: a reported AUC of 0.81 on n=250 could easily reflect a true AUC anywhere from 0.72 to 0.88. Deployment decisions based on small-sample AUC estimates at military treatment facilities have led to adoption of models that degraded to near-chance performance in prospective use. That failure mode is consistent with the LLN's warning about noisy small-sample averages, and adequate sample size planning reduces the risk.
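The sampling variability of a small-sample AUC can be explored with a short simulation. The sketch below is a base R illustration under stated assumptions: a binormal score model with true AUC 0.81, a validation set of 250 with 50 positives and 200 negatives (the class split is hypothetical; the text gives only the total), and a rank-based AUC helper named `auc_rank()` written for this example.

```r
# Sampling variability of AUC at n = 250 when the true AUC is 0.81.
set.seed(2024)  # arbitrary seed for reproducibility

# Rank-based (Mann-Whitney) AUC; label is 1 for positives, 0 for negatives
auc_rank <- function(score, label) {
  r <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

true_auc <- 0.81
# Binormal model with unit variances: AUC = pnorm(shift / sqrt(2))
shift <- sqrt(2) * qnorm(true_auc)

label <- rep(c(1, 0), times = c(50, 200))   # assumed class split, n = 250

sim_auc <- replicate(2000, {
  score <- rnorm(length(label), mean = shift * label, sd = 1)
  auc_rank(score, label)
})

auc_spread <- quantile(sim_auc, probs = c(0.025, 0.5, 0.975))
print(auc_spread)
```

The printed quantiles describe how much a sample AUC moves across repeated validation sets of this size when the true AUC is fixed. This is a sampling distribution under an assumed model, not a confidence interval for any specific study.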

Closing: Reliability Comes from Repetition, Not Wishful Thinking

The Law of Large Numbers is one of the key reasons analysts can learn from data.

It tells us that repeated observations, under appropriate conditions, allow sample averages to approach true population quantities.

That idea supports:

clinical estimation,
trial monitoring,
model evaluation,
empirical risk minimization,
and the consistency of many familiar estimators.

It does not remove bias. It does not rescue poor data. And it does not guarantee that every large dataset is informative.

But it does explain why averaging across enough observations can transform noise into something interpretable.

The Law of Large Numbers is one of the quiet foundations of reliable science and machine learning: it explains why enough data can make averages worth trusting.

📚 Go Deeper: Prediction Modeling Toolkit

This post is part of the Prediction Modeling Toolkit — a companion reference with convergence diagnostics, empirical risk estimation templates, and sample-size sensitivity checks.


Series Callout


This post is part of a broader Applied Statistics for AI and Clinical Decision-Making Series:

Probability fundamentals for machine learning
Random variables and expectation
Common probability distributions
Central Limit Theorem
Law of Large Numbers
Sampling methods for Biostats and ML
Hypothesis testing in the age of AI
Confidence intervals
Maximum likelihood estimation
Bayesian inference
Linear regression
Logistic regression
Generalized linear models
Analysis of variance
Principal component analysis
Cluster analysis
Time series analysis
Survival analysis
Non-parametric methods
Bias-variance tradeoff
Regularization
Cross-validation
Information theory
Optimization techniques
Linear algebra basics
Calculus for ML
Monte Carlo methods
Dimensionality curse and reduction techniques
Model evaluation metrics
Ensemble methods


References

Casella, George, and Roger L. Berger. 2002. Statistical Inference. 2nd ed. Duxbury.

DeGroot, Morris H., and Mark J. Schervish. 2012. Probability and Statistics. 4th ed. Pearson.

Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning: With Applications in R. 2nd ed. Springer.

Wasserman, Larry. 2004. All of Statistics: A Concise Course in Statistical Inference. Springer.