Probability Foundations

Applied Statistics for AI & Clinical Decision-Making — Lecture 1 of 10

Jonathan D. Stallings, PhD, MS

Data InDeed | dataindeed.org

2026-01-01

Probability is not a prerequisite.

It is the language.

What You’ll Learn Today

Three posts · One lecture · ~50 minutes

Post 1 How Probability Powers AI

What probability is
Bayes’ theorem
Contingency tables

Post 2 Random Variables

Discrete vs. continuous
Expected value & variance
Distributions as models

Post 3 Key Distributions

Normal, Bernoulli, Poisson
Clinical applications
Choosing the right one

Part 1

How Probability Powers AI

From spam filters to clinical triage

Why Probability? The Honest Answer

AI does not eliminate uncertainty. It formalizes decisions under uncertainty.

Nearly every AI/ML system that outputs a “prediction” is really estimating a conditional probability:

Email classifier: P(spam | features)
Triage algorithm: P(deterioration | vitals, labs)
Diagnostic model: P(injury | imaging + mechanism)

In trauma care, vitals may be incomplete, the reported mechanism may be wrong, and hemorrhage may be occult.

Probability gives us a disciplined way to reason under incomplete information.

Kolmogorov’s Three Axioms

All of probability theory rests on three rules:

Axiom 1 (Non-negativity): \(P(A) \geq 0\)

Axiom 2 (Normalization): \(P(S) = 1\)

Axiom 3 (Additivity): If A and B are mutually exclusive, \(P(A \cup B) = P(A) + P(B)\). The same holds for any countable collection of mutually exclusive events.

Why they matter for AI: A classifier that outputs incoherent probabilities (e.g., negative values, or probabilities over mutually exclusive classes that don’t sum to 1) is mathematically invalid, not just poorly calibrated.
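A minimal sketch of what coherence looks like in code, assuming a model that returns one probability per mutually exclusive class. The helper names `is_coherent` and `softmax`, the tolerance, and the scores are illustrative assumptions, not part of any particular library.

```r
# Hypothetical helper: checks Axioms 1 and 2 for a vector of class
# probabilities over mutually exclusive, exhaustive classes.
is_coherent <- function(p, tol = 1e-8) {
  all(p >= 0) && abs(sum(p) - 1) < tol
}

# Softmax turns arbitrary real-valued scores into coherent probabilities.
softmax <- function(z) {
  e <- exp(z - max(z))  # subtract the max for numerical stability
  e / sum(e)
}

raw_scores <- c(stable = 2.0, urgent = 0.5, critical = -1.0)  # made-up scores

is_coherent(softmax(raw_scores))   # TRUE
is_coherent(c(0.7, 0.5, 0.1))      # FALSE: sums to 1.3
is_coherent(c(1.2, -0.2))          # FALSE: contains a negative value
```

Passing this check says nothing about calibration; it only confirms the outputs are valid probabilities in the first place.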

Joint, Marginal & Conditional Probability

The three types you’ll use constantly:

Type	Definition	Example
Joint	\(P(A \cap B)\), also written \(P(A, B)\)	P(contains “free” AND is spam)
Marginal	\(P(A) = \sum_{b} P(A, B = b)\)	P(spam) regardless of words
Conditional	\(P(A \mid B) = \frac{P(A,B)}{P(B)}\), defined when \(P(B) > 0\)	P(spam | contains “free”)

Trauma translation:

Marginal = overall mortality rate in the registry
Conditional = mortality given hemorrhagic shock AND no prehospital TXA
Joint = probability of both shock AND ICU admission

Contingency Tables: Probability Engines

A 2×2 table gives you the entire joint probability space:

email_df <- tibble::tibble(
  contains_free = c("Yes","Yes","Yes","Yes","No","No","No","No","Yes","No"),
  spam          = c("Yes","Yes","No","Yes","No","No","Yes","No","Yes","No")
)

ct <- table(email_df$contains_free, email_df$spam)
prop.table(ct)        # joint probabilities

     
       No Yes
  No  0.4 0.1
  Yes 0.1 0.4

Row sums → marginal P(contains_free)
Column sums → marginal P(spam)
Each cell → joint P(contains_free, spam)
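The marginals and a conditional probability can be read off the same table. The following is a sketch using base R only; it rebuilds the ten emails as plain vectors so it runs on its own, and the variable names are illustrative.

```r
# Same ten emails as in the table above, rebuilt so this block is self-contained.
contains_free <- c("Yes","Yes","Yes","Yes","No","No","No","No","Yes","No")
spam          <- c("Yes","Yes","No","Yes","No","No","Yes","No","Yes","No")

joint <- prop.table(table(contains_free, spam))   # joint probabilities

p_free <- rowSums(joint)   # marginal P(contains_free)
p_spam <- colSums(joint)   # marginal P(spam)

# Conditional = joint / marginal of the conditioning event
p_spam_given_free <- unname(joint["Yes", "Yes"] / p_free["Yes"])
p_spam_given_free          # 0.8

# margin = 1 conditions on every row at once: each row sums to 1
prop.table(table(contains_free, spam), margin = 1)
```

Here P(spam | contains “free”) = 0.4 / 0.5 = 0.8, well above the marginal P(spam) = 0.5, which is exactly the kind of shift a spam filter exploits.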

Bayes’ Theorem: The Engine of Belief Updating

\[P(\theta \mid \text{data}) = \frac{P(\text{data} \mid \theta) \cdot P(\theta)}{P(\text{data})}\]

Posterior ∝ Likelihood × Prior

In clinical triage:

Prior — baseline injury rate in this patient population
Likelihood — how often this vital-sign pattern occurs given injury
Posterior — updated probability of injury given what you’ve observed

Why it matters for AI:

Every Bayesian model, naive Bayes classifier, and probabilistic neural network is applying this one formula.

Quick R: Bayes’ Theorem from a Table

# Scenario: 10,000 trauma patients, 15% have occult hemorrhage
# FAST exam: sensitivity = 0.85, specificity = 0.92

n <- 10000       # cohort size (not needed for the PPV formula below)
p_hem    <- 0.15
sens     <- 0.85
spec     <- 0.92

# Compute positive predictive value (PPV) — Bayes in one formula
p_pos_given_hem    <- sens
p_pos_given_no_hem <- 1 - spec

p_pos <- p_pos_given_hem * p_hem + p_pos_given_no_hem * (1 - p_hem)

ppv <- (p_pos_given_hem * p_hem) / p_pos
tibble::tibble(PPV = round(ppv, 3), FDR = round(1 - ppv, 3))

# A tibble: 1 × 2
    PPV   FDR
  <dbl> <dbl>
1 0.652 0.348
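To see how strongly the prior drives the posterior, the same formula can be evaluated across several base rates. This is a sketch: the helper name `ppv_from` and the list of prevalences are illustrative assumptions, while the sensitivity and specificity are the values used above.

```r
# Hypothetical helper: PPV from prevalence, sensitivity, and specificity.
ppv_from <- function(prev, sens, spec) {
  true_pos  <- sens * prev                # P(test+ and disease)
  false_pos <- (1 - spec) * (1 - prev)    # P(test+ and no disease)
  true_pos / (true_pos + false_pos)       # Bayes' theorem
}

prevalence <- c(0.01, 0.05, 0.15, 0.50)   # illustrative base rates

data.frame(
  prevalence = prevalence,
  ppv        = round(ppv_from(prevalence, sens = 0.85, spec = 0.92), 3)
)
```

With the test held fixed, PPV rises steadily with prevalence, so the same positive FAST exam means something quite different in a low-risk population than in a high-risk one.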

Part 2

Random Variables

The mathematical bridge between events and numbers

What Is a Random Variable?

A random variable maps outcomes of a random process to numbers we can compute with.

Discrete random variables take countable values:

Number of transfusions in 24 hrs
Injury Severity Score (ISS)
30-day readmission (0 or 1)

Continuous random variables take any value in a range:

Systolic blood pressure
Time to OR
Lab values (lactate, Hgb)

The distribution of a random variable tells you how likely each value (or range of values) is.

Expected Value and Variance

Two numbers summarize a distribution’s center and spread:

Expected value (mean): \(E[X] = \sum_x x \cdot P(X=x)\) for discrete X; for continuous X with density \(f\), \(E[X] = \int x \, f(x) \, dx\)

Variance: \(\text{Var}(X) = E[(X - \mu)^2]\)

Standard deviation: \(\text{SD}(X) = \sqrt{\text{Var}(X)}\)

iss_sim <- tibble::tibble(
  iss = c(9,16,25,9,4,36,16,25,9,16,25,75,9,4,16)
)
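# Sample estimates of the quantities above; sd() and var() divide by n - 1
iss_sim |>
  dplyr::summarise(
    mean_iss  = mean(iss),   # sample mean, estimates E[X]
    sd_iss    = sd(iss),     # sample standard deviation
    var_iss   = var(iss)     # sample variance (sd squared)
  )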

# A tibble: 1 × 3
  mean_iss sd_iss var_iss
     <dbl>  <dbl>   <dbl>
1     19.6   17.8    315.
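The summary above estimates the mean and variance from a sample. When the distribution itself is known, the definitions can be applied directly. A minimal sketch, assuming a made-up probability mass function for the number of units transfused in 24 hours:

```r
# Made-up PMF for X = units transfused in 24 hrs (illustration only)
x <- 0:3
p <- c(0.50, 0.30, 0.15, 0.05)
stopifnot(abs(sum(p) - 1) < 1e-12)   # Axiom 2: probabilities sum to 1

ev  <- sum(x * p)             # E[X] = sum of x * P(X = x)
vr  <- sum((x - ev)^2 * p)    # Var(X) = E[(X - mu)^2]
sdv <- sqrt(vr)               # SD(X)

c(mean = ev, var = vr, sd = sdv)
```

For this PMF the mean is 0.75 and the variance is 0.7875. No n - 1 correction appears because nothing is being estimated from data.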

Why Variance Matters in Clinical AI

Two models with the same mean prediction can have very different clinical implications:

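set.seed(123)  # reproducible draws
# Two hypothetical models: same mean predicted risk (0.30), different spread.
# Normal draws are for illustration only; some model_b draws fall below 0,
# which a real risk model must never output.
preds <- tibble::tibble(
  model_a = rnorm(500, mean = 0.3, sd = 0.05),
  model_b = rnorm(500, mean = 0.3, sd = 0.18)
) |>
  tidyr::pivot_longer(everything(), names_to = "model", values_to = "pred_risk")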

ggplot2::ggplot(preds, ggplot2::aes(x = pred_risk, fill = model)) +
  ggplot2::geom_density(alpha = 0.5) +
  ggplot2::labs(title = "Same mean risk — very different uncertainty",
                x = "Predicted Risk", y = "Density") +
  theme_di()  # series theme, not defined on this slide; ggplot2::theme_minimal() works as a stand-in
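One way to put a number on the difference is to ask how often each model would push a patient past a decision threshold. This is a sketch: the 0.5 threshold is a hypothetical choice, and the calculation uses the same Normal parameters as the simulation rather than the simulated draws.

```r
threshold <- 0.5   # hypothetical decision threshold on predicted risk

# Upper-tail probability under each model's Normal(mean, sd)
p_above_a <- pnorm(threshold, mean = 0.3, sd = 0.05, lower.tail = FALSE)
p_above_b <- pnorm(threshold, mean = 0.3, sd = 0.18, lower.tail = FALSE)

c(model_a = p_above_a, model_b = p_above_b)
```

Model A almost never crosses the threshold (it sits four standard deviations away), while Model B crosses it for roughly one patient in eight, even though both models predict a 30% risk on average.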

Part 3

Key Probability Distributions

The models behind the data

The Distribution Zoo: Which to Use When

Distribution	Data type	Classic use in trauma/AI
Bernoulli	Binary outcome (0/1)	Mortality, complication (yes/no)
Binomial	Count of successes in n trials	# of CPG-compliant cases out of 20
Poisson	Count of events in a fixed interval	ED arrivals per hour, rare complications
Normal	Continuous, symmetric	Lab values, physiologic scores
Exponential	Time until event	Time to hemorrhage control
Beta	Proportions (0–1)	Prior on compliance rate
Gamma / Weibull	Skewed positive continuous	Survival time, ICU LOS
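Each distribution in the table has matching base R functions with the prefixes d (density or mass), p (cumulative probability), q (quantile), and r (random draws). The sketch below shows one call per family; every parameter value is a made-up number for illustration, not an estimate from any dataset.

```r
# All inputs below are made-up numbers for illustration.
probs <- c(
  binomial    = dbinom(18, size = 20, prob = 0.9),  # P(exactly 18 of 20 cases compliant)
  poisson     = ppois(2, lambda = 3),               # P(at most 2 arrivals in an hour)
  normal      = pnorm(90, mean = 120, sd = 15),     # P(systolic BP below 90 mmHg)
  exponential = pexp(30, rate = 1 / 45)             # P(control within 30 min), mean 45 min
)
round(probs, 3)

# Beta(18, 4) as a prior on a compliance rate: central 95% interval
beta_interval <- qbeta(c(0.025, 0.975), shape1 = 18, shape2 = 4)
round(beta_interval, 3)
```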

The Normal Distribution: Why It’s Everywhere

x <- seq(-4, 4, length.out = 300)

tibble::tibble(x = x, y = dnorm(x)) |>
  ggplot2::ggplot(ggplot2::aes(x, y)) +
  ggplot2::geom_line(linewidth = 1.2, color = "#2563eb") +
  ggplot2::geom_area(fill = "#2563eb", alpha = 0.12) +
  ggplot2::labs(title = "Standard Normal N(0,1)",
                x = "Standard deviations from mean", y = "Density") +
  theme_di()  # series theme, not defined on this slide; ggplot2::theme_minimal() works as a stand-in

The 68-95-99.7 rule:

About 68% of values fall within ±1 SD of the mean
About 95% within ±2 SD
About 99.7% within ±3 SD
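The rule can be checked directly from the standard Normal cumulative distribution function, using base R only:

```r
k <- 1:3
coverage <- pnorm(k) - pnorm(-k)   # P(-k < Z < k) for a standard Normal Z
round(coverage, 4)                 # 0.6827 0.9545 0.9973

qnorm(0.975)   # about 1.96: the exact multiplier for 95% coverage
```

The familiar 1.96 in confidence intervals is the precise version of the “±2 SD” shorthand.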

It’s ubiquitous because of the CLT: sums and averages of many independent variables with finite variance are approximately Normal (Lecture 2).

Clinical: Systolic BP, hematocrit, temperature — all approximately Normal in stable populations.

Poisson: When Events Are Rare and Independent

lambdas <- c(1, 3, 8)
pois_df <- purrr::map_dfr(lambdas, function(l) {
  tibble::tibble(
    x = 0:20,
    prob = dpois(0:20, lambda = l),
    lambda = paste0("λ = ", l)
  )
})

ggplot2::ggplot(pois_df, ggplot2::aes(x, prob, fill = lambda)) +
  ggplot2::geom_col(position = "dodge", alpha = 0.85) +
  ggplot2::scale_fill_brewer(palette = "Blues") +
  ggplot2::labs(title = "Poisson distribution — three rates",
                x = "Count", y = "P(X = x)", fill = NULL) +
  theme_di()  # series theme, not defined on this slide; ggplot2::theme_minimal() works as a stand-in

Clinical application: Rare adverse events such as anastomotic leaks, intraoperative cardiac arrest, and battlefield tourniquet failures. When events occur independently at a roughly constant rate and mean ≈ variance, Poisson is a reasonable default model.
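A quick way to test the mean ≈ variance condition is the dispersion ratio, variance divided by mean. The sketch below uses made-up monthly event counts; the variable names are illustrative, and a formal analysis would fit a model rather than eyeball a ratio.

```r
# Made-up monthly counts of a rare adverse event (illustration only)
events <- c(0, 1, 0, 2, 1, 0, 0, 3, 1, 0, 1, 2)

lambda_hat <- mean(events)               # estimated Poisson rate per month
dispersion <- var(events) / lambda_hat   # near 1 is consistent with Poisson

# Under Poisson(lambda_hat): probability of 3 or more events in a month
p_three_plus <- ppois(2, lambda = lambda_hat, lower.tail = FALSE)

c(lambda_hat = lambda_hat, dispersion = dispersion, p_three_plus = p_three_plus)
```

A ratio well above 1 signals overdispersion, the situation where a Negative Binomial model is usually the better choice.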

Choosing a Distribution: The Decision Tree

Is the outcome binary (0/1)?
  → Bernoulli / Binomial

Is the outcome a count?
  → Poisson (if mean ≈ variance)
  → Negative Binomial (if overdispersed)

Is the outcome time-to-event?
  → Exponential / Weibull / Cox (Lecture 6)

Is the outcome continuous and symmetric?
  → Normal

Is the outcome continuous and right-skewed?
  → Gamma / Log-Normal

Is the outcome a proportion (0–1)?
  → Beta (as a prior) / logistic regression

Lecture 1 — Key Takeaways

Probability

Three axioms make probability coherent
Joint, marginal, and conditional probabilities are different views of the same table of numbers
Bayes’ theorem formalizes how evidence updates belief

Random Variables

Map events to numbers
Discrete vs. continuous is a modeling choice
E[X] and Var(X) summarize any distribution

Distributions

Bernoulli/Binomial → binary and count outcomes
Normal → symmetric continuous data
Poisson → rare, independent events
Exponential/Weibull → time-to-event

Clinical Application

Every risk score is a posterior probability
Base rates matter enormously
Distribution choice is a scientific claim, not a formality

Coming Up: Lecture 2

The Laws That Make Statistics Work

Posts 04, 05, 06:

CLT: why sample means become approximately Normal as sample size grows, regardless of the original distribution (given finite variance)
LLN: why sample averages converge to the population mean as samples get larger
Sampling: strategies, bias, sample size, design implications

These three results are why statistics can make reliable inferences from incomplete data.

Read Before Lecture 2

Resources

Series toolkit:

Prediction Modeling Toolkit

Data InDeed:

dataindeed.org
Stats series home: Applied Statistics for AI