Building Better Models: Bias, Variance & Validation

Applied Statistics for AI & Clinical Decision-Making — Lecture 8 of 10

Jonathan D. Stallings, PhD, MS

Data InDeed | dataindeed.org

2026-01-01

A model that memorizes training data has learned nothing.

What You’ll Learn Today

Post 20 Bias-Variance

  • The fundamental tradeoff
  • Underfitting vs. overfitting
  • Decomposing MSE

Post 21 Regularization

  • Ridge (L2)
  • Lasso (L1)
  • Elastic net

Post 22 Cross-Validation

  • k-fold CV
  • Leave-one-out
  • Selecting hyperparameters

Part 1

The Bias-Variance Tradeoff

The fundamental tension in model complexity

Decomposing Prediction Error

\[E[\text{MSE}] = \text{Bias}^2 + \text{Variance} + \text{Irreducible noise}\]

Bias — error from wrong assumptions (underfitting)

  • Model is too simple
  • Misses real patterns
  • High training error AND test error

Variance — error from sensitivity to training data (overfitting)

  • Model is too complex
  • Memorizes noise
  • Low training error, HIGH test error

Irreducible noise — cannot be reduced by any model
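
To make the three terms concrete, here is a sketch of the standard derivation at a single point \(x_0\). The notation is an assumption added for illustration and is not from the slides: \(y = f(x) + \varepsilon\) with \(E[\varepsilon] = 0\) and \(\text{Var}(\varepsilon) = \sigma^2\), and expectations are taken over both the training sets and the noise in the new observation.

\[
\begin{aligned}
E\big[(y_0 - \hat f(x_0))^2\big]
&= E\big[(f(x_0) + \varepsilon - \hat f(x_0))^2\big] \\
&= \underbrace{\big(E[\hat f(x_0)] - f(x_0)\big)^2}_{\text{Bias}^2}
 + \underbrace{E\big[(\hat f(x_0) - E[\hat f(x_0)])^2\big]}_{\text{Variance}}
 + \underbrace{\sigma^2}_{\text{Irreducible noise}}
\end{aligned}
\]

The cross terms vanish because \(\varepsilon\) has mean zero and is independent of the fitted model, and because \(\hat f(x_0) - E[\hat f(x_0)]\) has mean zero.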

Registry analogy:

A linear model for ISS → mortality may be biased (relationship is not linear at extremes).

A 500-node neural network on 200 patients is high variance — it memorizes those 200 patients but won’t generalize.

The Bias-Variance Curve

library(tibble)
library(ggplot2)

# Stylized curves for illustration, not fitted to data:
# bias² falls and variance rises as complexity grows.
tibble(
  complexity = 1:10,
  bias_sq = (5 / (1:10))^2,
  variance = (0.3 * (1:10))^2
) |>
  dplyr::mutate(total = bias_sq + variance + 4) |>  # 4 = irreducible noise
  tidyr::pivot_longer(-complexity) |>
  ggplot(aes(complexity, value, color=name, linewidth=name)) +
  geom_line() +
  scale_color_manual(values=c("bias_sq"="#e63946","variance"="#2563eb",
                              "total"="#1b2e4b")) +
  scale_linewidth_manual(values=c("bias_sq"=0.9,"variance"=0.9,"total"=1.4)) +
  labs(title="Bias² + Variance + noise = total expected error",
       x="Model complexity", y="Error", color=NULL, linewidth=NULL) +
  theme_di()  # course theme, assumed to be defined in the deck's setup code

The sweet spot: find the complexity that minimizes total error — not bias alone or variance alone.
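
The tradeoff can also be estimated by simulation instead of drawn as a stylized curve. The sketch below is a minimal illustration under stated assumptions: the true signal `f`, the noise level `sigma`, the sample size, and the use of polynomial degree as the "complexity" axis are all hypothetical choices, not taken from the lecture. It refits each model on many simulated training sets and measures how far the average prediction sits from the truth (bias²) and how much the predictions move between training sets (variance).

```r
set.seed(42)
f <- function(x) sin(2 * x)            # assumed "true" signal (hypothetical)
sigma <- 0.5                           # assumed noise SD
x0 <- seq(0.2, 2.8, length.out = 25)   # points where predictions are compared
n <- 50                                # training-set size
B <- 200                               # number of simulated training sets

decomp <- do.call(rbind, lapply(1:8, function(d) {
  # Each column of preds holds the predictions at x0 from one training set
  preds <- replicate(B, {
    train <- data.frame(x = runif(n, 0, 3))
    train$y <- f(train$x) + rnorm(n, sd = sigma)
    fit <- lm(y ~ poly(x, d), data = train)
    predict(fit, newdata = data.frame(x = x0))
  })
  data.frame(degree   = d,
             bias_sq  = mean((rowMeans(preds) - f(x0))^2),
             variance = mean(apply(preds, 1, var)))
}))
decomp$total <- decomp$bias_sq + decomp$variance + sigma^2
print(round(decomp, 3))
```

Reading down the printed table, the degree with the smallest `total` plays the role of the sweet spot for this particular simulated setup.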

Part 2

Regularization

Penalizing complexity to prevent overfitting

Ridge (L2) and Lasso (L1)

Ridge: minimize \(\text{RSS} + \lambda \sum_j \beta_j^2\)

Lasso: minimize \(\text{RSS} + \lambda \sum_j |\beta_j|\)

| Property | Ridge | Lasso |
|---|---|---|
| Shrinks coefficients | Yes | Yes |
| Sets coefficients to zero | No | Yes |
| Variable selection | No | Yes |
| Best when | All features matter | Many irrelevant features |
| Solution | Closed form | Coordinate descent |

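Elastic Net combines both penalties: minimize \(\text{RSS} + \lambda\left[\alpha \sum_j |\beta_j| + (1-\alpha) \sum_j \beta_j^2\right]\), so \(\alpha = 1\) gives the lasso and \(\alpha = 0\) gives ridge.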
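
The "Solution" row of the comparison can be illustrated in base R. This is a minimal sketch, not the algorithm glmnet uses internally: `ridge()`, `soft()`, and `lasso_cd()` are hypothetical helper names, the data are simulated, and predictors are standardized with a centered outcome so no intercept is needed. The penalties are written exactly as on the slide (RSS + λ × penalty), so λ here is on a different scale from glmnet's λ, which divides the RSS by 2n.

```r
set.seed(1)
n <- 200; p <- 6
X <- scale(matrix(rnorm(n * p), n, p))   # centered, unit-variance columns
beta_true <- c(2, -1.5, 0, 0, 0, 0)      # hypothetical: two real signals
y <- as.numeric(X %*% beta_true + rnorm(n))
y <- y - mean(y)                         # center the outcome

# Ridge: closed form (X'X + lambda * I)^{-1} X'y
ridge <- function(X, y, lambda) {
  as.numeric(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
}

# Lasso: cyclic coordinate descent with soft-thresholding.
# For RSS + lambda * sum(|beta|), each update thresholds at lambda / 2.
soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)
lasso_cd <- function(X, y, lambda, n_iter = 200) {
  beta <- rep(0, ncol(X))
  for (it in seq_len(n_iter)) {
    for (j in seq_along(beta)) {
      # Partial residual: remove the fit from every feature except j
      r_j <- y - as.numeric(X[, -j, drop = FALSE] %*% beta[-j])
      beta[j] <- soft(sum(X[, j] * r_j), lambda / 2) / sum(X[, j]^2)
    }
  }
  beta
}

lambda  <- 150
b_ols   <- ridge(X, y, 0)        # lambda = 0 reduces to ordinary least squares
b_ridge <- ridge(X, y, lambda)
b_lasso <- lasso_cd(X, y, lambda)
print(round(cbind(ols = b_ols, ridge = b_ridge, lasso = b_lasso), 3))
```

In the printed comparison, ridge pulls every coefficient toward zero without eliminating any, while the soft-threshold step is what lets the lasso set coefficients to exactly zero.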

Lasso Path: Automatic Feature Selection

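library(glmnet)

# Simulate 200 observations of 20 standard-normal features
n <- 200; p <- 20
X <- matrix(rnorm(n*p), n, p)
# Only first 4 features truly predict outcome
beta_true <- c(2,-1.5,1,-0.8, rep(0,16))
y <- X %*% beta_true + rnorm(n)

# alpha=1 selects the lasso penalty; glmnet fits the whole λ path in one call
fit_lasso <- glmnet(X, y, alpha=1)
plot(fit_lasso, xvar="lambda", label=TRUE,
     main="Lasso coefficient paths — features shrink to zero")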

As λ increases (more regularization), coefficients shrink and are set to exactly zero one by one. The 4 true predictors (1–4) survive longest.
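
The same information can be read off the fitted object instead of the plot. A minimal sketch, assuming the glmnet package is installed; the data are re-simulated here so the block runs on its own, and the seed and the penalty value `s = 0.4` are arbitrary illustrative choices.

```r
library(glmnet)

set.seed(2026)
n <- 200; p <- 20
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(2, -1.5, 1, -0.8, rep(0, 16))
y <- as.numeric(X %*% beta_true + rnorm(n))

fit <- glmnet(X, y, alpha = 1)

# fit$df holds the number of nonzero coefficients at each lambda on the path
path <- data.frame(lambda = fit$lambda, n_nonzero = fit$df)
print(head(path, 10))

# Which features are still active at a fairly heavy penalty?
b <- as.numeric(coef(fit, s = 0.4))[-1]   # drop the intercept
active <- which(b != 0)
print(active)
```

With a penalty this heavy, the active set should be dominated by the true predictors, although the exact set depends on the simulated draw.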

Choosing λ with Cross-Validation

cv_lasso <- cv.glmnet(X, y, alpha=1, nfolds=10)
plot(cv_lasso)
cat("Optimal λ:", round(cv_lasso$lambda.min, 4),
    "\nλ 1-SE rule:", round(cv_lasso$lambda.1se, 4))
Optimal λ: 0.0489 
λ 1-SE rule: 0.1495

1-SE rule: prefer the simplest model within 1 standard error of the minimum CV error — favors parsimony.
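
To see what the 1-SE rule buys in practice, one can compare how many features each choice of λ keeps. This is a minimal sketch assuming glmnet is installed; the data are re-simulated with an arbitrary seed, and `n_active()` is a hypothetical helper, so the counts will not match the slide output above.

```r
library(glmnet)

set.seed(2026)
n <- 200; p <- 20
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(2, -1.5, 1, -0.8, rep(0, 16))
y <- as.numeric(X %*% beta_true + rnorm(n))

cv <- cv.glmnet(X, y, alpha = 1, nfolds = 10)

# Hypothetical helper: count nonzero slopes at a given choice of lambda
n_active <- function(s) sum(as.numeric(coef(cv, s = s))[-1] != 0)

print(c(lambda.min = cv$lambda.min, lambda.1se = cv$lambda.1se))
print(c(active_at_min = n_active("lambda.min"),
        active_at_1se = n_active("lambda.1se")))

# Features retained under the more parsimonious 1-SE choice
kept_1se <- which(as.numeric(coef(cv, s = "lambda.1se"))[-1] != 0)
print(kept_1se)
```

Because `lambda.1se` is never smaller than `lambda.min`, it typically keeps fewer noise features at a small cost in CV error.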

Part 3

Cross-Validation

The honest way to estimate generalization error

k-Fold Cross-Validation

1. Divide the data into k roughly equal folds
2. For each fold i = 1, ..., k:
   - Train on all folds except i
   - Evaluate on fold i
3. Average performance across all k test folds

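library(tibble)

# Manual k-fold CV: returns one held-out RMSE per fold
manual_kfold <- function(df, k=5) {
  # Randomly assign each row to one of k roughly equal folds
  folds <- sample(rep(1:k, length.out=nrow(df)))
  sapply(1:k, function(i) {
    train <- df[folds != i, ]; test <- df[folds == i, ]
    fit <- lm(los ~ iss + sbp, data=train)
    sqrt(mean((test$los - predict(fit, test))^2))  # RMSE on held-out fold i
  })
}
# Simulated registry-style data: LOS depends on ISS and SBP plus noise
df_cv <- tibble(iss=rnorm(200,28,12), sbp=rnorm(200,110,20),
                los=2+0.4*iss-0.02*sbp+rnorm(200,0,4))
rmse_folds <- manual_kfold(df_cv)
tibble(fold=1:5, RMSE=round(rmse_folds,2)) |> print()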
# A tibble: 5 × 2
   fold  RMSE
  <int> <dbl>
1     1  4.34
2     2  4.8 
3     3  3.42
4     4  3.69
5     5  3.69
cat("CV RMSE:", round(mean(rmse_folds),3), "±", round(sd(rmse_folds),3))
CV RMSE: 3.99 ± 0.569

Leave-One-Out vs. k-Fold

| | LOO-CV | k-Fold (k=5 or 10) |
|---|---|---|
| Bias | Very low | Low |
| Variance | High | Lower |
| Computation | Expensive (n fits) | Cheap (k fits) |
| Best for | Small datasets | Most situations |

General recommendation: 5- or 10-fold CV is the standard. Use LOO when n < 50.
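
For ordinary least squares specifically, the "n fits" cost of LOO can be avoided: the leave-one-out prediction error for observation i equals its residual divided by (1 - h_ii), where h_ii is its leverage from a single fit on all the data. The sketch below checks that identity on simulated data. The variable names mirror the k-fold example, but the data, seed, and sample size are illustrative assumptions, and the shortcut does not carry over to models such as the lasso.

```r
set.seed(7)
n <- 40   # small n, the setting where LOO is suggested
df <- data.frame(iss = rnorm(n, 28, 12), sbp = rnorm(n, 110, 20))
df$los <- 2 + 0.4 * df$iss - 0.02 * df$sbp + rnorm(n, 0, 4)

# Explicit LOO: refit the model n times, each time holding out one row
loo_err <- sapply(seq_len(n), function(i) {
  fit <- lm(los ~ iss + sbp, data = df[-i, ])
  df$los[i] - predict(fit, newdata = df[i, ])
})

# Shortcut: one fit, then rescale each residual by its leverage
fit_all   <- lm(los ~ iss + sbp, data = df)
loo_short <- residuals(fit_all) / (1 - hatvalues(fit_all))

cat("LOO RMSE (n fits):  ", sqrt(mean(loo_err^2)), "\n")
cat("LOO RMSE (shortcut):", sqrt(mean(loo_short^2)), "\n")
```

The two RMSE values agree up to floating-point error, which is why LOO is cheap for linear models even though it is expensive in general.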

Temporal validation for registry models: In clinical practice, always validate forward in time. Split by enrollment date — train on 2019-2022, test on 2023-2024. This catches data drift that k-fold CV (which ignores time) cannot detect.
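
A forward-in-time split takes only a few lines. This is a minimal sketch on simulated data: the column name `enroll_date`, the cutoff date, and the outcome model are hypothetical stand-ins for a real registry extract.

```r
set.seed(11)
n <- 600
all_days <- seq(as.Date("2019-01-01"), as.Date("2024-12-31"), by = "day")
reg <- data.frame(
  enroll_date = sample(all_days, n, replace = TRUE),  # hypothetical column
  iss = rnorm(n, 28, 12),
  sbp = rnorm(n, 110, 20)
)
reg$los <- 2 + 0.4 * reg$iss - 0.02 * reg$sbp + rnorm(n, 0, 4)

# Train on 2019-2022, test on 2023-2024: no random shuffling across time
cutoff <- as.Date("2023-01-01")
train  <- reg[reg$enroll_date <  cutoff, ]
test   <- reg[reg$enroll_date >= cutoff, ]

fit <- lm(los ~ iss + sbp, data = train)
rmse_forward <- sqrt(mean((test$los - predict(fit, newdata = test))^2))
cat("Train n:", nrow(train), " Test n:", nrow(test),
    " Forward RMSE:", round(rmse_forward, 3), "\n")
```

Because the simulated relationship does not change over time, the forward RMSE here should look similar to a k-fold estimate. In a real registry, a gap between the two is the signal of drift that the temporal split is meant to expose.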

Lecture 8 — Key Takeaways

Bias-Variance

  • MSE = Bias² + Variance + noise
  • Underfitting (high bias) = model too simple
  • Overfitting (high variance) = model too complex
  • Goal: find complexity that minimizes total error

Regularization

  • Ridge: shrinks all coefficients
  • Lasso: zeroes out irrelevant features (selection)
  • Elastic net: combines both
  • λ chosen by cross-validation

Cross-Validation

  • k-fold: standard approach (k=5 or 10)
  • LOO: for very small n
  • Temporal CV: required for longitudinal registry data
  • CV error = honest estimate of generalization

The meta-lesson: A model evaluated only on training data tells you nothing about how it will perform on future patients.

Coming Up: Lecture 9

Model Evaluation & Ensembles

Posts 29, 30, 17:

  • Metrics That Matter — AUC, F1, calibration, clinical utility
  • Ensembles — bagging, boosting, random forests
  • Time Series — autoregression, seasonality, forecasting