Regularization Techniques: Taming Wild ML Models

Applied Statistics

AI and Clinical Decision-Making

A practical introduction to ridge, lasso, elastic net, and cross-validated penalty tuning for controlling overfitting in AI and applied statistics.

Published

November 15, 2024

Modified

June 9, 2026

Executive Summary

One of the most common reasons predictive models fail is that they become too eager.

They chase noise. They react too strongly to quirks of the training sample. They produce coefficients that look impressive in-sample but behave poorly out-of-sample.

This is the problem regularization was built to address (Hoerl and Kennard 1970; Tibshirani 1996; Zou and Hastie 2005).

Regularization adds structure and restraint to model fitting. Instead of asking only:

which coefficients minimize training error?

it also asks:

how large should those coefficients be allowed to become?
how much complexity is justified by the data?
and how can we stabilize prediction in high-dimensional settings?

This matters especially when:

predictors are numerous,
collinearity is present,
signal is weak relative to noise,
or the model is vulnerable to overfitting.

This post introduces:

L2 regularization, or ridge regression,
L1 regularization, or lasso,
elastic net as a hybrid approach,
tuning lambda with cross-validation,
and coefficient shrinkage as a visual way to understand model behavior.

Regularization matters because a model that fits too freely often learns the sample better than it learns the signal.

Regularization Is a Response to Model Instability

Ordinary least squares estimates coefficients by minimizing residual error.

That works well in some settings, but it can become unstable when:

predictors are highly correlated,
the number of predictors is large,
noise is substantial,
or the sample size is limited relative to model complexity.

In those cases, unconstrained fitting can produce coefficient estimates that are:

too large,
too variable,
and too sensitive to small changes in the data.

Regularization addresses this by penalizing model complexity.

This often increases bias slightly, but reduces variance enough to improve generalization.

That is why regularization is one of the clearest practical responses to the bias-variance tradeoff.
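A small simulation can make the tradeoff concrete. The sketch below is an illustration under assumed settings, not part of the analysis in this post: the sample size, noise level, and penalty value (`lambda = 10`) are arbitrary choices, and `simulate_fit()` is a hypothetical helper. It refits an unpenalized and an L2-penalized model on many simulated datasets in which `x2` is a near-copy of `x1`, then compares how much the `x1` coefficient moves from one dataset to the next.

```r
set.seed(123)

# One simulated dataset: x2 is a near-copy of x1, and only x1 carries signal.
# Returns the x1 coefficient from an unpenalized fit and from an L2-penalized fit.
simulate_fit <- function(n = 50, lambda = 10) {
  x1 <- rnorm(n)
  x2 <- x1 + rnorm(n, sd = 0.1)
  y  <- 2 * x1 + rnorm(n)

  x <- scale(cbind(x1, x2), scale = FALSE)  # center columns (removes the intercept)
  y <- y - mean(y)

  xtx <- crossprod(x)     # X'X
  xty <- crossprod(x, y)  # X'y

  c(
    ols   = solve(xtx, xty)[1],                     # no penalty
    ridge = solve(xtx + lambda * diag(2), xty)[1]   # squared-coefficient penalty
  )
}

fits <- replicate(500, simulate_fit())  # 2 x 500 matrix: rows are ols and ridge

coef_sd   <- apply(fits, 1, sd)  # spread of the x1 coefficient across datasets
coef_mean <- rowMeans(fits)      # average x1 coefficient (true value is 2)

round(rbind(sd = coef_sd, mean = coef_mean), 2)
```

In a setup like this one, the penalized coefficient should vary far less across datasets, at the cost of being pulled away from the true value of 2 because the penalty spreads the effect across the two near-identical predictors. That is the bias-for-variance exchange described above.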

Regularization Changes the Optimization Problem

In ordinary regression, the fitting objective is usually something like:

\[ \min_{\beta} \sum_{i=1}^n (y_i - \hat{y}_i)^2 \]

Regularization modifies that objective by adding a penalty term.

The model is no longer rewarded only for fit. It is also penalized for coefficient magnitude.

That means the fitting problem becomes:

\[ \text{loss} + \text{penalty} \]

This is the key conceptual move.

Instead of asking for the best-fitting unconstrained model, regularization asks for the best-fitting model subject to restraint.

That restraint is what stabilizes the fit.
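The phrase "subject to restraint" can be read literally. For convex losses and penalties like the ones covered in this post, the penalized objective has a standard textbook equivalent written as a constrained problem. In the sketch below, \(t\) is an assumed budget on the size of the penalty term, a symbol introduced here for illustration only:

\[ \min_{\beta} \; \text{loss}(\beta) \quad \text{subject to} \quad \text{penalty}(\beta) \le t \]

Each penalty strength corresponds to some budget \(t\): a stronger penalty implies a smaller budget, and a budget large enough to contain the unpenalized solution recovers the unconstrained fit.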

L2 Regularization Shrinks Coefficients Smoothly

L2 regularization, commonly called ridge regression, adds a penalty based on the squared size of the coefficients.

\[ \min_{\beta} \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^p \beta_j^2 \]

Here, \(\lambda\) controls how strongly the penalty is applied.

As \(\lambda\) increases:

coefficients shrink more strongly toward zero
the model becomes less flexible
variance is reduced
but bias may increase

Ridge regression is especially useful when predictors are correlated and the analyst wants to stabilize estimation without necessarily removing variables.
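For squared-error loss, the ridge objective above has a closed-form solution, which makes the shrinkage easy to inspect directly. The following is a minimal base-R sketch under assumptions: `ridge_closed_form()` is a hypothetical helper, the data are simulated, and the lambda grid is arbitrary. It solves \((X^\top X + \lambda I)\beta = X^\top y\) on standardized predictors. Note that `glmnet` scales the loss differently, so its lambda values are not directly comparable to the ones used here.

```r
# Closed-form ridge estimate on standardized predictors and a centered outcome.
# Hypothetical helper for illustration; not part of glmnet.
ridge_closed_form <- function(x, y, lambda) {
  x_std <- scale(x)      # center and scale each column
  y_ctr <- y - mean(y)   # centering the outcome removes the intercept
  p <- ncol(x_std)
  # Solve (X'X + lambda * I) beta = X'y
  drop(solve(crossprod(x_std) + lambda * diag(p), crossprod(x_std, y_ctr)))
}

set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)   # near-duplicate of x1
x3 <- rnorm(n)
x_demo <- cbind(x1, x2, x3)
y_demo <- 2 * x1 - 1.5 * x3 + rnorm(n)

lambdas <- c(0, 1, 10, 100, 1000)   # lambda = 0 is ordinary least squares
beta_path <- sapply(lambdas, function(l) ridge_closed_form(x_demo, y_demo, l))
colnames(beta_path) <- paste0("lambda=", lambdas)

round(beta_path, 3)

# Sum of squared coefficients at each lambda; this shrinks as lambda grows
l2_norms <- colSums(beta_path^2)
round(l2_norms, 3)
```

Reading across the columns shows the behavior listed above: every coefficient moves toward zero as lambda grows, but none is set exactly to zero.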

L1 Regularization Encourages Sparsity

L1 regularization, commonly called lasso, adds a penalty based on the absolute values of the coefficients.

\[ \min_{\beta} \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^p |\beta_j| \]

This penalty has a different effect from ridge.

Instead of only shrinking coefficients, lasso often drives some coefficients exactly to zero.

That means lasso performs both:

shrinkage
and variable selection

This is one reason lasso is so popular in high-dimensional settings. It can produce a simpler, sparser model while still controlling overfitting.
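The source of the exact zeros is easiest to see in a simplified case. Assuming orthonormal predictors (an idealization that rarely holds in real data) and the objectives written above, the lasso solution is the least-squares estimate passed through a soft-thresholding step with threshold \(\lambda / 2\), while the ridge solution is the same estimate divided by \(1 + \lambda\). The sketch below uses made-up least-squares estimates, and `soft_threshold()` is a hypothetical helper.

```r
# Soft-thresholding: move each value toward zero by `threshold`, and set to
# exactly zero anything whose magnitude is at or below the threshold.
soft_threshold <- function(z, threshold) {
  sign(z) * pmax(abs(z) - threshold, 0)
}

beta_ols <- c(2.0, -1.5, 0.4, -0.1)  # hypothetical least-squares estimates
lambda   <- 1

beta_lasso <- soft_threshold(beta_ols, lambda / 2)  # subtract, then clip at zero
beta_ridge <- beta_ols / (1 + lambda)               # rescale only

rbind(ols = beta_ols, lasso = beta_lasso, ridge = beta_ridge)
# lasso row: 1.5 -1.0  0.0  0.0
# ridge row: 1.0 -0.75 0.2 -0.05
```

The two small estimates are removed entirely by the lasso step, whereas ridge keeps all four coefficients at half their original size.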

Elastic Net Combines L1 and L2 Penalties

Sometimes ridge is too soft and lasso is too aggressive.

That is where elastic net becomes useful.

Elastic net combines the L1 and L2 penalties:

\[ \text{loss} + \lambda \left[ \alpha \sum |\beta_j| + (1-\alpha)\sum \beta_j^2 \right] \]

Here:

\(\lambda\) controls overall penalty strength
\(\alpha\) controls the balance between L1 and L2 behavior

Special cases:

\(\alpha = 1\): lasso
\(\alpha = 0\): ridge
\(0 < \alpha < 1\): elastic net

Elastic net is especially useful when predictors are strongly correlated and we want both shrinkage and some sparsity.
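The special cases can be checked by evaluating the penalty term on its own. This is a small sketch of the formula as written above, with a hypothetical helper `enet_penalty()` and made-up coefficients. One caveat worth knowing: `glmnet` parameterizes the same idea with a factor of 1/2 on the squared term, so its penalty values differ slightly from this version.

```r
# Elastic net penalty as written in the formula above (hypothetical helper).
enet_penalty <- function(beta, lambda, alpha) {
  lambda * (alpha * sum(abs(beta)) + (1 - alpha) * sum(beta^2))
}

beta <- c(2, -1.5, 0.5)  # made-up coefficients

pen_lasso <- enet_penalty(beta, lambda = 1, alpha = 1)    # pure L1: sum(|beta|)  = 4
pen_ridge <- enet_penalty(beta, lambda = 1, alpha = 0)    # pure L2: sum(beta^2)  = 6.5
pen_enet  <- enet_penalty(beta, lambda = 1, alpha = 0.5)  # equal blend           = 5.25

c(lasso = pen_lasso, ridge = pen_ridge, elastic_net = pen_enet)
```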

A High-Dimensional Example Makes the Need for Regularization Concrete

To illustrate, we will simulate a regression setting with:

many predictors,
only a few true signals,
and correlation among predictors.

This is exactly the kind of setting where ordinary least squares can become unstable.

library(dplyr)
library(tibble)
library(ggplot2)

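n <- 180   # observations
p <- 20    # predictors

# Two correlated pairs: x2 is a noisy copy of x1, and x4 is a noisy copy of x3
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)
x3 <- rnorm(n)
x4 <- x3 + rnorm(n, sd = 0.2)

# The remaining p - 4 columns are independent noise predictors
x_mat <- cbind(
  x1, x2, x3, x4,
  matrix(rnorm(n * (p - 4)), nrow = n, ncol = p - 4)
)

colnames(x_mat) <- paste0("x", 1:p)

# Only x1 and x3 have nonzero true coefficients
beta_true <- c(2.0, 0.0, -1.5, 0.0, rep(0, p - 4))

# Outcome: intercept 3, linear signal, Gaussian noise with sd = 2
y <- 3 + x_mat %*% beta_true + rnorm(n, sd = 2)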

reg_df <- as.data.frame(x_mat) |>
  tibble::as_tibble() |>
  dplyr::mutate(y = as.numeric(y))

reg_df |>
  dplyr::select(y, dplyr::everything()) |>
  dplyr::slice_head(n = 5)

# A tibble: 5 × 21
       y      x1      x2     x3     x4      x5     x6      x7     x8      x9
   <dbl>   <dbl>   <dbl>  <dbl>  <dbl>   <dbl>  <dbl>   <dbl>  <dbl>   <dbl>
1  6.09   0.721   0.853  -0.248 -0.416 -0.0702 -0.728  0.0303  0.270 -0.832 
2  0.864 -0.722  -0.916   0.866  1.06  -0.609  -0.246  1.46   -1.15   0.0132
3  0.257  0.0989  0.0260  0.294  0.404  1.82    0.706 -0.139   0.230  1.58  
4 -0.508 -1.17   -1.15    0.758  0.409 -0.263   0.306  0.109   0.158  1.16  
5  3.39  -1.94   -2.16   -0.446 -0.261  1.04   -0.472  0.310   0.815 -0.591 
# ℹ 11 more variables: x10 <dbl>, x11 <dbl>, x12 <dbl>, x13 <dbl>, x14 <dbl>,
#   x15 <dbl>, x16 <dbl>, x17 <dbl>, x18 <dbl>, x19 <dbl>, x20 <dbl>

Here, only two predictors (x1 and x3) carry signal. The other 18 have true coefficients of zero, and x2 and x4 are near-copies of x1 and x3, so the design is both noisy and collinear.

Ordinary Least Squares Can Become Unstable in This Setting

We can fit an ordinary least squares model using all predictors.

fit_ols <- lm(y ~ ., data = reg_df)

summary(fit_ols)$coefficients[1:10, ]

               Estimate Std. Error    t value     Pr(>|t|)
(Intercept)  2.66029895  0.1567371 16.9730046 1.583942e-37
x1           2.14113426  0.8125815  2.6349778 9.246797e-03
x2          -0.20577872  0.7927870 -0.2595637 7.955361e-01
x3          -2.96519182  0.7852058 -3.7763245 2.244596e-04
x4           1.27004847  0.7762811  1.6360678 1.038034e-01
x5          -0.19551736  0.1688745 -1.1577675 2.486951e-01
x6           0.25753480  0.1462894  1.7604471 8.025473e-02
x7          -0.08695647  0.1523002 -0.5709543 5.688367e-01
x8          -0.16327011  0.1512467 -1.0794950 2.820019e-01
x9          -0.11727406  0.1548390 -0.7573934 4.499352e-01

This model may fit the training data reasonably well, but it is vulnerable to instability because:

several predictors are irrelevant,
some predictors are correlated,
and the model is free to allocate weight aggressively.

This is exactly the kind of situation where regularization becomes useful.
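One way to quantify the collinearity behind those inflated standard errors is the variance inflation factor (VIF), defined as \(1 / (1 - R^2)\) from regressing one predictor on the others. The sketch below is self-contained and uses its own small simulated dataset that mimics the x1/x2 pair from the example; `vif_one()` is a hypothetical helper, and the seed and the choice of three predictors are arbitrary.

```r
set.seed(42)
n  <- 180
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)  # near-copy of x1, as in the simulation above
x5 <- rnorm(n)                 # independent noise predictor
demo_df <- data.frame(x1, x2, x5)

# VIF for one predictor: 1 / (1 - R^2), where R^2 comes from regressing
# that predictor on all the other predictors.
vif_one <- function(name, data) {
  others <- setdiff(names(data), name)
  fit <- lm(reformulate(others, response = name), data = data)
  1 / (1 - summary(fit)$r.squared)
}

vifs <- vapply(names(demo_df), vif_one, numeric(1), data = demo_df)
round(vifs, 1)
```

With a noise standard deviation of 0.2 on the copy, the theoretical VIF for x1 and x2 is about 26, which implies standard errors roughly five times larger than they would be without the collinearity. That is in line with the OLS output above, where the standard errors for x1 to x4 are about five times those for x5 to x9.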

Ridge, Lasso, and Elastic Net Are Easy to Fit with `glmnet`

The glmnet framework made penalized regression workflows broadly accessible in applied work and remains a standard implementation for ridge, lasso, and elastic net models (Friedman et al. 2010).

A practical way to fit regularized regression in R is the glmnet package.

# Stop early with a clear message if glmnet is not installed
required_pkgs <- c("glmnet")
missing_pkgs <- required_pkgs[
  !vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
]
if (length(missing_pkgs) > 0) {
  stop("Missing packages: ", paste(missing_pkgs, collapse = ", "))
}

# glmnet expects a numeric predictor matrix; drop the intercept column
x <- model.matrix(y ~ ., data = reg_df)[, -1]
y_vec <- reg_df$y

Now fit ridge, lasso, and elastic net.

fit_ridge <- glmnet::glmnet(x, y_vec, alpha = 0)
fit_lasso <- glmnet::glmnet(x, y_vec, alpha = 1)
fit_enet  <- glmnet::glmnet(x, y_vec, alpha = 0.5)

These fits trace coefficient paths across many values of \(\lambda\).

Coefficient Shrinkage Plots Make Regularization Visually Intuitive

One of the clearest ways to understand regularization is through coefficient path plots.

These show how coefficients change as the penalty strength varies.

Ridge coefficient paths

plot(fit_ridge, xvar = "lambda", label = TRUE, main = "Ridge Coefficient Paths")

Lasso coefficient paths

plot(fit_lasso, xvar = "lambda", label = TRUE, main = "Lasso Coefficient Paths")

These plots help show the central logic:

ridge shrinks coefficients smoothly
lasso shrinks some coefficients all the way to zero

This is often the most intuitive way to teach the difference between L1 and L2 penalties.

Lambda Controls How Aggressively the Model Is Penalized

The tuning parameter \(\lambda\) is central.

Small \(\lambda\):

weak penalty
more flexible fit
lower bias
higher variance risk

Large \(\lambda\):

strong penalty
stronger shrinkage
higher bias
lower variance

This means the choice of \(\lambda\) directly controls the complexity of the fitted model.

That is why tuning lambda carefully is one of the most important steps in regularized modeling.

Cross-Validation Is the Standard Way to Tune Lambda

In practice, \(\lambda\) is usually chosen using cross-validation.

This asks:

for different candidate values of \(\lambda\),
which one gives the best out-of-sample performance?

That makes cross-validation the practical tool for navigating the bias-variance tradeoff.
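To show what that question means mechanically, here is a hand-rolled K-fold loop in base R. It is a sketch under assumptions, not a description of how any particular package implements cross-validation: `ridge_fit()` and `cv_ridge_mse()` are hypothetical helpers, the model is closed-form ridge, the data are simulated with predictors already on a common scale, and the lambda grid and fold count are arbitrary.

```r
# Closed-form ridge fit with an unpenalized intercept (hypothetical helper).
ridge_fit <- function(x, y, lambda) {
  x_mean <- colMeans(x)
  y_mean <- mean(y)
  x_ctr  <- sweep(x, 2, x_mean)  # center each column
  beta <- drop(solve(crossprod(x_ctr) + lambda * diag(ncol(x)),
                     crossprod(x_ctr, y - y_mean)))
  list(beta = beta, intercept = y_mean - sum(x_mean * beta))
}

# Mean held-out squared error for each candidate lambda (hypothetical helper).
cv_ridge_mse <- function(x, y, lambdas, k = 5) {
  # Assign every row to one of k folds; the same folds are reused for all lambdas
  fold_id <- sample(rep(seq_len(k), length.out = nrow(x)))
  fold_mse <- sapply(lambdas, function(lambda) {
    sapply(seq_len(k), function(fold) {
      test <- fold_id == fold
      fit  <- ridge_fit(x[!test, , drop = FALSE], y[!test], lambda)
      pred <- fit$intercept + x[test, , drop = FALSE] %*% fit$beta
      mean((y[test] - pred)^2)  # error on the held-out fold only
    })
  })
  colMeans(fold_mse)  # average across folds: one value per lambda
}

set.seed(7)
n <- 60
p <- 30
x_demo <- matrix(rnorm(n * p), n, p)
y_demo <- 2 * x_demo[, 1] - 1.5 * x_demo[, 2] + rnorm(n, sd = 2)

lambdas <- 10^seq(-2, 3, length.out = 20)
cv_mse  <- cv_ridge_mse(x_demo, y_demo, lambdas)

best_lambda <- lambdas[which.min(cv_mse)]
best_lambda
```

The key design point is that each candidate lambda is scored only on rows the model did not see during fitting, and that all candidates share the same fold assignment so their errors are comparable.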

Ridge with CV

cv_ridge <- glmnet::cv.glmnet(x, y_vec, alpha = 0)
plot(cv_ridge)
cv_ridge$lambda.min
cv_ridge$lambda.1se

Lasso with CV

cv_lasso <- glmnet::cv.glmnet(x, y_vec, alpha = 1)
plot(cv_lasso)
cv_lasso$lambda.min
cv_lasso$lambda.1se

The two common choices are:

lambda.min: the lambda with the lowest CV error
lambda.1se: a more regularized value within one standard error of the minimum

The second is often preferred when a simpler model is desirable.
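The one-standard-error rule described above can be written out in a few lines. The numbers below are invented for illustration and do not come from the fits in this post; the vector names `cvm` and `cvsd` are chosen to echo the fields of a `cv.glmnet` object, but this sketch does not call `glmnet` and should be read as an illustration of the rule rather than the package's exact code.

```r
# Hypothetical cross-validation summary for six candidate penalties
lambda <- c(0.01, 0.03, 0.10, 0.30, 1.00, 3.00)
cvm    <- c(4.60, 4.35, 4.20, 4.28, 4.55, 5.40)  # mean CV error per lambda
cvsd   <- c(0.30, 0.28, 0.25, 0.26, 0.30, 0.40)  # standard error of each mean

# lambda.min: the penalty with the lowest mean CV error
i_min      <- which.min(cvm)
lambda_min <- lambda[i_min]

# lambda.1se: the largest penalty whose mean CV error is still within
# one standard error of the minimum
threshold  <- cvm[i_min] + cvsd[i_min]
lambda_1se <- max(lambda[cvm <= threshold])

c(lambda_min = lambda_min, lambda_1se = lambda_1se)
# lambda_min = 0.10, lambda_1se = 0.30
```

In this made-up example, tripling the penalty costs very little in estimated error, which is the situation where the more regularized choice is attractive.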

Lasso Produces Sparsity, Which Helps with Feature Selection

A major reason lasso is so popular is that it can set coefficients exactly to zero.

That means it does not only stabilize prediction. It can also help identify a smaller subset of predictors.

coef(cv_lasso, s = "lambda.min")
coef(cv_lasso, s = "lambda.1se")

This is especially useful when:

predictors are numerous,
interpretation matters,
and a sparse model is easier to communicate or deploy.

That said, lasso-based selection should still be interpreted carefully. Selected predictors are not automatically “the true variables.” They are the variables favored under the data, penalty, and tuning structure.

Ridge Helps When Predictors Are Correlated

When predictors are highly correlated, lasso can behave somewhat erratically, sometimes selecting one variable and dropping another similar one.

Ridge handles this more gently.

Because ridge shrinks coefficients without forcing hard zeros, it often distributes weight more smoothly across correlated predictors.

This makes ridge especially useful when:

prediction is the main goal,
collinearity is strong,
and a stable fit matters more than sparse selection.

So the choice between ridge and lasso is partly about the analytic goal:

ridge for stability
lasso for sparsity
elastic net when both matter
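The way ridge distributes weight across correlated predictors can be seen on a single simulated dataset. This is a minimal sketch under assumptions: the data are simulated, ridge is computed in closed form rather than with `glmnet`, and the penalty value of 20 is arbitrary. Only `x1` enters the true model, and `x2` is a near-duplicate of it.

```r
set.seed(2024)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)    # near-duplicate of x1
y  <- 2 * x1 + rnorm(n, sd = 2)  # only x1 enters the true model

x     <- scale(cbind(x1, x2))    # standardize so the penalty treats both alike
y_ctr <- y - mean(y)

lambda <- 20                     # arbitrary penalty strength for illustration

beta_ols   <- drop(solve(crossprod(x), crossprod(x, y_ctr)))
beta_ridge <- drop(solve(crossprod(x) + lambda * diag(2), crossprod(x, y_ctr)))

round(rbind(ols = beta_ols, ridge = beta_ridge), 2)

# How far apart the two coefficients are under each method
gap_ols   <- abs(beta_ols[1] - beta_ols[2])
gap_ridge <- abs(beta_ridge[1] - beta_ridge[2])
c(gap_ols = unname(gap_ols), gap_ridge = unname(gap_ridge))
```

The unpenalized fit is free to assign the two near-identical predictors very different coefficients, while the ridge fit pulls them toward each other and shares the effect between them. For interpretation this means neither ridge coefficient should be read as the effect of that predictor alone.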

Elastic Net Is Often a Good Compromise in Real Data

Elastic net is particularly attractive when predictors are both:

numerous
and correlated

In those settings, pure lasso may be too unstable and pure ridge may be too dense.

Elastic net blends the two ideas.

cv_enet <- glmnet::cv.glmnet(x, y_vec, alpha = 0.5)
plot(cv_enet)
cv_enet$lambda.min

In many applied datasets, elastic net is a practical default because it:

stabilizes estimation
allows some sparsity
and handles grouped correlation better than pure lasso alone

Regularization Is Not Only for Linear Regression

Although ridge, lasso, and elastic net are often introduced in ordinary regression, the underlying idea is much broader.

Regularization appears in:

logistic regression
Cox models
neural networks
matrix factorization
graphical models
and many other ML systems

In deep learning, for example, regularization ideas reappear as:

weight decay
dropout
early stopping
and architecture constraints

That is why regularization is not just one technique. It is a general strategy for taming overly flexible models.

Regularization Does Not “Fix” Bad Data

Regularization is powerful, but it is not magic.

It cannot rescue:

poor outcome definition
unrepresentative sampling
missing-not-at-random problems
data leakage
severe measurement error
or causal confounding

A strongly regularized model can still be wrong if the underlying data or question are wrong.

This is an important lesson in both statistics and ML:

controlling model complexity does not replace careful data design.

A Simple Performance Comparison Helps Reinforce the Point

We can compare out-of-sample performance directly by fitting on a training split and scoring on a held-out test split.

set.seed(20260315)

train_idx <- sample(seq_len(nrow(reg_df)), size = 120)
x_train <- x[train_idx, ]
x_test  <- x[-train_idx, ]
y_train <- y_vec[train_idx]
y_test  <- y_vec[-train_idx]

cv_ridge_tt <- glmnet::cv.glmnet(x_train, y_train, alpha = 0)
cv_lasso_tt <- glmnet::cv.glmnet(x_train, y_train, alpha = 1)

pred_ridge <- predict(cv_ridge_tt, newx = x_test, s = "lambda.min")
pred_lasso <- predict(cv_lasso_tt, newx = x_test, s = "lambda.min")

mse_ridge <- mean((y_test - pred_ridge)^2)
mse_lasso <- mean((y_test - pred_lasso)^2)

tibble::tibble(
  model = c("Ridge", "Lasso"),
  test_mse = c(mse_ridge, mse_lasso)
)

This is the practical point of regularization: not prettier coefficients, but better out-of-sample behavior.

Regularization Is One of the Most Practical Answers to Overfitting

If the bias-variance tradeoff explains the problem, regularization is one of the most important practical answers.

It helps by:

shrinking unstable coefficients
discouraging extreme fits
improving generalization
and making models more robust in high-dimensional settings

That is why regularization is foundational in modern ML.

It is not an optional refinement. It is often part of what makes the model usable at all.

A Practical Checklist for Applied Work

Before fitting a regularized model, ask:

Is overfitting a realistic concern here?
Are predictors numerous relative to sample size?
Is multicollinearity present?
Is prediction or sparse selection the main goal?
Should ridge, lasso, or elastic net be the starting point?
Has lambda been tuned with cross-validation?
Am I interpreting selected variables too confidently?
Would a simpler baseline or stronger regularization generalize better?

These questions usually improve both modeling and interpretation.

Where This Shows Up in AI/ML

Lasso regularization can be used directly in trauma mortality prediction pipelines that source features from the Department of Defense Trauma Registry (DoDTR). Hundreds of injury descriptors, vitals, and procedure codes create the kind of high-dimensional, collinear setting where penalized regression is most useful, and sparse coefficient selection can improve model transportability across military treatment facilities (MTFs). Production clinical ML models, such as those distributed through Epic’s model marketplace, commonly apply L2 regularization (weight decay) to penalize coefficient magnitude and limit overfitting to the training institution’s case mix. When a model is deployed without adequate regularization, or with a lambda tuned only on a single-site training cohort, the coefficients can fit the training noise rather than the signal, and performance can degrade sharply once the patient population shifts. A regularized model shipped from a Level I trauma center to a forward surgical team is not the same as an unregularized model: the former has constrained its ambition; the latter has memorized a hospital it will never see again.

Closing: Regularization Adds Discipline to Flexible Modeling

Regularization remains one of the most useful ideas in machine learning because it confronts a basic danger directly:

models that fit too freely often generalize too poorly.

Ridge regression stabilizes coefficients. Lasso encourages sparsity. Elastic net blends both ideas. Cross-validation helps tune the strength of restraint.

Together, these methods show that better prediction is often not about fitting harder (Hastie et al. 2009; James et al. 2021). It is about fitting more responsibly.

Regularization matters because a model needs enough flexibility to learn the signal, but enough discipline not to memorize the noise.

📚 Go Deeper: Prediction Modeling Toolkit

This post is part of the Prediction Modeling Toolkit — a companion reference with ridge, lasso, and elastic net templates, lambda tuning code, and shrinkage diagnostics for high-dimensional clinical data.


Series Callout


This post is part of a broader Applied Statistics for AI and Clinical Decision-Making Series:

Probability fundamentals for machine learning
Random variables and expectation
Common probability distributions
Central Limit Theorem
Law of Large Numbers
Sampling methods for Biostats and ML
Hypothesis testing in the age of AI
Confidence intervals
Maximum likelihood estimation
Bayesian inference
Linear regression
Logistic regression
Generalized linear models
Analysis of variance
Principal component analysis
Cluster analysis
Time series analysis
Survival analysis
Non-parametric methods
Bias-variance tradeoff
Regularization
Cross-validation
Information theory
Optimization techniques
Linear algebra basics
Calculus for ML
Monte Carlo methods
Dimensionality curse and reduction techniques
Model evaluation metrics
Ensemble methods


References

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2010. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software 33 (1): 1–22.

Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer.

Hoerl, Arthur E., and Robert W. Kennard. 1970. “Ridge Regression: Biased Estimation for Nonorthogonal Problems.” Technometrics 12 (1): 55–67.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning: With Applications in R. 2nd ed. Springer.

Tibshirani, Robert. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society: Series B 58 (1): 267–88.

Zou, Hui, and Trevor Hastie. 2005. “Regularization and Variable Selection via the Elastic Net.” Journal of the Royal Statistical Society: Series B 67 (2): 301–20.