A practical introduction to ridge, lasso, elastic net, and cross-validated penalty tuning for controlling overfitting in AI and applied statistics.
Published
November 15, 2024
Modified
June 9, 2026
Executive Summary
One of the most common reasons predictive models fail is that they become too eager.
They chase noise. They react too strongly to quirks of the training sample. They produce coefficients that look impressive in-sample but behave poorly out-of-sample.
Regularization adds structure and restraint to model fitting. Instead of asking only:
which coefficients minimize training error?
it also asks:
how large should those coefficients be allowed to become?
how much complexity is justified by the data?
and how can we stabilize prediction in high-dimensional settings?
This matters especially when:
predictors are numerous,
collinearity is present,
signal is weak relative to noise,
or the model is vulnerable to overfitting.
This post introduces:
L2 regularization, or ridge regression,
L1 regularization, or lasso,
elastic net as a hybrid approach,
tuning lambda with cross-validation,
and coefficient shrinkage as a visual way to understand model behavior.
Regularization matters because a model that fits too freely often learns the sample better than it learns the signal.
Regularization Is a Response to Model Instability
Ordinary least squares estimates coefficients by minimizing residual error.
That works well in some settings, but it can become unstable when:
predictors are highly correlated,
the number of predictors is large,
noise is substantial,
or the sample size is limited relative to model complexity.
In those cases, unconstrained fitting can produce coefficient estimates that are:
too large,
too variable,
and too sensitive to small changes in the data.
Regularization addresses this by penalizing model complexity.
This often increases bias slightly, but reduces variance enough to improve generalization.
That is why regularization is one of the clearest practical responses to the bias-variance tradeoff.
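A small simulation can make this instability concrete. The sketch below is illustrative only: the sample size, the correlation of 0.98, the penalty value of 5, and the helper names `ridge_coef` and `one_draw` are all assumptions chosen for the demonstration, not values from any real analysis.

```r
# Simulate many small samples with two highly correlated predictors and
# compare how much the unpenalized and penalized estimates of beta_1 vary.
set.seed(1)

ridge_coef <- function(X, y, lambda) {
  # Closed-form solution of sum((y - X b)^2) + lambda * sum(b^2);
  # lambda = 0 gives ordinary least squares
  as.numeric(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
}

one_draw <- function(n = 50, rho = 0.98, lambda = 5) {
  x1 <- rnorm(n)
  x2 <- rho * x1 + sqrt(1 - rho^2) * rnorm(n)   # correlation with x1 is about rho
  y  <- x1 + x2 + rnorm(n)                       # true coefficients: 1 and 1
  X  <- scale(cbind(x1, x2), center = TRUE, scale = FALSE)
  y  <- y - mean(y)                              # centered, so no intercept is needed
  c(ols = ridge_coef(X, y, 0)[1], ridge = ridge_coef(X, y, lambda)[1])
}

draws <- t(replicate(500, one_draw()))
apply(draws, 2, sd)     # spread of the beta_1 estimate across samples
apply(draws, 2, mean)   # average estimate; the true value is 1
```

The penalized estimate should vary much less from sample to sample, at the cost of being pulled slightly toward zero on average.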
Regularization Changes the Optimization Problem
In ordinary regression, the fitting objective is usually something like:
\[
\min_{\beta} \sum_{i=1}^n (y_i - \hat{y}_i)^2
\]
Regularization modifies that objective by adding a penalty term.
The model is no longer rewarded only for fit. It is also penalized for coefficient magnitude.
That means the fitting problem becomes:
\[
\text{loss} + \text{penalty}
\]
This is the key conceptual move.
Instead of asking for the best-fitting unconstrained model, regularization asks for the best-fitting model subject to restraint.
That restraint is what stabilizes the fit.
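The "loss plus penalty" idea can be written directly as a function. This is a minimal sketch, assuming a squared-error loss and simulated data; the function name `penalized_loss` and the penalty strength of 50 are hypothetical choices, and a general-purpose optimizer is used here only to make the objective visible, not because it is how penalized models are fit in practice.

```r
# Penalized objective: residual sum of squares plus lambda times a penalty.
penalized_loss <- function(beta, X, y, lambda, penalty = c("l2", "l1")) {
  penalty <- match.arg(penalty)
  rss <- sum((y - X %*% beta)^2)
  pen <- if (penalty == "l2") sum(beta^2) else sum(abs(beta))
  rss + lambda * pen
}

set.seed(2)
X <- matrix(rnorm(100 * 3), 100, 3)
y <- X %*% c(2, 0, -1) + rnorm(100)

# Same data, two objectives: lambda = 0 is plain least squares.
fit_free <- optim(rep(0, 3), penalized_loss, X = X, y = y, lambda = 0,  method = "BFGS")
fit_pen  <- optim(rep(0, 3), penalized_loss, X = X, y = y, lambda = 50, method = "BFGS")

round(fit_free$par, 3)
round(fit_pen$par, 3)   # same signs, smaller magnitudes
```

The only difference between the two fits is the penalty term, which is what pulls the second set of coefficients toward zero.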
L2 Regularization Shrinks Coefficients Smoothly
L2 regularization, commonly called ridge regression, adds a penalty based on the squared size of the coefficients.
\[
\min_{\beta} \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^p \beta_j^2
\]
Here, \(\lambda\) controls how strongly the penalty is applied.
As \(\lambda\) increases:
coefficients shrink more strongly toward zero
the model becomes less flexible
variance is reduced
but bias may increase
Ridge regression is especially useful when predictors are correlated and the analyst wants to stabilize estimation without necessarily removing variables.
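For a linear model, the ridge objective above has a closed-form solution, which makes the smooth shrinkage easy to see in a few lines of base R. The data, the true coefficients, and the grid of lambda values below are assumptions made for illustration.

```r
set.seed(3)
n <- 100; p <- 4
X <- scale(matrix(rnorm(n * p), n, p))          # standardized predictors
y <- as.numeric(X %*% c(3, -2, 1, 0) + rnorm(n))
y <- y - mean(y)                                # centered, so no intercept is needed

ridge_coef <- function(X, y, lambda) {
  # Minimizes sum((y - X b)^2) + lambda * sum(b^2)
  as.numeric(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
}

lambdas <- c(0, 10, 100, 1000)
path <- sapply(lambdas, function(l) ridge_coef(X, y, l))
colnames(path) <- paste0("lambda=", lambdas)
round(path, 3)

# The overall size of the coefficient vector falls as lambda grows,
# but no coefficient is set exactly to zero.
colSums(path^2)
```

Each column is the coefficient vector at one penalty strength, so reading left to right shows the shrinkage.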
L1 Regularization Encourages Sparsity
L1 regularization, commonly called lasso, adds a penalty based on the absolute values of the coefficients.
\[
\min_{\beta} \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^p |\beta_j|
\]
This penalty has a different effect from ridge.
Instead of only shrinking coefficients, lasso often drives some coefficients exactly to zero.
That means lasso performs both:
shrinkage
and variable selection
This is one reason lasso is so popular in high-dimensional settings. It can produce a simpler, sparser model while still controlling overfitting.
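The reason lasso produces exact zeros is easiest to see in the special case where the predictors are orthonormal. There, the lasso solution is the least squares estimate passed through a soft-thresholding rule, while ridge only rescales it. The sketch below assumes that special case; the helper name `soft_threshold`, the simulated coefficients, and the choice of lambda = 1 are illustrative.

```r
soft_threshold <- function(z, t) sign(z) * pmax(abs(z) - t, 0)

set.seed(4)
n <- 200
X <- qr.Q(qr(matrix(rnorm(n * 5), n, 5)))       # orthonormal columns: t(X) %*% X = I
beta_true <- c(4, -3, 0.3, 0, 0)
y <- as.numeric(X %*% beta_true + rnorm(n, sd = 0.2))

b_ols <- as.numeric(crossprod(X, y))            # least squares estimate when X is orthonormal

lambda <- 1
# Minimizer of sum((y - X b)^2) + lambda * sum(abs(b)) for orthonormal X
b_lasso <- soft_threshold(b_ols, lambda / 2)
# Minimizer of sum((y - X b)^2) + lambda * sum(b^2) for orthonormal X
b_ridge <- b_ols / (1 + lambda)

round(cbind(ols = b_ols, lasso = b_lasso, ridge = b_ridge), 3)
```

Any least squares estimate smaller in absolute value than lambda / 2 becomes exactly zero under the lasso rule, whereas the ridge rule shrinks every coefficient by the same factor and leaves none at zero.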
Elastic Net Combines L1 and L2 Penalties
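Sometimes ridge is too soft and lasso is too aggressive. Elastic net applies both penalties at once, with a mixing weight that sets the balance between them.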
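In the same notation as the ridge and lasso objectives, one common way to write the elastic net objective is shown below. The mixing weight \(\alpha\) is a symbol introduced here for illustration; software may scale the terms differently (glmnet, for example, uses a rescaled version of this form), so lambda values are not directly comparable across parameterizations.
\[
\min_{\beta} \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \lambda \left[ \alpha \sum_{j=1}^p |\beta_j| + (1 - \alpha) \sum_{j=1}^p \beta_j^2 \right]
\]
Setting \(\alpha = 1\) recovers the lasso penalty, and setting \(\alpha = 0\) recovers the ridge penalty.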
Consider an unpenalized baseline, such as ordinary least squares fit on every available predictor. This model may fit the training data reasonably well, but it is vulnerable to instability because:
several predictors are irrelevant,
some predictors are correlated,
and the model is free to allocate weight aggressively.
This is exactly the kind of situation where regularization becomes useful.
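A simulated version of this situation might look like the sketch below. Everything here is an assumption made for illustration (30 predictors, 5 correlated and relevant, 25 irrelevant, 60 training rows, and the helper names `make_data` and `mse`); it is not a reconstruction of any particular dataset.

```r
set.seed(5)
make_data <- function(n, p = 30, rho = 0.9) {
  z <- rnorm(n)                                  # shared factor that induces correlation
  X <- matrix(rnorm(n * p), n, p)
  X[, 1:5] <- sqrt(rho) * z + sqrt(1 - rho) * X[, 1:5]   # first 5 columns are correlated
  beta <- c(rep(1, 5), rep(0, p - 5))            # remaining 25 predictors are irrelevant
  data.frame(y = as.numeric(X %*% beta + rnorm(n, sd = 3)), X)
}
train <- make_data(60)
test  <- make_data(2000)

# Unpenalized baseline using every predictor
ols <- lm(y ~ ., data = train)

mse <- function(fit, d) mean((d$y - predict(fit, newdata = d))^2)
c(train = mse(ols, train), test = mse(ols, test))
```

A large gap between the training and test error is the symptom that a penalty is meant to reduce.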
Ridge, Lasso, and Elastic Net Are Easy to Fit with glmnet
The glmnet framework made penalized regression workflows broadly accessible in applied work and remains a standard implementation for ridge, lasso, and elastic net models (Friedman et al. 2010).
A practical way to fit regularized regression in R is the glmnet package.
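As a minimal sketch of what that looks like, the three penalties differ only in the `alpha` argument of `glmnet()`. The simulated `x` and `y`, the object names, and the lambda value of 0.1 are hypothetical and chosen only to keep the example self-contained.

```r
library(glmnet)

set.seed(6)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), n, p)                 # glmnet expects a numeric matrix
y <- as.numeric(x[, 1:3] %*% c(2, -1.5, 1) + rnorm(n))

ridge <- glmnet(x, y, alpha = 0)      # L2 penalty only
lasso <- glmnet(x, y, alpha = 1)      # L1 penalty only (the default)
enet  <- glmnet(x, y, alpha = 0.5)    # equal mix of the two

# Coefficient paths: each curve is one predictor, traced across lambda
plot(lasso, xvar = "lambda", label = TRUE)

# Coefficients at one particular (hypothetical) lambda value
coef(lasso, s = 0.1)
```

Each call fits the model over a whole sequence of lambda values, which is why the path plot is available without refitting. By default glmnet standardizes the predictors internally and reports coefficients on the original scale.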
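Cross-validation with cv.glmnet() reports two commonly used values of lambda:
lambda.min: the value that minimizes cross-validated error
lambda.1se: the largest, and so most regularized, value whose cross-validated error is within one standard error of the minimum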
The second is often preferred when a simpler model is desirable.
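A cross-validated lasso fit might be sketched as follows. The data are simulated and the object name `cv_lasso` is simply a convenient label; the number of folds is an assumption, not a recommendation.

```r
library(glmnet)

set.seed(7)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- as.numeric(x[, 1:3] %*% c(2, -1.5, 1) + rnorm(n))

cv_lasso <- cv.glmnet(x, y, alpha = 1, nfolds = 10)

cv_lasso$lambda.min   # lambda with the lowest cross-validated error
cv_lasso$lambda.1se   # largest lambda within one standard error of that minimum
plot(cv_lasso)        # CV error curve with both values marked

# Predictions at the more conservative choice
head(predict(cv_lasso, newx = x, s = "lambda.1se"))
```

Because the folds are assigned at random, both values can shift somewhat from run to run unless the seed or the fold assignment is fixed.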
Lasso Produces Sparsity, Which Helps with Feature Selection
A major reason lasso is so popular is that it can set coefficients exactly to zero.
That means it does not only stabilize prediction. It can also help identify a smaller subset of predictors.
coef(cv_lasso, s = "lambda.min")
coef(cv_lasso, s = "lambda.1se")
This is especially useful when:
predictors are numerous,
interpretation matters,
and a sparse model is easier to communicate or deploy.
That said, lasso-based selection should still be interpreted carefully. Selected predictors are not automatically “the true variables.” They are the variables favored under the data, penalty, and tuning structure.
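One rough way to check how much to trust a selected set is to refit the lasso on bootstrap resamples and count how often each predictor survives. The sketch below is an illustrative diagnostic under assumed settings (simulated data, 100 resamples, a penalty held fixed at `lambda.1se` from one initial cross-validation); it is not a formal inference procedure.

```r
library(glmnet)

set.seed(8)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), n, p)
x[, 2] <- x[, 1] + rnorm(n, sd = 0.3)           # predictor 2 is a near-copy of predictor 1
y <- as.numeric(x[, 1] + x[, 3] + rnorm(n))     # only predictors 1 and 3 enter the outcome

# Hold the penalty fixed so resamples differ only in the data
lambda_fixed <- cv.glmnet(x, y, alpha = 1)$lambda.1se

selected <- replicate(100, {
  idx <- sample(n, replace = TRUE)              # bootstrap resample of the rows
  fit <- glmnet(x[idx, ], y[idx], alpha = 1, lambda = lambda_fixed)
  as.numeric(as.matrix(coef(fit))[-1, 1] != 0)  # 1 if the predictor was kept
})

round(rowMeans(selected), 2)                    # selection frequency per predictor
```

Predictors with selection frequencies well below 1 are being chosen partly by chance, and a predictor that is a near-copy of a relevant one may be selected often even though it does not enter the outcome directly.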
Ridge Helps When Predictors Are Correlated
When predictors are highly correlated, lasso can behave somewhat erratically, sometimes selecting one variable and dropping another similar one.
Ridge handles this more gently.
Because ridge shrinks coefficients without forcing hard zeros, it often distributes weight more smoothly across correlated predictors.
This makes ridge especially useful when:
prediction is the main goal,
collinearity is strong,
and a stable fit matters more than sparse selection.
So the choice between ridge and lasso is partly about the analytic goal:
ridge for stability
lasso for sparsity
elastic net when both matter
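The contrast can be seen on a toy example with two near-duplicate predictors. The data, the variable names, and the shared penalty value of 0.5 are assumptions for illustration; the same numeric lambda does not mean the same effective strength for the two penalties, so the comparison is qualitative.

```r
library(glmnet)

set.seed(9)
n <- 200
z  <- rnorm(n)
x1 <- z + rnorm(n, sd = 0.05)                   # x1 and x2 are near-duplicates
x2 <- z + rnorm(n, sd = 0.05)
x3 <- rnorm(n)                                  # independent noise predictor
x  <- cbind(x1, x2, x3)
y  <- 2 * z + rnorm(n)

ridge <- glmnet(x, y, alpha = 0)
lasso <- glmnet(x, y, alpha = 1)

# Coefficients from each fit at a hypothetical penalty value
cbind(ridge = as.matrix(coef(ridge, s = 0.5))[, 1],
      lasso = as.matrix(coef(lasso, s = 0.5))[, 1])
```

Ridge tends to give `x1` and `x2` nearly equal weight, while lasso is free to load most of the weight on one of them, and which one it favors can change with small changes in the data.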
Elastic Net Is Often a Good Compromise in Real Data
Elastic net is particularly attractive when predictors are both:
numerous
and correlated
In those settings, pure lasso may be too unstable and pure ridge may be too dense.
This is the practical point of regularization: not prettier coefficients, but better out-of-sample behavior.
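A sketch of how that comparison might be run: tune lambda by cross-validation for several values of the mixing parameter, then score each model on held-out data alongside an unpenalized baseline. The simulated data, the grid of alpha values, and the helper name `make_data` are all assumptions made for the example.

```r
library(glmnet)

set.seed(10)
make_data <- function(n, p = 40, rho = 0.8) {
  z <- rnorm(n)
  x <- matrix(rnorm(n * p), n, p)
  x[, 1:10] <- sqrt(rho) * z + sqrt(1 - rho) * x[, 1:10]  # a block of correlated predictors
  beta <- c(rep(0.5, 10), rep(0, p - 10))
  list(x = x, y = as.numeric(x %*% beta + rnorm(n, sd = 2)))
}
train <- make_data(80)
test  <- make_data(2000)

# Reuse the same folds for every alpha so the CV errors are comparable
foldid <- sample(rep(1:10, length.out = nrow(train$x)))
alphas <- c(0, 0.25, 0.5, 0.75, 1)
cv_fits <- lapply(alphas, function(a) {
  cv.glmnet(train$x, train$y, alpha = a, foldid = foldid)
})

# Held-out error for each penalized model and for unpenalized least squares
test_mse <- sapply(cv_fits, function(f) {
  mean((test$y - predict(f, newx = test$x, s = "lambda.min"))^2)
})
ols <- lm(y ~ ., data = data.frame(y = train$y, train$x))
ols_mse <- mean((test$y - predict(ols, newdata = data.frame(test$x)))^2)

round(c(setNames(test_mse, paste0("alpha=", alphas)), ols = ols_mse), 2)
```

The quantity to compare is the held-out error, not the size or appearance of the coefficients.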
Regularization Is One of the Most Practical Answers to Overfitting
If the bias-variance tradeoff explains the problem, regularization is one of the most important practical answers.
It helps by:
shrinking unstable coefficients
discouraging extreme fits
improving generalization
and making models more robust in high-dimensional settings
That is why regularization is foundational in modern ML.
It is not an optional refinement. It is often part of what makes the model usable at all.
A Practical Checklist for Applied Work
Before fitting a regularized model, ask:
Is overfitting a realistic concern here?
Are predictors numerous relative to sample size?
Is multicollinearity present?
Is prediction or sparse selection the main goal?
Should ridge, lasso, or elastic net be the starting point?
Has lambda been tuned with cross-validation?
Am I interpreting selected variables too confidently?
Would a simpler baseline or stronger regularization generalize better?
These questions usually improve both modeling and interpretation.
Note: Where This Shows Up in AI/ML
Lasso regularization is used in trauma mortality prediction pipelines that source features from the Department of Defense Trauma Registry (DoDTR), where hundreds of injury descriptors, vitals, and procedure codes create the kind of high-dimensional, collinear setting in which a sparsity penalty is useful, and where sparse coefficient selection can improve model transportability across military treatment facilities (MTFs). Production clinical ML models, including those distributed through Epic’s model marketplace, commonly apply L2 regularization (weight decay) to penalize coefficient magnitude and control overfitting to the training institution’s case mix. When a model is deployed without adequate regularization, or with a lambda tuned only on a single-site training cohort, the coefficients can fit the training noise rather than the signal, and performance can degrade sharply once the patient population shifts. A regularized model shipped from a Level I trauma center to a forward surgical team is not the same as an unregularized one: the former has constrained its ambition; the latter has memorized a hospital it will never see again.
Closing: Regularization Adds Discipline to Flexible Modeling
Regularization remains one of the most useful ideas in machine learning because it confronts a basic danger directly:
models that fit too freely often generalize too poorly.
Ridge regression stabilizes coefficients. Lasso encourages sparsity. Elastic net blends both ideas. Cross-validation helps tune the strength of restraint.
Together, these methods show that better prediction is often not about fitting harder (Hastie et al. 2009; James et al. 2021). It is about fitting more responsibly.
Regularization matters because a model needs enough flexibility to learn the signal, but enough discipline not to memorize the noise.
This post is part of the Prediction Modeling Toolkit — a companion reference with ridge, lasso, and elastic net templates, lambda tuning code, and shrinkage diagnostics for high-dimensional clinical data.
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2010. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software 33 (1): 1–22.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer.
Hoerl, Arthur E., and Robert W. Kennard. 1970. “Ridge Regression: Biased Estimation for Nonorthogonal Problems.” Technometrics 12 (1): 55–67.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning: With Applications in R. 2nd ed. Springer.
Tibshirani, Robert. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society: Series B 58 (1): 267–88.
Zou, Hui, and Trevor Hastie. 2005. “Regularization and Variable Selection via the Elastic Net.” Journal of the Royal Statistical Society: Series B 67 (2): 301–20.