Optimization Essentials: The Engine of Modern AI

Applied Statistics

AI and Clinical Decision-Making

A practical introduction to gradients, gradient descent, SGD, mini-batch updates, and learning-rate ideas that power modern AI model training.

Published: February 15, 2025

Modified: June 9, 2026

Executive Summary

Modern machine learning models do not usually solve themselves.

They are trained.

That training process is, at its core, an optimization problem.

A model begins with unknown parameters. A loss function tells us how poorly the model is performing. An optimizer updates the parameters to reduce that loss.

This is where gradient descent enters the story (Boyd and Vandenberghe 2004; Goodfellow et al. 2016).

Gradient-based optimization is one of the most important ideas in machine learning because it provides a general mechanism for fitting models when closed-form solutions are unavailable or impractical.

These methods matter because they power:

linear and logistic regression fitting under some formulations,
neural network training,
deep learning optimization,
regularized objective functions,
and many modern large-scale predictive systems.

This post introduces:

gradients,
batch gradient descent,
stochastic gradient descent,
mini-batch updates,
learning rates,
and why these ideas remain central even when more advanced optimizers such as Adam are used.

Optimization matters because even the best model architecture is useless if its parameters cannot be trained toward a better solution.

Optimization Is the Hidden Engine of Model Training

Many modeling workflows present the final fitted model as if it simply appears after a function call.

But underneath that function call is often an optimization problem.

At a high level, model training asks:

what parameter values minimize the loss?
how should we move through parameter space?
how fast should those updates happen?
and how do we avoid getting stuck or diverging?

These questions are not only computational. They shape whether a model learns well at all.

This is why optimization is not a side topic in AI/ML. It is the engine of training.

A Loss Function Gives the Model Something to Improve

Optimization requires an objective.

In supervised learning, that objective is often a loss function.

Examples include:

mean squared error for regression
log loss for classification
negative log-likelihood for probabilistic models

The model is trained by choosing parameters that reduce this loss.

That means optimization is never floating in the abstract. It is always tied to a specific objective.

So before asking how gradient descent works, it helps to ask:

what exactly is being minimized?

That is the quantity the optimizer is trying to improve.

The Gradient Points in the Direction of Steepest Increase

The key mathematical object in gradient-based optimization is the gradient.

For a function of one variable, the derivative tells us the slope.

For a function of multiple parameters, the gradient collects the partial derivatives:

\[ \nabla L(\theta) = \left( \frac{\partial L}{\partial \theta_1}, \frac{\partial L}{\partial \theta_2}, \dots, \frac{\partial L}{\partial \theta_p} \right) \]

This vector points in the direction of steepest increase of the loss.

If we want to minimize the loss, we should move in the opposite direction.

That is the central idea of gradient descent.
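As a small check of this idea, the sketch below compares an analytic gradient with a finite-difference approximation and confirms that a short step against the gradient lowers the loss while a step along it raises the loss. The loss `loss_2d`, its gradient `grad_2d`, and the helper `numeric_grad` are hypothetical names introduced only for this illustration.

```r
# Hypothetical two-parameter loss, used only for illustration.
loss_2d <- function(theta) {
  (theta[1] - 1)^2 + 2 * (theta[2] + 0.5)^2
}

# Analytic gradient: the vector of partial derivatives.
grad_2d <- function(theta) {
  c(2 * (theta[1] - 1), 4 * (theta[2] + 0.5))
}

# Central finite-difference approximation of the gradient.
numeric_grad <- function(f, theta, h = 1e-6) {
  vapply(seq_along(theta), function(j) {
    step <- replace(numeric(length(theta)), j, h)
    (f(theta + step) - f(theta - step)) / (2 * h)
  }, numeric(1))
}

theta0 <- c(3, 2)
g <- grad_2d(theta0)
g_num <- numeric_grad(loss_2d, theta0)

# Move a short distance along the gradient and against it.
loss_here <- loss_2d(theta0)
loss_up <- loss_2d(theta0 + 0.01 * g)
loss_down <- loss_2d(theta0 - 0.01 * g)

rbind(analytic = g, numeric = g_num)
c(up = loss_up, here = loss_here, down = loss_down)
```

The same finite-difference comparison is a common way to sanity-check a hand-written gradient before trusting it in a training loop.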

Gradient Descent Updates Parameters Step by Step

The basic gradient descent update is:

\[ \theta^{(t+1)} = \theta^{(t)} - \eta \nabla L(\theta^{(t)}) \]

where:

\(\theta^{(t)}\) is the current parameter vector
\(\eta\) is the learning rate
\(\nabla L(\theta^{(t)})\) is the gradient at the current point

This equation says:

compute the gradient
take a step in the opposite direction
repeat until the loss stabilizes or convergence is reached

That is a surprisingly simple idea. But it powers a huge share of modern machine learning.
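The three steps above can be sketched as a small reusable function. This is a minimal illustration, not a production optimizer: the function name `gradient_descent`, the stopping rule based on the size of the parameter change, and the two-parameter target `c(1, -2)` are all assumptions made for the example.

```r
# Generic gradient descent: repeat "compute gradient, step against it"
# until the parameters stop moving or the step budget runs out.
gradient_descent <- function(grad, theta_init, eta = 0.1,
                             max_steps = 1000, tol = 1e-8) {
  theta <- theta_init
  for (step in seq_len(max_steps)) {
    theta_new <- theta - eta * grad(theta)
    # Stop when the update is smaller than the tolerance.
    if (sqrt(sum((theta_new - theta)^2)) < tol) {
      return(list(theta = theta_new, steps = step, converged = TRUE))
    }
    theta <- theta_new
  }
  list(theta = theta, steps = max_steps, converged = FALSE)
}

# Hypothetical loss: squared distance to the point (1, -2).
target <- c(1, -2)
grad_sq <- function(theta) 2 * (theta - target)

fit <- gradient_descent(grad_sq, theta_init = c(0, 0))
fit
```

Stopping on the change in parameters is only one choice; monitoring the change in the loss or the gradient norm are common alternatives.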

A One-Parameter Example Makes the Idea Concrete

Suppose we want to minimize the simple quadratic function:

\[ L(\theta) = (\theta - 3)^2 \]

Its derivative is:

\[ \frac{dL}{d\theta} = 2(\theta - 3) \]

The minimum clearly occurs at \(\theta = 3\), but we will use gradient descent to get there numerically.

loss_fn <- function(theta) {
  (theta - 3)^2
}

grad_fn <- function(theta) {
  2 * (theta - 3)
}

theta <- -5
eta <- 0.1
n_steps <- 25

gd_path <- tibble::tibble(
  step = 0,
  theta = theta,
  loss = loss_fn(theta)
)

for (i in 1:n_steps) {
  theta <- theta - eta * grad_fn(theta)
  gd_path <- dplyr::bind_rows(
    gd_path,
    tibble::tibble(
      step = i,
      theta = theta,
      loss = loss_fn(theta)
    )
  )
}

gd_path

# A tibble: 26 × 3
    step  theta  loss
   <dbl>  <dbl> <dbl>
 1     0 -5     64   
 2     1 -3.4   41.0 
 3     2 -2.12  26.2 
 4     3 -1.10  16.8 
 5     4 -0.277 10.7 
 6     5  0.379  6.87
 7     6  0.903  4.40
 8     7  1.32   2.81
 9     8  1.66   1.80
10     9  1.93   1.15
# ℹ 16 more rows

Now visualize convergence.

ggplot2::ggplot(gd_path, ggplot2::aes(x = step, y = theta)) +
  ggplot2::geom_line(linewidth = 0.8) +
  ggplot2::geom_point(size = 2) +
  ggplot2::geom_hline(yintercept = 3, linetype = 2) +
  ggplot2::labs(
    title = "Gradient Descent Converging to the Minimum",
    x = "Iteration",
    y = expression(theta)
  ) +
  ggplot2::theme_minimal()

This shows the parameter moving toward the optimal value iteratively.

The Learning Rate Controls How Aggressive the Updates Are

The learning rate, often written \(\eta\), is one of the most important tuning parameters in gradient-based optimization.

It controls the size of each update step.

If the learning rate is too small:

learning can be painfully slow

If the learning rate is too large:

updates may overshoot the minimum
the algorithm may oscillate
or it may diverge entirely

This is why learning rate choice can strongly affect whether training succeeds.

We can compare several learning rates.

run_gd <- function(theta_init, eta, n_steps = 20) {
  theta <- theta_init
  out <- tibble::tibble(step = 0, theta = theta, eta = paste0("eta = ", eta))
  
  for (i in 1:n_steps) {
    theta <- theta - eta * grad_fn(theta)
    out <- dplyr::bind_rows(
      out,
      tibble::tibble(step = i, theta = theta, eta = paste0("eta = ", eta))
    )
  }
  
  out
}

eta_df <- dplyr::bind_rows(
  run_gd(theta_init = -5, eta = 0.02),
  run_gd(theta_init = -5, eta = 0.10),
  run_gd(theta_init = -5, eta = 0.60)
)

ggplot2::ggplot(eta_df, ggplot2::aes(x = step, y = theta, color = eta)) +
  ggplot2::geom_line(linewidth = 0.8) +
  ggplot2::geom_hline(yintercept = 3, linetype = 2) +
  ggplot2::labs(
    title = "Effect of Learning Rate on Convergence",
    x = "Iteration",
    y = expression(theta)
  ) +
  ggplot2::theme_minimal()

This is one of the clearest ways to show why optimization is sensitive to tuning.
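For this particular quadratic, the effect of the learning rate can also be worked out exactly. Substituting the derivative \(2(\theta - 3)\) into the update gives \(\theta^{(t+1)} - 3 = (1 - 2\eta)(\theta^{(t)} - 3)\), so each step multiplies the distance to the minimum by \(1 - 2\eta\). The sketch below checks that numerically. The value `eta = 1.10` is an extra case added here, not one of the learning rates plotted above, to show divergence; the helper name `final_error` is hypothetical.

```r
# Distance from the minimum after n_steps of gradient descent on (theta - 3)^2.
final_error <- function(eta, theta_init = -5, n_steps = 20) {
  theta <- theta_init
  for (i in seq_len(n_steps)) {
    theta <- theta - eta * 2 * (theta - 3)
  }
  abs(theta - 3)
}

etas <- c(0.02, 0.10, 0.60, 1.10)

lr_summary <- data.frame(
  eta = etas,
  multiplier = 1 - 2 * etas,  # per-step factor applied to (theta - 3)
  error_after_20 = vapply(etas, final_error, numeric(1))
)
lr_summary
```

A multiplier between 0 and 1 gives smooth convergence, a multiplier between -1 and 0 overshoots and oscillates while still converging, and a multiplier below -1 diverges. This threshold is specific to this loss; for other losses it depends on the curvature.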

Gradient Descent Can Be Applied to Linear Regression

A familiar loss function is the regression mean squared error:

\[ L(\beta_0, \beta_1) = \frac{1}{n} \sum_{i=1}^n (y_i - (\beta_0 + \beta_1 x_i))^2 \]

Although linear regression has a closed-form solution under ordinary least squares, it is still useful to fit it with gradient descent as a teaching device.

This shows how iterative optimization works in a real model.

n <- 150

opt_df <- tibble::tibble(
  x = rnorm(n, mean = 0, sd = 1)
) |>
  dplyr::mutate(
    y = 2 + 1.8 * x + rnorm(n, mean = 0, sd = 1)
  )

ggplot2::ggplot(opt_df, ggplot2::aes(x = x, y = y)) +
  ggplot2::geom_point(alpha = 0.7) +
  ggplot2::theme_minimal() +
  ggplot2::labs(
    title = "Simulated Regression Data",
    x = "x",
    y = "y"
  )

Now define the loss and gradients.

mse_loss <- function(b0, b1, x, y) {
  mean((y - (b0 + b1 * x))^2)
}

grad_b0 <- function(b0, b1, x, y) {
  -2 * mean(y - (b0 + b1 * x))
}

grad_b1 <- function(b0, b1, x, y) {
  -2 * mean(x * (y - (b0 + b1 * x)))
}
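The two gradient functions above follow from differentiating the mean squared error with respect to each coefficient. As a short worked step, applying the chain rule to each squared residual gives:

```latex
\[
\frac{\partial L}{\partial \beta_0}
  = -\frac{2}{n} \sum_{i=1}^n \bigl( y_i - (\beta_0 + \beta_1 x_i) \bigr),
\qquad
\frac{\partial L}{\partial \beta_1}
  = -\frac{2}{n} \sum_{i=1}^n x_i \bigl( y_i - (\beta_0 + \beta_1 x_i) \bigr)
\]
```

Each expression is \(-2\) times a mean over the data, which is why `grad_b0` and `grad_b1` are written with `mean()`.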

Run batch gradient descent.

b0 <- 0
b1 <- 0
eta <- 0.1
n_steps <- 60

reg_path <- tibble::tibble(
  step = 0,
  b0 = b0,
  b1 = b1,
  loss = mse_loss(b0, b1, opt_df$x, opt_df$y)
)

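for (i in 1:n_steps) {
  # Evaluate both partial derivatives at the current (b0, b1) before updating,
  # so the b1 step does not use an already-updated b0.
  g0 <- grad_b0(b0, b1, opt_df$x, opt_df$y)
  g1 <- grad_b1(b0, b1, opt_df$x, opt_df$y)
  b0 <- b0 - eta * g0
  b1 <- b1 - eta * g1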
  
  reg_path <- dplyr::bind_rows(
    reg_path,
    tibble::tibble(
      step = i,
      b0 = b0,
      b1 = b1,
      loss = mse_loss(b0, b1, opt_df$x, opt_df$y)
    )
  )
}

reg_path |>
  dplyr::slice_tail(n = 5)

# A tibble: 5 × 4
   step    b0    b1  loss
  <dbl> <dbl> <dbl> <dbl>
1    56  1.94  1.87 0.955
2    57  1.94  1.87 0.955
3    58  1.94  1.87 0.955
4    59  1.94  1.87 0.955
5    60  1.94  1.87 0.955

Plot the loss over time.

ggplot2::ggplot(reg_path, ggplot2::aes(x = step, y = loss)) +
  ggplot2::geom_line(linewidth = 0.9) +
  ggplot2::labs(
    title = "Batch Gradient Descent Loss Over Iterations",
    x = "Iteration",
    y = "MSE Loss"
  ) +
  ggplot2::theme_minimal()

This shows the optimizer progressively reducing the loss.
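Because ordinary least squares has a closed-form solution, it offers a convenient check on the iterative fit. The sketch below is self-contained: it simulates data in the same way as above, but the seed, the vector form `b`, and the 200-step budget are assumptions made for this example, so the numbers will not match the output shown earlier.

```r
set.seed(1)  # assumed seed, only so this sketch is reproducible
n <- 150
x <- rnorm(n)
y <- 2 + 1.8 * x + rnorm(n)

b <- c(0, 0)  # (intercept, slope)
eta <- 0.1

for (i in 1:200) {
  resid <- y - (b[1] + b[2] * x)
  # Full-batch MSE gradient for intercept and slope
  grad <- c(-2 * mean(resid), -2 * mean(x * resid))
  b <- b - eta * grad
}

# Closed-form least-squares coefficients for comparison
ols <- unname(coef(lm(y ~ x)))

rbind(gradient_descent = b, closed_form = ols)
```

When the two rows agree to several decimal places, the gradient code and the learning rate are at least behaving sensibly on this problem.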

Batch Gradient Descent Uses the Full Dataset Each Update

The version above is batch gradient descent.

At each iteration, it computes the gradient using the full dataset.

Advantages:

stable updates
exact gradient for the current parameters
conceptually simple

Disadvantages:

computationally expensive for large datasets
can be slow when data are massive

This is one reason machine learning moved toward stochastic and mini-batch variants for large-scale problems.

Stochastic Gradient Descent Uses One Observation at a Time

The stochastic approximation idea underlying SGD goes back to Robbins and Monro and remains one of the core conceptual bridges from classical optimization to modern large-scale learning (Robbins and Monro 1951).

Stochastic gradient descent, or SGD, updates the parameters using just one observation at a time.

Instead of computing a full gradient over all \(n\) points, SGD uses a noisy estimate of the gradient from a single case.

This makes updates:

much cheaper
noisier
often faster in wall-clock terms
and sometimes able to escape shallow local minima or saddle points, because the noise perturbs the path

Conceptually, SGD trades precision in the update for computational speed.

That tradeoff is one of the reasons it became so important in large-scale learning.
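One way to see why a single-case gradient is a reasonable stand-in for the full gradient: averaged over all observations, the per-observation gradients equal the full-batch gradient exactly, so a randomly chosen case gives an unbiased but noisy estimate. The sketch below illustrates this for the regression loss. The seed and the current parameter values `b0 = 0.5`, `b1 = 0.5` are arbitrary assumptions.

```r
set.seed(1)  # assumed seed for reproducibility
n <- 150
x <- rnorm(n)
y <- 2 + 1.8 * x + rnorm(n)

b0 <- 0.5  # arbitrary current parameter values
b1 <- 0.5

resid <- y - (b0 + b1 * x)

# One gradient per observation (rows), one column per parameter
grad_each <- cbind(b0 = -2 * resid, b1 = -2 * x * resid)

# Full-batch gradient of the MSE
grad_full <- c(b0 = -2 * mean(resid), b1 = -2 * mean(x * resid))

# The average single-case gradient matches the full gradient
rbind(mean_of_single = colMeans(grad_each), full_batch = grad_full)

# The spread across cases is the noise SGD introduces at each step
grad_sd <- apply(grad_each, 2, sd)
grad_sd
```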

A Simple SGD Example Shows the Noisier Path

We can implement SGD for the same regression problem.

b0 <- 0
b1 <- 0
eta <- 0.05
n_epochs <- 20

sgd_path <- tibble::tibble(
  epoch = 0,
  b0 = b0,
  b1 = b1,
  loss = mse_loss(b0, b1, opt_df$x, opt_df$y)
)

for (epoch in 1:n_epochs) {
  idx <- sample(seq_len(nrow(opt_df)))
  
  for (i in idx) {
    xi <- opt_df$x[i]
    yi <- opt_df$y[i]
    
    pred_i <- b0 + b1 * xi
    err_i <- yi - pred_i
    
    b0 <- b0 - eta * (-2 * err_i)
    b1 <- b1 - eta * (-2 * xi * err_i)
  }
  
  sgd_path <- dplyr::bind_rows(
    sgd_path,
    tibble::tibble(
      epoch = epoch,
      b0 = b0,
      b1 = b1,
      loss = mse_loss(b0, b1, opt_df$x, opt_df$y)
    )
  )
}

sgd_path

# A tibble: 21 × 4
   epoch    b0    b1  loss
   <dbl> <dbl> <dbl> <dbl>
 1     0  0     0    8.41 
 2     1  1.86  1.64 1.01 
 3     2  1.82  1.79 0.976
 4     3  2.11  1.84 0.983
 5     4  2.04  1.72 0.988
 6     5  1.95  1.93 0.959
 7     6  2.10  1.81 0.982
 8     7  2.07  2.14 1.05 
 9     8  1.97  1.85 0.956
10     9  2.00  1.85 0.959
# ℹ 11 more rows

Plot the loss by epoch.

ggplot2::ggplot(sgd_path, ggplot2::aes(x = epoch, y = loss)) +
  ggplot2::geom_line(linewidth = 0.9) +
  ggplot2::geom_point(size = 2) +
  ggplot2::labs(
    title = "Stochastic Gradient Descent Loss by Epoch",
    x = "Epoch",
    y = "MSE Loss"
  ) +
  ggplot2::theme_minimal()

Compared with batch gradient descent, SGD often looks noisier, but can still converge effectively.

Mini-Batch Gradient Descent Balances Stability and Speed

In modern machine learning, the most common variant is often mini-batch gradient descent.

Instead of using:

all observations, or
only one observation,

mini-batch GD uses a small random subset of observations for each update.

This balances:

computational efficiency
gradient stability
memory constraints
and optimization speed

Mini-batching is especially important in neural network training because full-batch updates are often too costly and pure SGD can be too noisy.

This is why mini-batch training became such a standard in deep learning.
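A minimal sketch of mini-batch gradient descent for the same kind of regression problem is shown below. It is self-contained, so the simulated data, the seed, the batch size of 16, and the number of epochs are all assumptions chosen for illustration rather than recommended settings.

```r
set.seed(1)  # assumed seed for reproducibility
n <- 150
x <- rnorm(n)
y <- 2 + 1.8 * x + rnorm(n)

b <- c(0, 0)      # (intercept, slope)
eta <- 0.05
batch_size <- 16  # illustrative choice
n_epochs <- 30

for (epoch in seq_len(n_epochs)) {
  # Shuffle once per epoch, then cut the shuffled indices into batches
  idx <- sample(n)
  batches <- split(idx, ceiling(seq_along(idx) / batch_size))

  for (batch in batches) {
    resid <- y[batch] - (b[1] + b[2] * x[batch])
    # Gradient estimated from this batch only
    grad <- c(-2 * mean(resid), -2 * mean(x[batch] * resid))
    b <- b - eta * grad
  }
}

b
```

Setting `batch_size` to 1 turns this loop into SGD, and setting it to `n` turns it into batch gradient descent, which makes the three variants easy to compare.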

Learning Rate Schedules Can Improve Training

A fixed learning rate is simple, but often not ideal.

Sometimes we want:

larger steps early in training
smaller steps later as the optimizer approaches a minimum

This is why many workflows use learning rate schedules, such as:

step decay
exponential decay
cosine schedules
warmup strategies

The principle is intuitive:

move quickly when far away, then move more carefully near the solution.

This matters because a learning rate that is too aggressive late in training can prevent clean convergence.
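The four schedule types listed above can each be written as a small function of the step number. The sketch below uses simple textbook forms. The function names and all default values (`eta0`, `drop`, `every`, `rate`, `warmup_steps`) are assumptions for illustration, and deep learning libraries define their own variants.

```r
# Step decay: multiply the rate by `drop` every `every` steps.
step_decay <- function(step, eta0 = 0.1, drop = 0.5, every = 10) {
  eta0 * drop^floor(step / every)
}

# Exponential decay: smooth geometric shrinkage.
exp_decay <- function(step, eta0 = 0.1, rate = 0.05) {
  eta0 * exp(-rate * step)
}

# Cosine schedule: glide from eta0 down to eta_min over total_steps.
cosine_schedule <- function(step, total_steps, eta0 = 0.1, eta_min = 0) {
  eta_min + 0.5 * (eta0 - eta_min) * (1 + cos(pi * step / total_steps))
}

# Linear warmup: ramp up to eta0 over the first warmup_steps updates.
linear_warmup <- function(step, warmup_steps = 5, eta0 = 0.1) {
  eta0 * pmin(1, (step + 1) / warmup_steps)
}

steps <- 0:30
schedules <- data.frame(
  step = steps,
  step_decay = step_decay(steps),
  exp_decay = exp_decay(steps),
  cosine = cosine_schedule(steps, total_steps = 30),
  warmup = linear_warmup(steps)
)
head(schedules)
```

In a training loop, the fixed `eta` would be replaced by a call such as `step_decay(i)` at iteration `i`. Warmup is usually combined with one of the decay schedules rather than used alone.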

Optimization Can Be Visualized as Movement on a Loss Surface

One of the best ways to understand optimization is to visualize it as a path through parameter space.

For a two-parameter regression model, we can compute the loss surface over a grid of \((b_0, b_1)\) values and overlay the optimization path.

b0_grid <- seq(-1, 5, length.out = 80)
b1_grid <- seq(-1, 4, length.out = 80)

surface_df <- expand.grid(b0 = b0_grid, b1 = b1_grid) |>
  tibble::as_tibble() |>
  dplyr::mutate(
    loss = purrr::map2_dbl(
      b0, b1,
      ~ mse_loss(.x, .y, opt_df$x, opt_df$y)
    )
  )

Plot the contour and optimization path.

ggplot2::ggplot(surface_df, ggplot2::aes(x = b0, y = b1, z = loss)) +
  ggplot2::geom_contour(bins = 20) +
  ggplot2::geom_path(
    data = reg_path,
    ggplot2::aes(x = b0, y = b1),
    color = "red",
    linewidth = 0.8
  ) +
  ggplot2::geom_point(
    data = reg_path,
    ggplot2::aes(x = b0, y = b1),
    color = "red",
    size = 1
  ) +
  ggplot2::labs(
    title = "Gradient Descent Path on the Loss Surface",
    x = expression(beta[0]),
    y = expression(beta[1])
  ) +
  ggplot2::theme_minimal()

This is one of the most intuitive ways to show what optimization is doing geometrically.

Optimization Problems Can Be Harder Than This Toy Example

The examples so far are friendly.

Real optimization can be much messier.

Challenges include:

non-convex loss surfaces
saddle points
flat regions
exploding gradients
vanishing gradients
noisy updates
badly scaled features

This is why optimization in deep learning is not just “run gradient descent and wait.” It often requires thoughtful tuning, normalization, and algorithm choice (Goodfellow et al. 2016).

Still, the basic gradient idea remains the foundation.
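To make one of these challenges concrete, the sketch below runs plain gradient descent on a non-convex one-parameter loss with two minima. The loss `loss_nc` is a hypothetical function chosen only because it has this shape; the learning rate and step count are likewise assumptions.

```r
# Hypothetical non-convex loss with two minima, for illustration only.
loss_nc <- function(theta) theta^4 - 3 * theta^2 + theta
grad_nc <- function(theta) 4 * theta^3 - 6 * theta + 1

# Plain gradient descent from a given starting point.
descend <- function(theta, eta = 0.01, n_steps = 500) {
  for (i in seq_len(n_steps)) {
    theta <- theta - eta * grad_nc(theta)
  }
  theta
}

theta_left <- descend(-2)   # start on the left side
theta_right <- descend(2)   # start on the right side

c(left = theta_left, right = theta_right)
c(loss_left = loss_nc(theta_left), loss_right = loss_nc(theta_right))
```

Both runs stop where the gradient is essentially zero, but they end in different minima with different loss values. The only difference between them is the starting point, which is why initialization matters on non-convex surfaces.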

Feature Scaling Often Helps Gradient-Based Optimization

A very practical lesson is that optimization behaves better when predictors are on comparable scales.

If one feature has a much larger scale than another, the loss surface can become stretched and poorly conditioned.

This makes optimization slower and less stable.

That is why standardization is often helpful before gradient-based training.

Feature scaling does not change the conceptual model, but it can dramatically improve how efficiently the optimizer moves through parameter space.

This is especially important in:

linear models fit iteratively
neural networks
regularized optimization
and distance-sensitive ML methods
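The stretching described above can be quantified with the condition number of the Hessian of the loss. For mean squared error with design matrix \(X\), the Hessian is \(\frac{2}{n} X^\top X\), and a large condition number means some directions are far steeper than others. The sketch below compares raw and standardized predictors. The two simulated features, their scales, and the seed are assumptions, and the intercept column is omitted to keep the example short.

```r
set.seed(1)  # assumed seed for reproducibility
n <- 200
x1 <- rnorm(n, mean = 0, sd = 1)    # a feature on a unit scale
x2 <- rnorm(n, mean = 0, sd = 100)  # a feature on a much larger scale

X_raw <- cbind(x1, x2)
X_std <- scale(X_raw)  # center each column and divide by its standard deviation

# Hessian of the MSE loss with respect to the coefficients
hess <- function(X) 2 * crossprod(X) / nrow(X)

cond_raw <- kappa(hess(X_raw), exact = TRUE)
cond_std <- kappa(hess(X_std), exact = TRUE)

c(raw = cond_raw, standardized = cond_std)
```

A condition number near 1 means the contours of the loss are close to circular, so a single learning rate works well in every direction. With the raw features, a learning rate small enough to be stable along the steep direction is very slow along the flat one.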

SGD Is the Bridge to Modern Deep Learning

Why is SGD so central in AI/ML?

Because modern models can involve:

millions of parameters
huge training datasets
and objectives that cannot be solved analytically

In those settings, batch-style exact optimization is often too slow.

SGD and mini-batch methods make training feasible.

That is why these techniques are not just classroom algorithms. They are the operational core of large-scale model training.

Neural networks, in particular, depend heavily on stochastic optimization.

Advanced Optimizers Extend the Same Core Logic

Methods such as Adam retain the same update logic while adapting step sizes using running moments of the gradients (Kingma and Ba 2015).

Modern optimizers such as:

Momentum
RMSProp
Adam

do not replace gradient descent conceptually. They extend it.

These methods still use gradients, but they add mechanisms such as:

adaptive step sizes
momentum accumulation
running averages of gradient information

This can improve convergence, especially in large, noisy, or poorly conditioned problems.

But the foundation remains the same:

compute gradients
update parameters
reduce the loss

That is why understanding plain gradient descent is still essential.
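To show how these extensions sit on top of the same update, the sketch below applies momentum and Adam to the one-parameter quadratic \((\theta - 3)^2\). It is a simplified illustration, not a reference implementation. The Adam moment updates and bias corrections follow the form in Kingma and Ba (2015), but the step size `eta = 0.1` is an assumption chosen for this toy problem rather than the paper's default, and the function names are hypothetical.

```r
grad_fn <- function(theta) 2 * (theta - 3)  # gradient of (theta - 3)^2

# Gradient descent with momentum: accumulate past gradients in a velocity.
momentum_gd <- function(theta, eta = 0.05, beta = 0.9, n_steps = 300) {
  velocity <- 0
  for (t in seq_len(n_steps)) {
    velocity <- beta * velocity + grad_fn(theta)
    theta <- theta - eta * velocity
  }
  theta
}

# Adam: running averages of the gradient (m) and squared gradient (v),
# with bias correction for their zero initialization.
adam <- function(theta, eta = 0.1, beta1 = 0.9, beta2 = 0.999,
                 eps = 1e-8, n_steps = 500) {
  m <- 0
  v <- 0
  for (t in seq_len(n_steps)) {
    g <- grad_fn(theta)
    m <- beta1 * m + (1 - beta1) * g
    v <- beta2 * v + (1 - beta2) * g^2
    m_hat <- m / (1 - beta1^t)
    v_hat <- v / (1 - beta2^t)
    theta <- theta - eta * m_hat / (sqrt(v_hat) + eps)
  }
  theta
}

theta_momentum <- momentum_gd(-5)
theta_adam <- adam(-5)
c(momentum = theta_momentum, adam = theta_adam)
```

In both functions the last line of the loop is still "parameter minus step size times a gradient-based direction". Only the way that direction is built from past gradients differs.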

Optimization Is Not Just a Technical Detail

One of the biggest mistakes in applied ML is to treat optimization as if it were only a software implementation detail.

It is not.

Optimization affects:

whether the model converges
how stable the estimates are
how long training takes
and whether the final fitted solution is actually useful

This is especially important when comparing model architectures.

Sometimes a “better” model fails in practice because its training dynamics are poor. Sometimes a simpler model succeeds because it is easier to optimize reliably.

That is why optimization deserves conceptual attention, not only code execution.

A Practical Checklist for Applied Work

Before training a gradient-based model, ask:

What loss function is being optimized?
Are the gradients available analytically or through automatic differentiation?
Is the learning rate appropriate?
Should training use batch, stochastic, or mini-batch updates?
Are the features scaled appropriately?
Is the loss decreasing in a stable way?
Would a learning rate schedule help?
Is a more advanced optimizer such as Adam warranted?

These questions often matter as much as the model architecture itself.

Where This Shows Up in AI/ML

The Adam optimizer, a gradient descent variant that uses adaptive per-parameter learning rates, is a common default training algorithm for clinical NLP models of the kind applied to trauma documentation, including ICD code prediction from free-text operative notes and injury severity extraction from TCCC cards. When the learning rate is set too high during fine-tuning of a pretrained model on a small trauma registry dataset, the optimizer can overshoot the loss minimum and produce erratic weight updates that overwrite the pretrained representations rather than refine them, a failure mode closely related to catastrophic forgetting. Clinicians who see a model that performed well in development behave erratically in deployment often cannot trace the problem to optimizer misconfiguration, but that can be what happened upstream.

Closing: Optimization Turns Models into Trained Systems

Optimization remains one of the most important ideas in modern AI because it is what turns a model specification into a trained system.

Gradient descent provides the core logic. SGD and mini-batch methods make large-scale learning practical. Learning rates determine how updates behave. Advanced optimizers extend the same principles to more difficult training settings.

This is why optimization is not merely background machinery. It is a central part of how modern models actually learn.

Optimization matters because building a model is only the beginning, and training it well is what makes the model useful.

📚 Go Deeper: Prediction Modeling Toolkit

This post is part of the Prediction Modeling Toolkit — a companion reference with gradient descent implementation templates, learning rate selection guidance, and loss surface diagnostic scaffolds.


Series Callout

This post is part of a broader Applied Statistics for AI and Clinical Decision-Making Series:

Probability fundamentals for machine learning
Random variables and expectation
Common probability distributions
Central Limit Theorem
Law of Large Numbers
Sampling methods for Biostats and ML
Hypothesis testing in the age of AI
Confidence intervals
Maximum likelihood estimation
Bayesian inference
Linear regression
Logistic regression
Generalized linear models
Analysis of variance
Principal component analysis
Cluster analysis
Time series analysis
Survival analysis
Non-parametric methods
Bias-variance tradeoff
Regularization
Cross-validation
Information theory
Optimization techniques
Linear algebra basics
Calculus for ML
Monte Carlo methods
Dimensionality curse and reduction techniques
Model evaluation metrics
Ensemble methods

References

Boyd, Stephen, and Lieven Vandenberghe. 2004. Convex Optimization. Cambridge University Press.

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.

Kingma, Diederik P., and Jimmy Ba. 2015. “Adam: A Method for Stochastic Optimization.” International Conference on Learning Representations.

Robbins, Herbert, and Sutton Monro. 1951. “A Stochastic Approximation Method.” The Annals of Mathematical Statistics 22 (3): 400–407.