Adam Optimizer

Photo by Koushik Chowdavarapu on Unsplash
When training neural networks, choosing a good optimizer is critically important. Adam is one of the most commonly used optimizers, to the point that it has almost become the default choice. Adam is built upon the foundations of SGD, Momentum, and RMSprop. By revisiting the evolution of these methods, we can better understand the principles behind Adam.

What are Optimizers?

In machine learning, training a model is essentially an optimization problem. The goal is to:

Find a set of parameters \theta that minimizes the loss function J(\theta).

This is a classic minimization problem in mathematics:

\displaystyle\min_\theta J(\theta)

While this is similar to many traditional optimization problems, the loss function J(\theta) in machine learning is often high-dimensional, nonlinear, and non-convex. Moreover, the parameter space is typically very large.

Since we cannot analytically solve for the minimum of such complex functions, we rely on iterative methods like gradient descent to approximate the solution. An optimizer is essentially an algorithm that determines how the parameters should be updated at each step. It plays a central role in the training process.

For gradient descent-based optimizers, the main tasks are:

  • Compute the gradient: Based on the current parameters \theta_t, calculate the gradient of the loss function \nabla_\theta J(\theta_t).
  • Compute the update: Use the gradient and internal strategies (such as Adam) to compute the direction and magnitude of the update.
  • Update the parameters: Apply the new parameters \theta_{t+1} back to the model and continue to the next step.

The job of an optimizer is not to learn directly, but to provide an efficient mechanism for updating the model’s parameters during learning. Given this role, you might think of it as a learner or converger, but from a mathematical perspective, the core task is still minimizing a function, which is precisely the focus of the field of optimization.
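The three steps above can be sketched with plain gradient descent on a toy least-squares problem (a minimal NumPy sketch; the data, dimensions, and step count are made up for illustration):

```python
import numpy as np

# Toy least-squares problem: J(theta) = ||X @ theta - y||^2 / (2n)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true

theta = np.zeros(3)                        # initial parameters
lr = 0.1                                   # learning rate (alpha)
for step in range(500):
    grad = X.T @ (X @ theta - y) / len(y)  # 1) compute the gradient
    update = lr * grad                     # 2) compute the update
    theta = theta - update                 # 3) apply the new parameters
```

After a few hundred iterations, theta closely approaches theta_true.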

Stochastic Gradient Descent (SGD) Optimizer

Stochastic Gradient Descent (SGD) is the most fundamental optimization method. The core idea is simple:

Use the gradient of the loss function with respect to the parameters to take a step in the opposite direction, thereby reducing the loss.

The update formula is as follows:

\theta = \theta - \alpha \nabla_\theta J(\theta)

where:

  • \theta: the model parameters.
  • \alpha: the learning rate.
  • \nabla_\theta J(\theta): the current gradient.

In large datasets, we typically don’t compute the gradient over the entire dataset at each step. Instead, we use small batches (mini-batches) of data. This is where the term stochastic comes from: it refers to the randomness introduced by using only a subset of the data at each step.
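The stochastic part is easiest to see in a minimal mini-batch sketch (toy data; the batch size and dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true

theta = np.zeros(3)
lr, batch = 0.1, 32
for step in range(1000):
    idx = rng.integers(0, len(y), size=batch)  # sample a random mini-batch
    xb, yb = X[idx], y[idx]
    grad = xb.T @ (xb @ theta - yb) / batch    # noisy gradient estimate
    theta = theta - lr * grad                  # theta = theta - alpha * grad
```

Each step uses a different random subset, so the gradient is a noisy but cheap estimate of the full-dataset gradient.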

Exponential Moving Average (EMA)

Before diving into Momentum and Adam, it’s important to first understand Exponential Moving Average (EMA). The formula is as follows:

v_t = \beta v_{t-1} + (1 - \beta) \theta_t

where:

  • \theta_t: the current value.
  • v_t: the exponentially weighted moving average.
  • \beta: the smoothing parameter.

The influence of each value decays exponentially over time. More recent values have greater weight, while older values still contribute with diminishing influence. As a result, EMA can smooth out noise in data, such as occasional spikes in gradient, while still preserving overall trends.
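A minimal sketch of the recurrence (the spike value is arbitrary, chosen to show the smoothing):

```python
def ema(values, beta=0.9):
    """v_t = beta * v_{t-1} + (1 - beta) * x_t, starting from v_0 = 0."""
    v, out = 0.0, []
    for x in values:
        v = beta * v + (1 - beta) * x
        out.append(v)
    return out

# A flat series with one spike: the EMA damps the spike heavily.
smoothed = ema([1.0, 1.0, 10.0, 1.0, 1.0], beta=0.9)
```

Note that starting from v_0 = 0 biases the early averages toward zero; this is exactly the bias that Adam's correction terms address.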

Momentum Optimizer

The idea behind Momentum comes from the concept of momentum in physics. By applying an EMA to the gradients, it adds inertia to the updates. The update formula is:

v_t = \beta v_{t-1} + (1 - \beta) \nabla_\theta J(\theta) \\\\ \theta = \theta - \alpha v_t

This method helps accelerate updates in directions where the gradient is consistently pointing the same way, and dampens oscillations in directions where the gradient fluctuates.
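A minimal sketch of one momentum step on a toy quadratic loss (the function name and constants are illustrative):

```python
import numpy as np

def momentum_step(theta, grad, v, lr=0.1, beta=0.9):
    """EMA of gradients, then step along the averaged direction."""
    v = beta * v + (1 - beta) * grad
    theta = theta - lr * v
    return theta, v

# Toy quadratic loss J(theta) = 0.5 * ||theta||^2, so grad = theta.
theta = np.array([1.0, -2.0])
v = np.zeros_like(theta)
for _ in range(200):
    grad = theta
    theta, v = momentum_step(theta, grad, v)
```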

RMSprop (Root Mean Square Prop) Optimizer

Root Mean Square Prop (RMSprop) doesn’t focus on direction like Momentum does. Instead, it allows each parameter dimension to have its own adaptive learning rate. The update formula is:

s_t = \beta s_{t-1} + (1 - \beta) [\nabla_\theta J(\theta)]^2 \\\\ \theta = \theta - \frac{\alpha}{\sqrt{s_t + \varepsilon}} \nabla_\theta J(\theta)

where:

  • s_t: the moving average of the squared gradients for each parameter dimension.
  • \varepsilon: a small constant to avoid division by zero.

The square root operation causes updates to shrink automatically in dimensions with large gradients, while still allowing small gradient dimensions to update meaningfully. RMSprop enables the learning rate to adapt individually for each dimension, which helps reduce oscillations and improves convergence, especially when gradient magnitudes vary greatly across parameters.
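A minimal RMSprop sketch on the same kind of toy quadratic (names and constants are illustrative):

```python
import numpy as np

def rmsprop_step(theta, grad, s, lr=0.01, beta=0.9, eps=1e-8):
    """Scale each dimension by the RMS of its recent gradients."""
    s = beta * s + (1 - beta) * grad**2
    theta = theta - lr * grad / np.sqrt(s + eps)
    return theta, s

# grad = theta for J(theta) = 0.5 * ||theta||^2. Note that theta[1] has
# twice the gradient of theta[0] but does not get a twice-larger step.
theta = np.array([1.0, -2.0])
s = np.zeros_like(theta)
for _ in range(500):
    grad = theta
    theta, s = rmsprop_step(theta, grad, s)
```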

Adam (Adaptive Moment Estimation) Optimizer

Adaptive Moment Estimation (Adam) combines the ideas of Momentum and RMSprop. It simultaneously tracks the exponentially weighted average of the gradient (like Momentum) and of the squared gradient (like RMSprop):

m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla_\theta J(\theta) \\\\ v_t = \beta_2 v_{t-1} + (1 - \beta_2) [\nabla_\theta J(\theta)]^2

However, these estimates are biased toward zero in the early steps, especially when t is small. To correct for this, Adam applies bias correction:

\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \\\\ \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \\\\ \theta = \theta - \frac{\alpha}{\sqrt{\hat{v}_t} + \varepsilon} \hat{m}_t

This design allows Adam to automatically adjust the step size for each parameter, leading to fast and stable convergence.
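Putting the pieces together, here is a minimal one-function Adam sketch (the helper name and toy loss are made up; the constants follow the common defaults except for a larger learning rate so the toy run converges quickly):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """Momentum EMA + squared-gradient EMA, with bias correction."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat = m / (1 - b1**t)            # bias correction (t starts at 1)
    v_hat = v / (1 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy quadratic loss J(theta) = 0.5 * ||theta||^2, so grad = theta.
theta = np.array([1.0, -2.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 1001):
    grad = theta
    theta, m, v = adam_step(theta, grad, m, v, t)
```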

L2 Regularization and Weight Decay

L2 Regularization

When training neural networks, we want the model to avoid memorizing the training data too closely, a failure mode known as overfitting. To encourage the model to learn smoother, more generalizable solutions, we often add a regularization term to the loss function. The most common form is L2 regularization.

L2 regularization penalizes the squared values of the parameters:

J_{\text{reg}}(\theta) = J(\theta) + \frac{\lambda}{2} \displaystyle\sum_i \theta_i^2

where:

  • J(\theta): the original loss function.
  • \lambda: the regularization coefficient that controls the strength of the penalty.
  • \theta_i: the model parameters.

This penalty encourages the model to find solutions with smaller weights, which reduces reliance on any single input feature and helps prevent overfitting.

Weight Decay

In practice, L2 regularization can be implemented either by modifying the loss function directly or by adjusting the parameter update rule. The latter is called weight decay:

\theta = \theta - \alpha \left( \nabla_\theta J(\theta) + \lambda \theta \right)

This formula is mathematically equivalent to L2 regularization in SGD. The term \lambda \theta is added directly to the gradient, causing the weights to shrink slightly with each update. Hence, it is called weight decay.
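The equivalence is easy to verify numerically for a single SGD step (the values are arbitrary):

```python
import numpy as np

theta = np.array([1.0, -2.0])
grad = np.array([0.3, 0.7])   # pretend gradient of the unregularized loss
lr, lam = 0.1, 0.01

# 1) L2 regularization: gradient of J(theta) + (lam/2) * ||theta||^2
upd_l2 = theta - lr * (grad + lam * theta)

# 2) Weight decay: shrink the weights, then take the plain gradient step
upd_wd = (1 - lr * lam) * theta - lr * grad
```

Both produce identical parameters, which is why the two views coincide for plain SGD.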

In Adam, L2 Regularization ≠ Weight Decay!

While L2 regularization and weight decay are equivalent in standard SGD, this is not the case with Adam.

Adam updates parameters not just based on the raw gradient but also by applying bias-corrected smoothing and per-parameter adaptive scaling. In other words, the learning rate is different for each dimension.

If you apply L2 regularization by adding \lambda \theta to the loss function (or directly into the gradient), that penalty will also be subject to Adam’s adaptive scaling and momentum. As a result, the regularization effect gets distorted, and some parameters may be over-penalized, while others are barely affected. This unintended behavior breaks the original purpose of regularization.

AdamW (Adam with Decoupled Weight Decay) Optimizer

To address the issue described above, Adam with decoupled weight decay (AdamW) was proposed in 2017. It separates the weight decay term \lambda \theta from the gradient computation and applies it directly to the parameters:

\theta = \theta - \left( \frac{\alpha\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon} + \alpha \lambda \theta \right)

The benefit of this approach is that the regularization effect is no longer influenced by the gradient momentum or adaptive scaling. It also allows the strength of regularization to be controlled independently. This correction restores the intended behavior of weight decay. In practice, frameworks like PyTorch now include AdamW as a built-in optimizer, and it has gradually replaced the original Adam as the standard choice in modern training.
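A minimal single-step sketch of the decoupled update (the function name is illustrative; the key point is that the decay term bypasses m and v entirely):

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    """Adam step plus weight decay applied directly to the parameters."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    # The decay is not fed through m or v, so the adaptive scaling and
    # momentum never distort it.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * theta
    return theta, m, v

theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
theta, m, v = adamw_step(theta, np.array([1.0]), m, v, t=1)
```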

Hyperparameters of AdamW

The AdamW formula involves several hyperparameters, and it can be unclear how to set them when you’re just getting started. Below are some commonly used values:

| Parameter | Description | Commonly used value |
| --- | --- | --- |
| \alpha | Learning rate | 0.001 |
| \beta_1 | Momentum coefficient | 0.9 |
| \beta_2 | RMSprop coefficient | 0.999 |
| \varepsilon | Prevents division by zero | 10^{-8} |
| \lambda | Weight decay strength | 0.01 |

Here is an example of initializing AdamW in PyTorch:

from torch.optim import AdamW

optimizer = AdamW(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0.01
)

Conclusion

Adam is the result of several generations of optimizer evolution, combining the directional stability of Momentum with the adaptive learning rate mechanism of RMSprop. It has become a mainstream choice thanks to its fast convergence and strong adaptability. AdamW goes a step further by resolving the issue of ineffective L2 regularization in Adam, making it one of the most reliable optimizers for modern neural network training.
