Adam & Modern Optimizers

Adaptive optimizers automatically adjust the learning rate for each parameter based on the history of its gradients. Adam (Adaptive Moment Estimation) combines momentum with RMSProp-style per-parameter adaptive learning rates and adds bias correction; together with its AdamW variant, it is the default choice for most deep learning tasks today.

The Evolution of Optimizers

Optimizer         Key Idea                                                 Year
SGD               Basic gradient descent with constant learning rate       1951
SGD + Momentum    Accumulate velocity to smooth updates                    1964
AdaGrad           Per-parameter learning rates based on gradient history   2011
RMSProp           Fix AdaGrad's diminishing learning rates with decay      2012
Adam              Combine momentum + adaptive rates + bias correction      2015
AdamW             Fix Adam's weight decay implementation                   2019

Adam Implementation

Python
import numpy as np

def adam(gradient_fn, x0, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=100):
    x = x0.copy()
    m = np.zeros_like(x)  # First moment (mean of gradients)
    v = np.zeros_like(x)  # Second moment (mean of squared gradients)

    for t in range(1, n_steps + 1):
        g = gradient_fn(x)

        # Update moments
        m = beta1 * m + (1 - beta1) * g         # Momentum
        v = beta2 * v + (1 - beta2) * g**2       # Adaptive rates

        # Bias correction (critical for early steps)
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)

        # Update parameters
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)

    return x
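To see why the bias correction is "critical for early steps", consider a constant gradient of 1. Because `m` starts at zero, the raw first-moment estimate is badly underscaled for the first few iterations; dividing by `1 - beta1**t` recovers the true gradient magnitude immediately. A minimal self-contained check:

```python
import numpy as np

beta1 = 0.9
g = 1.0   # constant gradient of magnitude 1
m = 0.0   # first moment, initialized to zero

for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * g
    m_hat = m / (1 - beta1 ** t)   # bias-corrected estimate
    print(t, round(m, 4), round(m_hat, 4))

# t=1: m = 0.1,   m_hat = 1.0
# t=2: m = 0.19,  m_hat = 1.0
# t=3: m = 0.271, m_hat = 1.0
```

Without the correction, the effective step size in the first iterations would be roughly 10x too small (for beta1 = 0.9), and far worse for the second moment with beta2 = 0.999.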

Using Optimizers in PyTorch

Python
import torch
import torch.optim as optim

model = MyModel()

# Adam: default choice for most tasks
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# AdamW: Adam with proper weight decay (for transformers)
optimizer = optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

# SGD with momentum: often better final performance
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
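Whichever optimizer you pick, the training loop that drives it is the same. A minimal end-to-end sketch, using a tiny linear-regression model as a stand-in for the unspecified `MyModel`:

```python
import torch
import torch.optim as optim

torch.manual_seed(0)
model = torch.nn.Linear(1, 1)          # stand-in for MyModel
optimizer = optim.AdamW(model.parameters(), lr=0.05, weight_decay=0.01)
loss_fn = torch.nn.MSELoss()

x = torch.randn(64, 1)
y = 2.0 * x + 1.0                      # target: w = 2, b = 1

first_loss = None
for step in range(300):
    optimizer.zero_grad()              # clear gradients from the last step
    loss = loss_fn(model(x), y)        # forward pass
    loss.backward()                    # backprop populates .grad
    optimizer.step()                   # AdamW parameter update
    if first_loss is None:
        first_loss = loss.item()

print(first_loss, loss.item())         # loss should drop substantially
```

Swapping in `optim.Adam` or `optim.SGD` changes only the constructor line; the `zero_grad()` / `backward()` / `step()` cycle is identical.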

When to Use What

Practical Guidelines:
  • Start with AdamW (lr=1e-3 or 3e-4) for most deep learning tasks
  • Use SGD + momentum for computer vision (ResNets, etc.) when you can tune the learning rate schedule
  • Use AdamW for transformers and NLP tasks
  • Avoid plain Adam — use AdamW instead, which handles weight decay correctly
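The Adam-vs-AdamW distinction in the last bullet comes down to where weight decay enters the update. Plain Adam folds L2 regularization into the gradient, so the decay term gets rescaled by the adaptive denominator; AdamW instead subtracts `lr * weight_decay * x` from the weights directly. A single-step sketch mirroring the `adam()` function above (the `adamw_step` name is ours, not a library API):

```python
import numpy as np

def adamw_step(x, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update (sketch). Decay acts on the weights directly,
    so it is not distorted by the adaptive per-parameter scaling."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)  # ordinary Adam step
    x = x - lr * weight_decay * x                # decoupled weight decay
    return x, m, v

x = np.array([1.0])
m, v = np.zeros_like(x), np.zeros_like(x)
x, m, v = adamw_step(x, np.array([0.0]), m, v, t=1)
print(x)   # with a zero gradient, only the decay term shrinks x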

Next Up: Convex Optimization

Learn the theoretical foundations: when can we guarantee finding the global optimum?

Next: Convex Optimization →
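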