Adam & Modern Optimizers

Adaptive optimizers automatically adjust the learning rate for each parameter based on the history of its gradients. Adam (Adaptive Moment Estimation) combines momentum with RMSProp-style per-parameter adaptive learning rates and adds bias correction; together with its AdamW variant, it is the default choice for most deep learning tasks today.

The Evolution of Optimizers

Optimizer         Key Idea                                                 Year
SGD               Basic gradient descent with constant learning rate       1951
SGD + Momentum    Accumulate velocity to smooth updates                    1964
AdaGrad           Per-parameter learning rates based on gradient history   2011
RMSProp           Fix AdaGrad's diminishing learning rates with decay      2012
Adam              Combine momentum + adaptive rates + bias correction      2015
AdamW             Fix Adam's weight decay implementation                   2019

Adam Implementation

Python
import numpy as np

def adam(gradient_fn, x0, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=100):
    x = x0.copy()
    m = np.zeros_like(x)  # First moment (mean of gradients)
    v = np.zeros_like(x)  # Second moment (mean of squared gradients)

    for t in range(1, n_steps + 1):
        g = gradient_fn(x)

        # Update moments
        m = beta1 * m + (1 - beta1) * g         # Momentum
        v = beta2 * v + (1 - beta2) * g**2       # Adaptive rates

        # Bias correction (critical for early steps)
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)

        # Update parameters
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)

    return x
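To see why the bias correction is "critical for early steps", consider a constant gradient of 1. Because `m` starts at zero, the raw first-moment estimate is badly underscaled for the first few iterations; dividing by `1 - beta1**t` recovers the true gradient magnitude immediately. A minimal self-contained check:

```python
import numpy as np

beta1 = 0.9
g = 1.0   # constant gradient of magnitude 1
m = 0.0   # first moment, initialized to zero

for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * g
    m_hat = m / (1 - beta1 ** t)   # bias-corrected estimate
    print(t, round(m, 4), round(m_hat, 4))

# t=1: m = 0.1,   m_hat = 1.0
# t=2: m = 0.19,  m_hat = 1.0
# t=3: m = 0.271, m_hat = 1.0
```

Without the correction, the effective step size in the first iterations would be roughly 10x too small (for beta1 = 0.9), and far worse for the second moment with beta2 = 0.999.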

Using Optimizers in PyTorch

Python
import torch
import torch.optim as optim

model = MyModel()

# Adam: default choice for most tasks
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# AdamW: Adam with proper weight decay (for transformers)
optimizer = optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

# SGD with momentum: often better final performance
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
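Whichever optimizer you pick, the training loop that drives it is the same. A minimal end-to-end sketch, using a tiny linear-regression model as a stand-in for the unspecified `MyModel`:

```python
import torch
import torch.optim as optim

torch.manual_seed(0)
model = torch.nn.Linear(1, 1)          # stand-in for MyModel
optimizer = optim.AdamW(model.parameters(), lr=0.05, weight_decay=0.01)
loss_fn = torch.nn.MSELoss()

x = torch.randn(64, 1)
y = 2.0 * x + 1.0                      # target: w = 2, b = 1

first_loss = None
for step in range(300):
    optimizer.zero_grad()              # clear gradients from the last step
    loss = loss_fn(model(x), y)        # forward pass
    loss.backward()                    # backprop populates .grad
    optimizer.step()                   # AdamW parameter update
    if first_loss is None:
        first_loss = loss.item()

print(first_loss, loss.item())         # loss should drop substantially
```

Swapping in `optim.Adam` or `optim.SGD` changes only the constructor line; the `zero_grad()` / `backward()` / `step()` cycle is identical.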

When to Use What

Practical Guidelines:
  • Start with AdamW (lr=1e-3 or 3e-4) for most deep learning tasks
  • Use SGD + momentum for computer vision (ResNets, etc.) when you can tune the learning rate schedule
  • Use AdamW for transformers and NLP tasks
  • Avoid plain Adam — use AdamW instead, which handles weight decay correctly
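The Adam-vs-AdamW distinction in the last bullet comes down to where weight decay enters the update. Plain Adam folds L2 regularization into the gradient, so the decay term gets rescaled by the adaptive denominator; AdamW instead subtracts `lr * weight_decay * x` from the weights directly. A single-step sketch mirroring the `adam()` function above (the `adamw_step` name is ours, not a library API):

```python
import numpy as np

def adamw_step(x, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update (sketch). Decay acts on the weights directly,
    so it is not distorted by the adaptive per-parameter scaling."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)  # ordinary Adam step
    x = x - lr * weight_decay * x                # decoupled weight decay
    return x, m, v

x = np.array([1.0])
m, v = np.zeros_like(x), np.zeros_like(x)
x, m, v = adamw_step(x, np.array([0.0]), m, v, t=1)
print(x)   # with a zero gradient, only the decay term shrinks x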

Next Up: Convex Optimization

Learn the theoretical foundations: when can we guarantee finding the global optimum?

Next: Convex Optimization →
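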