Adam & Modern Optimizers
Adaptive optimizers automatically adjust the learning rate for each parameter based on the history of its gradients. Adam (Adaptive Moment Estimation) combines momentum with RMSProp-style adaptive learning rates, and is the default choice for most deep learning tasks today.
The Evolution of Optimizers
| Optimizer | Key Idea | Year |
|---|---|---|
| SGD | Plain gradient descent with a single global learning rate | 1951 |
| SGD + Momentum | Accumulate velocity to smooth updates | 1964 |
| AdaGrad | Per-parameter learning rates based on gradient history | 2011 |
| RMSProp | Fix AdaGrad's diminishing learning rates with decay | 2012 |
| Adam | Combine momentum + adaptive rates + bias correction | 2015 |
| AdamW | Fix Adam's weight decay implementation | 2019 |
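The AdaGrad-to-RMSProp step in the table is worth seeing numerically. A minimal sketch (the constant gradient stream and hyperparameters below are illustrative, not from any particular training run):

```python
import numpy as np

# AdaGrad sums squared gradients forever, so its effective step keeps
# shrinking; RMSProp replaces the sum with an exponential moving average.
gs = np.full(1000, 0.5)   # a long run of constant gradients
lr, eps, decay = 0.1, 1e-8, 0.9

acc_ada, acc_rms = 0.0, 0.0
for g in gs:
    acc_ada += g**2                                   # AdaGrad: monotone accumulator
    acc_rms = decay * acc_rms + (1 - decay) * g**2    # RMSProp: decaying average

step_ada = lr * gs[-1] / (np.sqrt(acc_ada) + eps)  # shrinks with every step
step_rms = lr * gs[-1] / (np.sqrt(acc_rms) + eps)  # stabilizes near lr
print(step_ada, step_rms)
```

After 1000 identical gradients, the AdaGrad step has decayed by roughly a factor of 30, while the RMSProp step settles at about `lr` — exactly the "diminishing learning rates" problem the table refers to.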
Adam Implementation
```python
import numpy as np

def adam(gradient_fn, x0, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=100):
    x = x0.copy()
    m = np.zeros_like(x)  # First moment (mean of gradients)
    v = np.zeros_like(x)  # Second moment (mean of squared gradients)
    for t in range(1, n_steps + 1):
        g = gradient_fn(x)
        # Update moments
        m = beta1 * m + (1 - beta1) * g       # Momentum
        v = beta2 * v + (1 - beta2) * g**2    # Adaptive rates
        # Bias correction (critical for early steps)
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        # Update parameters
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x
```
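The "critical for early steps" comment on bias correction is easy to verify with one step of arithmetic (the gradient value below is arbitrary, chosen only for illustration):

```python
import numpy as np

# After a single step, the raw second moment v = (1 - beta2) * g**2 is
# ~1000x too small when beta2 = 0.999, because v started at zero.
g = np.array([2.0])
beta2 = 0.999
t = 1

v = (1 - beta2) * g**2      # raw estimate: ~0.004
v_hat = v / (1 - beta2**t)  # corrected: ~4.0, i.e. g**2 as intended
print(v, v_hat)
```

Without the correction, the denominator `sqrt(v_hat)` would be far too small early in training, producing wildly oversized first updates.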
Using Optimizers in PyTorch
```python
import torch
import torch.optim as optim

model = MyModel()

# Adam: default choice for most tasks
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# AdamW: Adam with proper weight decay (for transformers)
optimizer = optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

# SGD with momentum: often better final performance
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
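Constructing the optimizer is only half the story; it has to be driven inside a training loop. A minimal, self-contained sketch (the toy model, data, learning rate, and step count are illustrative):

```python
import torch
import torch.optim as optim

# Toy regression: fit y = 2x with a single linear layer using Adam.
torch.manual_seed(0)
model = torch.nn.Linear(1, 1)
optimizer = optim.Adam(model.parameters(), lr=0.1)

xs = torch.randn(64, 1)
ys = 2.0 * xs

for _ in range(300):
    optimizer.zero_grad()                    # clear gradients from the last step
    loss = ((model(xs) - ys) ** 2).mean()    # mean squared error
    loss.backward()                          # backprop: fill param.grad
    optimizer.step()                         # apply the Adam update
```

The `zero_grad` / `backward` / `step` pattern is identical regardless of which optimizer you construct, which is what makes swapping Adam for AdamW or SGD a one-line change.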
When to Use What
Practical Guidelines:
- Start with AdamW (lr=1e-3 or 3e-4) for most deep learning tasks
- Use SGD + momentum for computer vision (ResNets, etc.) when you can tune the learning rate schedule
- Use AdamW for transformers and NLP tasks
- Avoid plain Adam — use AdamW instead, which handles weight decay correctly
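The reason plain Adam mishandles weight decay can be shown in one update step. A sketch of the first step for a single weight (all numbers below are illustrative; `m_hat = g` and `v_hat = g**2` hold exactly at step one after bias correction):

```python
import numpy as np

# Adam + L2 folds the decay term into the gradient, so the adaptive
# denominator rescales it away; AdamW applies decay to the weight directly.
x, g = 10.0, 0.001            # large weight, tiny gradient
lr, wd, eps = 1e-3, 0.01, 1e-8

# Adam with L2 regularization: decay term is normalized by sqrt(v_hat)
g_l2 = g + wd * x
step_adam = lr * g_l2 / (np.sqrt(g_l2**2) + eps)   # ~lr, no matter how big wd is

# AdamW: decoupled decay, proportional to the weight itself
step_adamw = lr * g / (np.sqrt(g**2) + eps) + lr * wd * x
print(step_adam, step_adamw)
```

In the Adam-with-L2 case the decay signal is divided by its own magnitude, so large weights are barely regularized; AdamW's decoupled term shrinks them in proportion to their size, which is the fix the guideline above refers to.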
Next Up: Convex Optimization
Learn the theoretical foundations: when can we guarantee finding the global optimum?
Lilly Tech Systems