Introduction to Optimization for ML Beginners

Optimization is the engine that powers all machine learning. When we say a model "learns," what we really mean is that an optimization algorithm adjusts the model's parameters to minimize a loss function. Understanding optimization is understanding how ML actually works.

The Optimization Problem in ML

Every ML training process solves the same fundamental problem: find parameters θ that minimize a loss function L(θ):

Python
# The ML optimization problem in pseudocode:
# theta* = argmin_theta L(theta)
#        = argmin_theta (1/N) * sum(loss(model(x_i, theta), y_i))

# In practice with PyTorch:
# for epoch in range(num_epochs):
#     for batch in dataloader:
#         optimizer.zero_grad()              # Reset gradients from the previous step
#         predictions = model(batch.x)       # Forward pass
#         loss = criterion(predictions, batch.y)  # Compute loss
#         loss.backward()                    # Backpropagate to compute gradients
#         optimizer.step()                   # Update parameters

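To make the argmin concrete before we touch any framework, here is a minimal, self-contained sketch in plain NumPy that minimizes a mean-squared-error loss by gradient descent. The toy data, the learning rate of 0.1, and the single-parameter model y = w * x are illustrative choices, not part of any library:

```python
import numpy as np

# Toy problem: fit y = w * x by minimizing mean squared error.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)  # true slope is 3.0

def loss(w):
    return np.mean((w * x - y) ** 2)

def grad(w):
    # dL/dw = 2 * mean(x * (w*x - y))
    return 2.0 * np.mean(x * (w * x - y))

w = 0.0    # initial parameter
lr = 0.1   # learning rate
for step in range(100):
    w -= lr * grad(w)   # the core update: theta <- theta - lr * dL/dtheta

# After training, w is close to the true slope of 3.0,
# and loss(w) is close to the noise floor.
```

The loop is exactly the argmin from the formula above, unrolled into repeated small steps downhill; the PyTorch loop sketched earlier does the same thing with autograd computing `grad` for us.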
Key Insight: The choice of optimizer, learning rate, and training schedule can be the difference between a model that converges in minutes and one that never converges at all. Optimization is not just a detail — it is a core competency for ML practitioners.
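As a quick illustration of how much the learning rate alone matters, consider gradient descent on the one-dimensional loss L(w) = w^2 (a toy example chosen for this note, not taken from any library):

```python
def run_gd(lr, steps=50, w0=1.0):
    """Gradient descent on L(w) = w**2, whose gradient is 2*w."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w
    return w

small = run_gd(0.1)   # each step multiplies w by 0.8: converges toward 0
large = run_gd(1.1)   # each step multiplies w by -1.2: |w| grows, diverges
```

With lr = 0.1 the iterate shrinks toward the minimum at 0; with lr = 1.1 the same algorithm on the same loss overshoots further on every step and blows up. Nothing about the model changed, only one hyperparameter.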

Key Challenges

Non-convexity: Loss surfaces of neural networks have many local minima and saddle points. Mitigations: mini-batch (SGD) noise, momentum.

High dimensionality: Modern models have billions of parameters. Mitigations: first-order methods (no Hessian needed).

Noisy gradients: Mini-batch gradients are noisy estimates of the true gradient. Mitigations: momentum, adaptive learning rates.

Ill-conditioning: Loss-surface curvature varies dramatically across dimensions. Mitigations: Adam, preconditioning, normalization.

Generalization: Minimizing training loss does not guarantee good test performance. Mitigations: early stopping, regularization, dropout.
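Two of these challenges, ill-conditioning and the value of momentum, are easy to see on a small quadratic. The sketch below (an illustrative example with made-up curvatures, not any library's implementation) compares plain gradient descent with classical momentum on a loss whose curvature differs by a factor of 100 between its two dimensions:

```python
import numpy as np

# Ill-conditioned quadratic: L(w) = 0.5 * (100*w1^2 + 1*w2^2).
# Curvature along w1 is 100x the curvature along w2, so no single
# learning rate suits both directions well.
curv = np.array([100.0, 1.0])

def grad(w):
    return curv * w

def gd(lr, steps=200):
    w = np.array([1.0, 1.0])
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def gd_momentum(lr, beta=0.9, steps=200):
    w = np.array([1.0, 1.0])
    v = np.zeros(2)
    for _ in range(steps):
        v = beta * v + grad(w)   # accumulate a velocity across steps
        w = w - lr * v
    return w

# Stability in the steep direction forces lr < 2/100 = 0.02, which
# makes plain gradient descent crawl in the flat direction.
plain = np.linalg.norm(gd(0.015))
heavy = np.linalg.norm(gd_momentum(0.015))
```

With the same learning rate and the same number of steps, the momentum run ends far closer to the minimum: the accumulated velocity speeds up the flat direction while the oscillations in the steep direction partially cancel. Adaptive methods like Adam attack the same problem by rescaling each coordinate individually.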

Course Roadmap

  1. Gradient Descent

    The foundational algorithm and its variants: batch, stochastic, mini-batch, and with momentum.

  2. Modern Optimizers

    Adaptive methods (Adam, AdaGrad, RMSProp) that automatically tune learning rates per parameter.

  3. Convex Optimization

    The theoretical foundation: when can we guarantee finding the global minimum?

  4. Hyperparameter Tuning

    Systematic methods for finding the best training configuration.

Ready to Begin?

Let's start with the algorithm that started it all: gradient descent.

Next: Gradient Descent →