Introduction to Optimization for ML Beginners

Optimization is the engine that powers all machine learning. When we say a model "learns," what we really mean is that an optimization algorithm adjusts the model's parameters to minimize a loss function. Understanding optimization is understanding how ML actually works.

The Optimization Problem in ML

Every ML training process solves the same fundamental problem: find parameters θ that minimize a loss function L(θ):

Python
# The ML optimization problem in pseudocode:
# theta* = argmin_theta L(theta)
#        = argmin_theta (1/N) * sum(loss(model(x_i, theta), y_i))

# In practice with PyTorch:
# for epoch in range(num_epochs):
#     for batch in dataloader:
#         optimizer.zero_grad()              # Reset gradients from the previous step
#         predictions = model(batch.x)       # Forward pass
#         loss = criterion(predictions, batch.y)  # Compute loss
#         loss.backward()                    # Backpropagate to compute gradients
#         optimizer.step()                   # Update parameters

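To make the argmin concrete before we touch any framework, here is a minimal, self-contained sketch in plain NumPy that minimizes a mean-squared-error loss by gradient descent. The toy data, the learning rate of 0.1, and the single-parameter model y = w * x are illustrative choices, not part of any library:

```python
import numpy as np

# Toy problem: fit y = w * x by minimizing mean squared error.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)  # true slope is 3.0

def loss(w):
    return np.mean((w * x - y) ** 2)

def grad(w):
    # dL/dw = 2 * mean(x * (w*x - y))
    return 2.0 * np.mean(x * (w * x - y))

w = 0.0    # initial parameter
lr = 0.1   # learning rate
for step in range(100):
    w -= lr * grad(w)   # the core update: theta <- theta - lr * dL/dtheta

# After training, w is close to the true slope of 3.0,
# and loss(w) is close to the noise floor.
```

The loop is exactly the argmin from the formula above, unrolled into repeated small steps downhill; the PyTorch loop sketched earlier does the same thing with autograd computing `grad` for us.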
Key Insight: The choice of optimizer, learning rate, and training schedule can be the difference between a model that converges in minutes and one that never converges at all. Optimization is not just a detail — it is a core competency for ML practitioners.
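As a quick illustration of how much the learning rate alone matters, consider gradient descent on the one-dimensional loss L(w) = w^2 (a toy example chosen for this note, not taken from any library):

```python
def run_gd(lr, steps=50, w0=1.0):
    """Gradient descent on L(w) = w**2, whose gradient is 2*w."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w
    return w

small = run_gd(0.1)   # each step multiplies w by 0.8: converges toward 0
large = run_gd(1.1)   # each step multiplies w by -1.2: |w| grows, diverges
```

With lr = 0.1 the iterate shrinks toward the minimum at 0; with lr = 1.1 the same algorithm on the same loss overshoots further on every step and blows up. Nothing about the model changed, only one hyperparameter.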

Key Challenges

Non-convexity: Loss surfaces of neural networks have many local minima and saddle points. Mitigations: mini-batch (SGD) noise, momentum.

High dimensionality: Modern models have billions of parameters. Mitigations: first-order methods (no Hessian needed).

Noisy gradients: Mini-batch gradients are noisy estimates of the true gradient. Mitigations: momentum, adaptive learning rates.

Ill-conditioning: Loss-surface curvature varies dramatically across dimensions. Mitigations: Adam, preconditioning, normalization.

Generalization: Minimizing training loss does not guarantee good test performance. Mitigations: early stopping, regularization, dropout.
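Two of these challenges, ill-conditioning and the value of momentum, are easy to see on a small quadratic. The sketch below (an illustrative example with made-up curvatures, not any library's implementation) compares plain gradient descent with classical momentum on a loss whose curvature differs by a factor of 100 between its two dimensions:

```python
import numpy as np

# Ill-conditioned quadratic: L(w) = 0.5 * (100*w1^2 + 1*w2^2).
# Curvature along w1 is 100x the curvature along w2, so no single
# learning rate suits both directions well.
curv = np.array([100.0, 1.0])

def grad(w):
    return curv * w

def gd(lr, steps=200):
    w = np.array([1.0, 1.0])
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def gd_momentum(lr, beta=0.9, steps=200):
    w = np.array([1.0, 1.0])
    v = np.zeros(2)
    for _ in range(steps):
        v = beta * v + grad(w)   # accumulate a velocity across steps
        w = w - lr * v
    return w

# Stability in the steep direction forces lr < 2/100 = 0.02, which
# makes plain gradient descent crawl in the flat direction.
plain = np.linalg.norm(gd(0.015))
heavy = np.linalg.norm(gd_momentum(0.015))
```

With the same learning rate and the same number of steps, the momentum run ends far closer to the minimum: the accumulated velocity speeds up the flat direction while the oscillations in the steep direction partially cancel. Adaptive methods like Adam attack the same problem by rescaling each coordinate individually.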

Course Roadmap

  1. Gradient Descent

    The foundational algorithm and its variants: batch, stochastic, mini-batch, and with momentum.

  2. Modern Optimizers

    Adaptive methods (Adam, AdaGrad, RMSProp) that automatically tune learning rates per parameter.

  3. Convex Optimization

    The theoretical foundation: when can we guarantee finding the global minimum?

  4. Hyperparameter Tuning

    Systematic methods for finding the best training configuration.

Ready to Begin?

Let's start with the algorithm that started it all: gradient descent.

Next: Gradient Descent →