Intermediate

Convolutional Neural Networks (CNNs)

Learn how CNNs process images through convolution operations, filters, and pooling layers. Explore landmark architectures and build your own CNN with PyTorch.

What Are CNNs?

Convolutional Neural Networks are specialized neural networks designed for processing grid-like data, particularly images. Unlike fully connected networks that flatten images into 1D vectors (losing spatial information), CNNs preserve the 2D spatial structure and exploit local patterns through convolution operations.

CNNs are the backbone of modern computer vision. They power image classification, object detection, facial recognition, medical image analysis, and much more.

The Convolution Operation

Convolution is a mathematical operation that slides a small matrix (called a filter or kernel) across the input image, computing element-wise multiplication and summing the results at each position. This produces a feature map that highlights specific patterns.

Convolution Concept
# A 3x3 filter sliding across an image:

Input Image (5x5):        Filter (3x3):
[1 0 1 0 1]               [1 0 1]
[0 1 0 1 0]               [0 1 0]
[1 0 1 0 1]               [1 0 1]
[0 1 0 1 0]
[1 0 1 0 1]

# At position (0,0):
# (1*1 + 0*0 + 1*1) + (0*0 + 1*1 + 0*0) + (1*1 + 0*0 + 1*1) = 5

# The filter "detects" a specific pattern
# Different filters detect edges, textures, shapes, etc.
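The sliding-window computation above can be verified directly with PyTorch's `conv2d` (a minimal sketch; note that `conv2d` performs cross-correlation, which matches the sliding description here since the filter is not flipped):

```python
import torch
import torch.nn.functional as F

# The 5x5 input and 3x3 filter from the example above
image = torch.tensor([[1, 0, 1, 0, 1],
                      [0, 1, 0, 1, 0],
                      [1, 0, 1, 0, 1],
                      [0, 1, 0, 1, 0],
                      [1, 0, 1, 0, 1]], dtype=torch.float32)
kernel = torch.tensor([[1, 0, 1],
                       [0, 1, 0],
                       [1, 0, 1]], dtype=torch.float32)

# conv2d expects (batch, channels, height, width)
out = F.conv2d(image.view(1, 1, 5, 5), kernel.view(1, 1, 3, 3))
print(out.squeeze())
# tensor([[5., 0., 5.],
#         [0., 5., 0.],
#         [5., 0., 5.]])
```

The value 5 appears wherever the checkerboard pattern in the input lines up with the filter, and 0 where it does not, illustrating how a filter responds strongly only to the pattern it encodes.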

Filters and Kernels

Filters are small learnable weight matrices (typically 3x3 or 5x5) that detect specific features in the input:

  • Early layers: Filters learn to detect low-level features like edges, corners, and color gradients.
  • Middle layers: Filters combine low-level features into textures, patterns, and shapes.
  • Deep layers: Filters recognize high-level concepts like eyes, wheels, or entire objects.

Key parameters:

  • Stride: How many pixels the filter moves at each step. Stride of 1 moves one pixel at a time; stride of 2 skips every other position, reducing output size.
  • Padding: Adding zeros around the input border to control output dimensions. "Same" padding preserves input size; "valid" padding uses no padding.
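The output size follows the formula floor((N + 2P - F) / S) + 1 for input size N, filter size F, padding P, and stride S. A small sketch showing how stride and padding affect output dimensions (layer names here are illustrative):

```python
import torch
import torch.nn as nn

def conv_output_size(n, f, p, s):
    """floor((n + 2p - f) / s) + 1"""
    return (n + 2 * p - f) // s + 1

x = torch.randn(1, 1, 28, 28)

# "Same" padding for a 3x3 filter at stride 1: p=1 preserves 28x28
same = nn.Conv2d(1, 8, kernel_size=3, stride=1, padding=1)
# "Valid" (no) padding shrinks the output: 28 -> 26
valid = nn.Conv2d(1, 8, kernel_size=3, stride=1, padding=0)
# Stride 2 roughly halves the spatial size: 28 -> 14
strided = nn.Conv2d(1, 8, kernel_size=3, stride=2, padding=1)

print(same(x).shape)     # torch.Size([1, 8, 28, 28])
print(valid(x).shape)    # torch.Size([1, 8, 26, 26])
print(strided(x).shape)  # torch.Size([1, 8, 14, 14])
print(conv_output_size(28, 3, 1, 2))  # 14
```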

Pooling Layers

Pooling layers reduce the spatial dimensions of feature maps, decreasing computation and providing translation invariance:

  • Max Pooling: Takes the maximum value in each pooling window. Most common; preserves the strongest activations.
  • Average Pooling: Takes the average value. Smoother but may lose important details.
  • Global Average Pooling: Averages the entire feature map into a single value per channel. Often used before the final classification layer.
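The three pooling variants can be compared on a small feature map (the values below are an arbitrary example chosen for illustration):

```python
import torch
import torch.nn as nn

# A single-channel 4x4 feature map, shaped (batch, channels, h, w)
fmap = torch.tensor([[1., 3., 2., 0.],
                     [4., 8., 1., 5.],
                     [2., 0., 7., 3.],
                     [6., 1., 2., 9.]]).view(1, 1, 4, 4)

max_pool = nn.MaxPool2d(2)     # 2x2 windows, stride 2
avg_pool = nn.AvgPool2d(2)
gap = nn.AdaptiveAvgPool2d(1)  # global average pooling: one value per channel

print(max_pool(fmap).squeeze())  # tensor([[8., 5.], [6., 9.]])
print(avg_pool(fmap).squeeze())  # tensor([[4.0000, 2.0000], [2.2500, 5.2500]])
print(gap(fmap).item())          # 3.375 (mean of all 16 values)
```

Note how max pooling keeps the single strongest activation per window, while average pooling blends all four values together.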

CNN Architecture Patterns

The field has evolved through several landmark architectures:

| Architecture | Year | Key Innovation | Depth |
|---|---|---|---|
| LeNet-5 | 1998 | First practical CNN for digit recognition | 5 layers |
| AlexNet | 2012 | GPU training, ReLU, dropout. Won ImageNet. | 8 layers |
| VGG | 2014 | Uniform 3x3 filters, very deep | 16–19 layers |
| GoogLeNet/Inception | 2014 | Inception modules with multi-scale filters | 22 layers |
| ResNet | 2015 | Skip connections enabling very deep networks | 50–152 layers |
| EfficientNet | 2019 | Compound scaling of depth, width, and resolution | Varies |
💡 ResNet's key insight: Skip connections (residual connections) allow gradients to flow directly through the network, solving the vanishing gradient problem. Instead of learning a mapping H(x), the network learns the residual F(x) = H(x) - x, making it easier to train very deep networks.
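A residual block in code makes the F(x) + x idea concrete. This is a minimal sketch of the identity-mapping case (same channel count, stride 1); real ResNet blocks also handle downsampling via a 1x1 projection on the skip path:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal ResNet-style basic block: output = ReLU(F(x) + x)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        residual = x  # the skip connection: gradients flow straight through
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + residual)  # F(x) + x

block = ResidualBlock(16)
x = torch.randn(1, 16, 8, 8)
print(block(x).shape)  # torch.Size([1, 16, 8, 8]) -- shape preserved
```

Because the addition passes x through unchanged, the block only needs to learn the residual correction F(x), which is easy to drive toward zero when the identity is already a good mapping.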

Transfer Learning

Transfer learning is the practice of using a model pre-trained on a large dataset (like ImageNet with 1.2M images) as a starting point for a new task. This is extremely powerful because:

  • Pre-trained models have already learned general visual features (edges, textures, shapes).
  • You need far less data for your specific task.
  • Training is much faster since most weights are already good.

Common approaches:

  • Feature extraction: Freeze all pre-trained layers and only train a new classifier head.
  • Fine-tuning: Unfreeze some or all pre-trained layers and train with a very low learning rate.

Image Classification Example with PyTorch

Python (PyTorch)
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms, models

# Define a CNN from scratch
class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),              # 28x28 -> 14x14
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),              # 14x14 -> 7x7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 128),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

# Data loading with augmentation
transform = transforms.Compose([
    transforms.RandomRotation(10),
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

train_data = datasets.MNIST('./data', train=True,
                            download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(
    train_data, batch_size=64, shuffle=True)

# Training
model = SimpleCNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

for epoch in range(5):
    model.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch+1} complete")

# --- Transfer Learning with ResNet ---
resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
# Freeze all pre-trained layers
for param in resnet.parameters():
    param.requires_grad = False
# Replace the final layer for our task (e.g., 10 classes)
num_classes = 10
resnet.fc = nn.Linear(resnet.fc.in_features, num_classes)
# Only the new fc layer will be trained

💡 Practical advice: Always start with transfer learning using a pre-trained model like ResNet or EfficientNet. Train from scratch only when your data is very different from ImageNet (e.g., medical images or satellite imagery) or when you need a tiny model for edge deployment.