Convolutional Neural Networks (CNNs)
Learn how CNNs process images through convolution operations, filters, and pooling layers. Explore landmark architectures and build your own CNN with PyTorch.
What Are CNNs?
Convolutional Neural Networks are specialized neural networks designed for processing grid-like data, particularly images. Unlike fully connected networks that flatten images into 1D vectors (losing spatial information), CNNs preserve the 2D spatial structure and exploit local patterns through convolution operations.
CNNs are the backbone of modern computer vision. They power image classification, object detection, facial recognition, medical image analysis, and much more.
The Convolution Operation
Convolution is a mathematical operation that slides a small matrix (called a filter or kernel) across the input image, computing element-wise multiplication and summing the results at each position. This produces a feature map that highlights specific patterns.
```
# A 3x3 filter sliding across a 5x5 image:
#
# Input Image (5x5):    Filter (3x3):
# [1 0 1 0 1]           [1 0 1]
# [0 1 0 1 0]           [0 1 0]
# [1 0 1 0 1]           [1 0 1]
# [0 1 0 1 0]
# [1 0 1 0 1]
#
# At position (0,0):
# (1*1 + 0*0 + 1*1) + (0*0 + 1*1 + 0*0) + (1*1 + 0*0 + 1*1) = 5
#
# The filter "detects" a specific pattern.
# Different filters detect edges, textures, shapes, etc.
```
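The same computation can be written out in a few lines of plain Python (a minimal sketch with stride 1 and no padding; real frameworks vectorize this heavily):

```python
# The 5x5 input and 3x3 filter from the diagram above
image = [[1, 0, 1, 0, 1],
         [0, 1, 0, 1, 0],
         [1, 0, 1, 0, 1],
         [0, 1, 0, 1, 0],
         [1, 0, 1, 0, 1]]
kernel = [[1, 0, 1],
          [0, 1, 0],
          [1, 0, 1]]

def convolve2d(img, k):
    """Slide the filter across the image (stride 1, no padding),
    multiplying element-wise and summing at each position."""
    kh, kw = len(k), len(k[0])
    out_h = len(img) - kh + 1
    out_w = len(img[0]) - kw + 1
    out = [[0] * out_w for _ in range(out_h)]
    for r in range(out_h):
        for c in range(out_w):
            out[r][c] = sum(k[i][j] * img[r + i][c + j]
                            for i in range(kh) for j in range(kw))
    return out

feature_map = convolve2d(image, kernel)
print(feature_map[0][0])  # 5, matching the hand computation above
```

The feature map peaks wherever the input locally matches the filter's pattern, which is exactly what "detecting a feature" means.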
Filters and Kernels
Filters are small learnable weight matrices (typically 3x3 or 5x5) that detect specific features in the input:
- Early layers: Filters learn to detect low-level features like edges, corners, and color gradients.
- Middle layers: Filters combine low-level features into textures, patterns, and shapes.
- Deep layers: Filters recognize high-level concepts like eyes, wheels, or entire objects.
Key parameters:
- Stride: How many pixels the filter moves at each step. Stride of 1 moves one pixel at a time; stride of 2 skips every other position, reducing output size.
- Padding: Adding zeros around the input border to control output dimensions. "Same" padding preserves input size; "valid" padding uses no padding.
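Together, filter size, padding, and stride determine the output size: for an n×n input, k×k filter, padding p, and stride s, the output side is floor((n + 2p − k) / s) + 1. A quick sketch (the function name is illustrative):

```python
def conv_output_size(n, k, p, s):
    """Output side length for an n x n input convolved with a
    k x k filter, padding p, and stride s."""
    return (n + 2 * p - k) // s + 1

# "Same" padding (p=1 for a 3x3 filter) keeps a 28x28 input at 28x28:
print(conv_output_size(28, 3, 1, 1))  # 28
# Stride 2 halves the spatial size:
print(conv_output_size(28, 3, 1, 2))  # 14
# "Valid" padding (p=0) shrinks the output:
print(conv_output_size(28, 3, 0, 1))  # 26
```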
Pooling Layers
Pooling layers reduce the spatial dimensions of feature maps, decreasing computation and providing translation invariance:
- Max Pooling: Takes the maximum value in each pooling window. Most common; preserves the strongest activations.
- Average Pooling: Takes the average value. Smoother but may lose important details.
- Global Average Pooling: Averages the entire feature map into a single value per channel. Often used before the final classification layer.
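All three variants reduce to applying a summary function over windows of the feature map. A minimal plain-Python sketch for a 2x2 window with stride 2 (function names are illustrative):

```python
def pool2x2(fmap, op):
    """Apply a 2x2 pooling window with stride 2, summarizing each
    window with `op` (e.g. max, or a mean function)."""
    out = []
    for r in range(0, len(fmap), 2):
        row = []
        for c in range(0, len(fmap[0]), 2):
            window = [fmap[r][c], fmap[r][c + 1],
                      fmap[r + 1][c], fmap[r + 1][c + 1]]
            row.append(op(window))
        out.append(row)
    return out

def mean(vals):
    return sum(vals) / len(vals)

fmap = [[1, 3, 2, 4],
        [5, 7, 6, 8],
        [9, 2, 1, 0],
        [4, 6, 3, 2]]

print(pool2x2(fmap, max))   # [[7, 8], [9, 3]]
print(pool2x2(fmap, mean))  # [[4.0, 5.0], [5.25, 1.5]]

# Global average pooling: the whole map collapses to one value
print(mean([v for row in fmap for v in row]))  # 3.9375
```

Note how max pooling keeps only the strongest activation in each window: small shifts of a feature within a window produce the same output, which is where the translation invariance comes from.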
CNN Architecture Patterns
The field has evolved through several landmark architectures:
| Architecture | Year | Key Innovation | Depth |
|---|---|---|---|
| LeNet-5 | 1998 | First practical CNN for digit recognition | 5 layers |
| AlexNet | 2012 | GPU training, ReLU, dropout. Won ImageNet. | 8 layers |
| VGG | 2014 | Uniform 3x3 filters, very deep | 16–19 layers |
| GoogLeNet/Inception | 2014 | Inception modules with multi-scale filters | 22 layers |
| ResNet | 2015 | Skip connections enabling very deep networks | 50–152 layers |
| EfficientNet | 2019 | Compound scaling of depth, width, and resolution | Varies |
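ResNet's skip connection, the innovation that made 100+ layer networks trainable, can be sketched in a few lines of PyTorch. This is a simplified illustration of the idea (output = ReLU(F(x) + x)), not the exact torchvision implementation:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A basic ResNet-style block: the input is added back to the
    output of two conv layers, so gradients can flow around them."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        identity = x                     # the skip connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = self.relu(out + identity)  # add the input back in
        return out

block = ResidualBlock(64)
x = torch.randn(2, 64, 8, 8)
out = block(x)  # same shape as the input: (2, 64, 8, 8)
```

Because the block only needs to learn a residual correction F(x) rather than a full transformation, stacking many such blocks stays trainable where plain deep stacks would suffer from vanishing gradients.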
Transfer Learning
Transfer learning is the practice of using a model pre-trained on a large dataset (like ImageNet with 1.2M images) as a starting point for a new task. This is extremely powerful because:
- Pre-trained models have already learned general visual features (edges, textures, shapes).
- You need far less data for your specific task.
- Training is much faster since most weights are already good.
Common approaches:
- Feature extraction: Freeze all pre-trained layers and only train a new classifier head.
- Fine-tuning: Unfreeze some or all pre-trained layers and train with a very low learning rate.
Image Classification Example with PyTorch
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms, models

# Define a CNN from scratch
class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),  # 28x28 -> 14x14
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),  # 14x14 -> 7x7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 128),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

# Data loading with augmentation
transform = transforms.Compose([
    transforms.RandomRotation(10),
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),
])
train_data = datasets.MNIST('./data', train=True, download=True,
                            transform=transform)
train_loader = torch.utils.data.DataLoader(
    train_data, batch_size=64, shuffle=True)

# Training
model = SimpleCNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

for epoch in range(5):
    model.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch+1} complete")

# --- Transfer Learning with ResNet ---
resnet = models.resnet18(pretrained=True)

# Freeze all layers
for param in resnet.parameters():
    param.requires_grad = False

# Replace the final layer for our task
num_classes = 10  # set to the number of classes in your dataset
resnet.fc = nn.Linear(resnet.fc.in_features, num_classes)
# Only the new fc layer will be trained
```