VGG and GoogLeNet
A comprehensive guide to VGG and GoogLeNet in the context of CNN architectures.
VGGNet: The Power of Depth (2014)
VGGNet, developed by the Visual Geometry Group at Oxford, demonstrated a powerful principle: using very small (3x3) convolutional filters consistently throughout the network and simply stacking more layers leads to better performance. Its two most widely used variants are VGG-16 (16 weight layers) and VGG-19 (19 weight layers).
The key insight was that two stacked 3x3 convolutions have the same receptive field as one 5x5 convolution, but with fewer parameters and more non-linearity (two ReLU activations instead of one). Three stacked 3x3 convolutions have the same receptive field as a 7x7 convolution. This decomposition into smaller filters became a fundamental principle in CNN design.
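The parameter savings from this decomposition can be checked with a few lines of arithmetic. The channel count C = 256 below is just an illustrative value; the ratios are the same for any C (bias terms ignored):

```python
# Weight count of one large filter vs. stacked 3x3 filters,
# assuming C input channels and C output channels, no biases.

def conv_weights(k, c_in, c_out):
    return k * k * c_in * c_out

C = 256  # illustrative channel count; the ratios below hold for any C

one_5x5   = conv_weights(5, C, C)      # 25 * C^2
two_3x3   = 2 * conv_weights(3, C, C)  # 18 * C^2, same 5x5 receptive field
one_7x7   = conv_weights(7, C, C)      # 49 * C^2
three_3x3 = 3 * conv_weights(3, C, C)  # 27 * C^2, same 7x7 receptive field

print(two_3x3 / one_5x5)               # 0.72 -> 28% fewer parameters
print(round(three_3x3 / one_7x7, 2))   # 0.55 -> 45% fewer parameters
```

On top of the parameter savings, each stacked 3x3 layer adds a ReLU, so the stacked version is also more expressive.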
# VGG-16 architecture pattern
import torch
import torch.nn as nn

class VGG16(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1: 2x conv3-64
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
            # Block 2: 2x conv3-128
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
            # Block 3: 3x conv3-256
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
            # Block 4: 3x conv3-512
            nn.Conv2d(256, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
            # Block 5: 3x conv3-512
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
        )
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)     # 224x224x3 input -> 7x7x512 feature map
        x = torch.flatten(x, 1)  # flatten to (batch, 512*7*7)
        return self.classifier(x)
GoogLeNet / Inception (2014)
While VGG went deeper with uniform layers, GoogLeNet (also called Inception v1) went wider. It introduced the Inception module, which applies multiple filter sizes (1x1, 3x3, 5x5) in parallel and concatenates their outputs along the channel dimension. This allows the network to capture features at multiple scales simultaneously.
The Inception Module
The key innovation was the 1x1 convolution used for dimensionality reduction before expensive 3x3 and 5x5 convolutions. This bottleneck design reduced the computational cost by an order of magnitude while maintaining representational power.
- 1x1 convolution path — Captures point-wise features and reduces channels
- 1x1 then 3x3 path — Bottleneck then spatial features at medium scale
- 1x1 then 5x5 path — Bottleneck then spatial features at large scale
- 3x3 max pool then 1x1 path — Pooling features with channel reduction
Inception Evolution
The Inception architecture evolved through several versions, each addressing limitations of the previous:
- Inception v2 — Replaced 5x5 convolutions with two stacked 3x3 convolutions, added batch normalization
- Inception v3 — Factorized n x n convolutions into 1 x n followed by n x 1, further reducing parameters
- Inception v4 — A deeper, more uniform redesign of the Inception modules and the network stem, still without residual connections
- Inception-ResNet — Hybrid architecture combining Inception modules with ResNet-style residual connections, achieving state-of-the-art results on ImageNet at the time
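The v3 factorization is easy to quantify. As a simplifying assumption, take C channels throughout (including between the two factored layers) and ignore biases:

```python
# Parameter count of an n x n conv vs. its 1 x n followed by n x 1
# factorization (Inception v3 style), assuming C channels throughout.

def factorization_ratio(n, C=256):
    full = n * n * C * C               # one n x n conv: n^2 * C^2 weights
    factored = 2 * n * C * C           # 1 x n plus n x 1: 2n * C^2 weights
    return factored / full             # simplifies to 2 / n

print(round(factorization_ratio(3), 3))  # 0.667 -> ~33% fewer parameters
print(round(factorization_ratio(7), 3))  # 0.286 -> ~71% fewer parameters
```

The ratio is 2/n, so the savings grow with filter size, which is why v3 applies this factorization most aggressively to its larger filters.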
VGG vs GoogLeNet: Two Philosophies
VGG and GoogLeNet represent two fundamental approaches to CNN design that continue to influence modern architectures:
- VGG philosophy: Simple, uniform building blocks stacked deep. Easy to understand, implement, and modify. More parameters but straightforward to scale.
- Inception philosophy: Complex, multi-branch modules with careful efficiency engineering. Fewer parameters, better computation/accuracy trade-off, but harder to design and modify.
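The parameter gap between the two philosophies can be made concrete by tallying VGG-16's weights directly from its layer specification (bias terms included). The total lands at the commonly quoted ~138 million, the bulk of it in the fully connected layers; GoogLeNet v1 is typically cited at only a few million parameters:

```python
# Parameter count of VGG-16, computed from its layer specification.
# 3x3 convs with padding=1 preserve spatial size; each 2x2 max pool halves it,
# so a 224x224 input reaches the classifier as a 7x7x512 feature map.

def conv_params(c_in, c_out, k=3):
    return k * k * c_in * c_out + c_out   # weights + biases

def fc_params(n_in, n_out):
    return n_in * n_out + n_out           # weights + biases

# (in_channels, out_channels) for each 3x3 conv layer, block by block
convs = [(3, 64), (64, 64),                    # block 1
         (64, 128), (128, 128),                # block 2
         (128, 256), (256, 256), (256, 256),   # block 3
         (256, 512), (512, 512), (512, 512),   # block 4
         (512, 512), (512, 512), (512, 512)]   # block 5

fcs = [(512 * 7 * 7, 4096), (4096, 4096), (4096, 1000)]

conv_total = sum(conv_params(a, b) for a, b in convs)
fc_total = sum(fc_params(a, b) for a, b in fcs)

print(f"conv layers: {conv_total:,}")   # ~14.7M
print(f"fc layers:   {fc_total:,}")     # ~123.6M
print(f"total:       {conv_total + fc_total:,}")  # 138,357,544
```

Note that roughly 90% of VGG-16's parameters sit in the three fully connected layers, which is exactly the kind of cost the Inception design avoids through global pooling and bottlenecks.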
The next lesson covers ResNet, which solved the degradation problem that prevented training of very deep networks and introduced skip connections that transformed all of deep learning.
Lilly Tech Systems