VGG and GoogLeNet
A comprehensive guide to VGG and GoogLeNet in the context of CNN architectures.
VGGNet: The Power of Depth (2014)
VGGNet, developed by the Visual Geometry Group at Oxford, demonstrated a powerful principle: using very small (3x3) convolutional filters consistently throughout the network and simply stacking more layers leads to better performance. Its two most widely used variants are VGG-16 (16 weight layers) and VGG-19 (19 weight layers).
The key insight was that two stacked 3x3 convolutions have the same receptive field as one 5x5 convolution, but with fewer parameters and more non-linearity (two ReLU activations instead of one). Three stacked 3x3 convolutions have the same receptive field as a 7x7 convolution. This decomposition into smaller filters became a fundamental principle in CNN design.
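The parameter savings from this decomposition can be checked with a few lines of arithmetic. The channel count C = 256 below is just an illustrative value; the ratios are the same for any C (bias terms ignored):

```python
# Weight count of one large filter vs. stacked 3x3 filters,
# assuming C input channels and C output channels, no biases.

def conv_weights(k, c_in, c_out):
    return k * k * c_in * c_out

C = 256  # illustrative channel count; the ratios below hold for any C

one_5x5   = conv_weights(5, C, C)      # 25 * C^2
two_3x3   = 2 * conv_weights(3, C, C)  # 18 * C^2, same 5x5 receptive field
one_7x7   = conv_weights(7, C, C)      # 49 * C^2
three_3x3 = 3 * conv_weights(3, C, C)  # 27 * C^2, same 7x7 receptive field

print(two_3x3 / one_5x5)               # 0.72 -> 28% fewer parameters
print(round(three_3x3 / one_7x7, 2))   # 0.55 -> 45% fewer parameters
```

On top of the parameter savings, each stacked 3x3 layer adds a ReLU, so the stacked version is also more expressive.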
# VGG-16 architecture pattern
import torch
import torch.nn as nn

class VGG16(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1: 2x conv3-64
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
            # Block 2: 2x conv3-128
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
            # Block 3: 3x conv3-256
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
            # Block 4: 3x conv3-512
            nn.Conv2d(256, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
            # Block 5: 3x conv3-512
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
        )
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)     # 224x224x3 input -> 7x7x512 feature map
        x = torch.flatten(x, 1)  # flatten to (batch, 512*7*7)
        return self.classifier(x)
GoogLeNet / Inception (2014)
While VGG went deeper with uniform layers, GoogLeNet (also called Inception v1) went wider. It introduced the Inception module, which applies multiple filter sizes (1x1, 3x3, 5x5) in parallel and concatenates their outputs along the channel dimension. This allows the network to capture features at multiple scales simultaneously.
The Inception Module
The key innovation was the 1x1 convolution used for dimensionality reduction before expensive 3x3 and 5x5 convolutions. This bottleneck design reduced the computational cost by an order of magnitude while maintaining representational power.
- 1x1 convolution path — Captures point-wise features and reduces channels
- 1x1 then 3x3 path — Bottleneck then spatial features at medium scale
- 1x1 then 5x5 path — Bottleneck then spatial features at large scale
- 3x3 max pool then 1x1 path — Pooling features with channel reduction
Inception Evolution
The Inception architecture evolved through several versions, each addressing limitations of the previous:
- Inception v2 — Replaced 5x5 convolutions with two stacked 3x3 convolutions, added batch normalization
- Inception v3 — Factorized n x n convolutions into 1 x n followed by n x 1, further reducing parameters
- Inception v4 — A deeper, more uniform redesign of the Inception modules and the network stem, still without residual connections
- Inception-ResNet — Hybrid architecture combining Inception modules with ResNet-style residual connections, achieving state-of-the-art results on ImageNet at the time
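The v3 factorization is easy to quantify. As a simplifying assumption, take C channels throughout (including between the two factored layers) and ignore biases:

```python
# Parameter count of an n x n conv vs. its 1 x n followed by n x 1
# factorization (Inception v3 style), assuming C channels throughout.

def factorization_ratio(n, C=256):
    full = n * n * C * C               # one n x n conv: n^2 * C^2 weights
    factored = 2 * n * C * C           # 1 x n plus n x 1: 2n * C^2 weights
    return factored / full             # simplifies to 2 / n

print(round(factorization_ratio(3), 3))  # 0.667 -> ~33% fewer parameters
print(round(factorization_ratio(7), 3))  # 0.286 -> ~71% fewer parameters
```

The ratio is 2/n, so the savings grow with filter size, which is why v3 applies this factorization most aggressively to its larger filters.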
VGG vs GoogLeNet: Two Philosophies
VGG and GoogLeNet represent two fundamental approaches to CNN design that continue to influence modern architectures:
- VGG philosophy: Simple, uniform building blocks stacked deep. Easy to understand, implement, and modify. More parameters but straightforward to scale.
- Inception philosophy: Complex, multi-branch modules with careful efficiency engineering. Fewer parameters, better computation/accuracy trade-off, but harder to design and modify.
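The parameter gap between the two philosophies can be made concrete by tallying VGG-16's weights directly from its layer specification (bias terms included). The total lands at the commonly quoted ~138 million, the bulk of it in the fully connected layers; GoogLeNet v1 is typically cited at only a few million parameters:

```python
# Parameter count of VGG-16, computed from its layer specification.
# 3x3 convs with padding=1 preserve spatial size; each 2x2 max pool halves it,
# so a 224x224 input reaches the classifier as a 7x7x512 feature map.

def conv_params(c_in, c_out, k=3):
    return k * k * c_in * c_out + c_out   # weights + biases

def fc_params(n_in, n_out):
    return n_in * n_out + n_out           # weights + biases

# (in_channels, out_channels) for each 3x3 conv layer, block by block
convs = [(3, 64), (64, 64),                    # block 1
         (64, 128), (128, 128),                # block 2
         (128, 256), (256, 256), (256, 256),   # block 3
         (256, 512), (512, 512), (512, 512),   # block 4
         (512, 512), (512, 512), (512, 512)]   # block 5

fcs = [(512 * 7 * 7, 4096), (4096, 4096), (4096, 1000)]

conv_total = sum(conv_params(a, b) for a, b in convs)
fc_total = sum(fc_params(a, b) for a, b in fcs)

print(f"conv layers: {conv_total:,}")   # ~14.7M
print(f"fc layers:   {fc_total:,}")     # ~123.6M
print(f"total:       {conv_total + fc_total:,}")  # 138,357,544
```

Note that roughly 90% of VGG-16's parameters sit in the three fully connected layers, which is exactly the kind of cost the Inception design avoids through global pooling and bottlenecks.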
The next lesson covers ResNet, which solved the degradation problem that prevented training of very deep networks and introduced skip connections that transformed all of deep learning.
Lilly Tech Systems