Modern CNN Innovations
A comprehensive guide to modern innovations in CNN architectures.
CNNs After Transformers
When Vision Transformers (ViT) demonstrated competitive performance on image classification in 2020, many predicted the end of CNNs. However, researchers revisited CNN design principles and showed that with modern training techniques and architectural refinements, pure CNNs can match or exceed transformer performance at comparable scales. This lesson covers the most important modern CNN innovations.
ConvNeXt: A Modernized CNN
ConvNeXt (Liu et al., 2022) systematically modernized the standard ResNet by incorporating design principles borrowed from transformers. Starting from ResNet-50, they made incremental changes, each improving performance, until the resulting architecture matched Swin Transformer's accuracy.
Key ConvNeXt Design Changes
- Macro design — Changed the stage ratio from (3,4,6,3) to (3,3,9,3), matching Swin Transformer
- Patchify stem — Replaced the 7x7 conv + maxpool with a 4x4 stride-4 convolution (like ViT's patch embedding)
- Depthwise convolution — Replaced 3x3 convolutions with 7x7 depthwise convolutions (like attention's spatial mixing)
- Inverted bottleneck — Used a 4x hidden-dimension expansion between two 1x1 (pointwise) layers, with the 7x7 depthwise convolution moved before the expansion
- GELU activation — Replaced ReLU with GELU, matching transformer convention
- Layer normalization — Replaced batch normalization with layer normalization
- Fewer activation functions — Applied activation only once per block, like transformers
```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    def __init__(self, dim, drop_path=0.0):
        super().__init__()
        # 7x7 depthwise conv: spatial mixing with one filter per channel
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)           # applied in channels-last layout
        self.pwconv1 = nn.Linear(dim, 4 * dim)  # 1x1 expand (Linear on NHWC)
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)  # 1x1 shrink
        self.gamma = nn.Parameter(1e-6 * torch.ones(dim))  # layer scale
        # Stochastic depth (drop_path) omitted here for brevity
        self.drop_path = nn.Identity()

    def forward(self, x):
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)  # NCHW -> NHWC for LayerNorm/Linear
        x = self.norm(x)
        x = self.pwconv1(x)
        x = self.act(x)
        x = self.pwconv2(x)
        x = self.gamma * x
        x = x.permute(0, 3, 1, 2)  # NHWC -> NCHW
        return residual + self.drop_path(x)
```
RepVGG: Reparameterized Convolutions
RepVGG introduces a training-time multi-branch architecture (with skip connections and 1x1 convolutions) that is reparameterized into a simple stack of 3x3 convolutions at inference time. This gives you the training benefits of complex architectures with the inference speed of simple convolutions.
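The fusion works because convolution is linear: a 1x1 convolution is a 3x3 convolution with zeros everywhere except the center tap, and the identity branch is a 3x3 convolution whose center tap is an identity matrix over channels. The sketch below demonstrates the idea on bare convolutions (real RepVGG also folds each branch's batch norm into its conv weights before merging); the branch and variable names are illustrative, not RepVGG's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
C = 4
conv3 = nn.Conv2d(C, C, 3, padding=1)  # 3x3 branch
conv1 = nn.Conv2d(C, C, 1)             # 1x1 branch

def multi_branch(x):
    # Training-time block: 3x3 + 1x1 + identity skip
    return conv3(x) + conv1(x) + x

# Reparameterize all three branches into one 3x3 conv for inference
fused = nn.Conv2d(C, C, 3, padding=1)
w = conv3.weight.data.clone()
w += F.pad(conv1.weight.data, [1, 1, 1, 1])  # place 1x1 at center of a 3x3
ident = torch.zeros_like(w)
for i in range(C):
    ident[i, i, 1, 1] = 1.0                  # identity as a 3x3 kernel
w += ident
fused.weight.data = w
fused.bias.data = conv3.bias.data + conv1.bias.data

x = torch.randn(1, C, 8, 8)
match = torch.allclose(multi_branch(x), fused(x), atol=1e-5)
```

The fused convolution produces numerically identical outputs to the three-branch block, so inference pays for only a single 3x3 conv per layer.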
Deformable Convolutions
Standard convolutions sample from a fixed grid. Deformable convolutions learn offsets that allow the sampling grid to deform, enabling the network to adapt its receptive field to the shape of objects. This is particularly useful for object detection where objects have varying shapes and sizes.
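As a rough sketch of the mechanism, the module below (a simplified stand-in, not the official implementation — production code would use `torchvision.ops.deform_conv2d`) predicts a 2D offset per kernel tap with a small conv, bilinearly samples the input at the shifted locations via `grid_sample`, and mixes the gathered samples with a 1x1 conv. Offsets are zero-initialized so it starts out behaving like a regular 3x3 conv:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformConv2d(nn.Module):
    """Simplified 3x3 deformable convolution (stride 1, same padding)."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.k = k
        # Predicts a (dy, dx) offset for each of the k*k kernel taps
        self.offset_conv = nn.Conv2d(in_ch, 2 * k * k, k, padding=k // 2)
        nn.init.zeros_(self.offset_conv.weight)  # start as a regular conv
        nn.init.zeros_(self.offset_conv.bias)
        # 1x1 conv mixes the k*k bilinearly-sampled values per position
        self.weight_conv = nn.Conv2d(in_ch * k * k, out_ch, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        k = self.k
        offsets = self.offset_conv(x).view(n, k * k, 2, h, w)
        # Base sampling grid in pixel coordinates
        ys, xs = torch.meshgrid(torch.arange(h, dtype=x.dtype),
                                torch.arange(w, dtype=x.dtype), indexing="ij")
        base = torch.stack([ys, xs])  # (2, H, W)
        samples, idx = [], 0
        for dy in range(-(k // 2), k // 2 + 1):
            for dx in range(-(k // 2), k // 2 + 1):
                # Fixed kernel offset plus the learned offset for this tap
                loc = base + torch.tensor([dy, dx], dtype=x.dtype).view(2, 1, 1)
                loc = loc.unsqueeze(0) + offsets[:, idx]  # (N, 2, H, W)
                # Normalize to [-1, 1]; grid_sample expects (x, y) order
                gx = 2 * loc[:, 1] / (w - 1) - 1
                gy = 2 * loc[:, 0] / (h - 1) - 1
                grid = torch.stack([gx, gy], dim=-1)  # (N, H, W, 2)
                samples.append(F.grid_sample(x, grid, align_corners=True))
                idx += 1
        gathered = torch.cat(samples, dim=1)  # (N, C*k*k, H, W)
        return self.weight_conv(gathered)

layer = SimpleDeformConv2d(3, 8)
out = layer(torch.randn(2, 3, 16, 16))  # shape (2, 8, 16, 16)
```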
Neural Architecture Search (NAS)
Instead of hand-designing CNN architectures, NAS uses automated search to find strong architectures. NASNet, MnasNet, and EfficientNet all used NAS to discover their base architectures. Early NAS consumed thousands of GPU-days; modern approaches based on weight sharing and differentiable search cut this to a few GPU-days or less. Every NAS method has three components:
- Search space — The set of possible architectures (kernel sizes, channel counts, block types, connections)
- Search strategy — How to explore the space efficiently (reinforcement learning, evolutionary algorithms, differentiable search)
- Performance estimation — How to evaluate candidates quickly (weight sharing, proxy tasks, early stopping)
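The three components above can be made concrete with a toy example. Here the search strategy is plain random search, and performance estimation is a made-up analytic proxy (real NAS would use weight sharing or short training runs); the space and scoring rule are illustrative inventions, not from any published method:

```python
import random

random.seed(0)

# Search space: the set of possible architecture choices
SEARCH_SPACE = {
    "kernel_size": [3, 5, 7],
    "width_mult": [0.5, 1.0, 2.0],
    "block": ["residual", "inverted_bottleneck"],
}

def sample_architecture():
    """Search strategy: random search, the simplest possible explorer."""
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def estimate_performance(arch):
    """Performance estimation: a cheap (fictional) proxy score instead of
    training each candidate to convergence."""
    score = arch["width_mult"] * 10 - arch["kernel_size"] * 0.5
    if arch["block"] == "inverted_bottleneck":
        score += 1.0
    return score

# Evaluate 20 random candidates and keep the best-scoring one
best = max((sample_architecture() for _ in range(20)), key=estimate_performance)
```

Real systems differ mainly in how they replace these two functions: reinforcement learning or evolution for `sample_architecture`, and supernet weight sharing or early stopping for `estimate_performance`.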
The final lesson provides a practical framework for choosing the right CNN architecture for your specific use case.
Lilly Tech Systems