Modern CNN Innovations
A comprehensive guide to modern innovations in CNN architectures.
CNNs After Transformers
When Vision Transformers (ViT) demonstrated competitive performance on image classification in 2020, many predicted the end of CNNs. However, researchers revisited CNN design principles and showed that with modern training techniques and architectural refinements, pure CNNs can match or exceed transformer performance at comparable scales. This lesson covers the most important modern CNN innovations.
ConvNeXt: A Modernized CNN
ConvNeXt (Liu et al., 2022) systematically modernized the standard ResNet by incorporating design principles borrowed from transformers. Starting from ResNet-50, they made incremental changes, each improving performance, until the resulting architecture matched Swin Transformer's accuracy.
Key ConvNeXt Design Changes
- Macro design — Changed the stage ratio from (3,4,6,3) to (3,3,9,3), matching Swin Transformer
- Patchify stem — Replaced the 7x7 conv + maxpool with a 4x4 stride-4 convolution (like ViT's patch embedding)
- Depthwise convolution — Replaced 3x3 convolutions with 7x7 depthwise convolutions (like attention's spatial mixing)
- Inverted bottleneck — Used a 4x hidden-dimension expansion between two 1x1 (pointwise) layers, with the 7x7 depthwise convolution moved before the expansion
- GELU activation — Replaced ReLU with GELU, matching transformer convention
- Layer normalization — Replaced batch normalization with layer normalization
- Fewer activation functions — Applied activation only once per block, like transformers
```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    def __init__(self, dim, drop_path=0.0):
        super().__init__()
        # 7x7 depthwise conv: spatial mixing with one filter per channel
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)           # applied in channels-last layout
        self.pwconv1 = nn.Linear(dim, 4 * dim)  # 1x1 expand (Linear on NHWC)
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)  # 1x1 shrink
        self.gamma = nn.Parameter(1e-6 * torch.ones(dim))  # layer scale
        # Stochastic depth (drop_path) omitted here for brevity
        self.drop_path = nn.Identity()

    def forward(self, x):
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)  # NCHW -> NHWC for LayerNorm/Linear
        x = self.norm(x)
        x = self.pwconv1(x)
        x = self.act(x)
        x = self.pwconv2(x)
        x = self.gamma * x
        x = x.permute(0, 3, 1, 2)  # NHWC -> NCHW
        return residual + self.drop_path(x)
```
RepVGG: Reparameterized Convolutions
RepVGG introduces a training-time multi-branch architecture (with skip connections and 1x1 convolutions) that is reparameterized into a simple stack of 3x3 convolutions at inference time. This gives you the training benefits of complex architectures with the inference speed of simple convolutions.
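The fusion works because convolution is linear: a 1x1 convolution is a 3x3 convolution with zeros everywhere except the center tap, and the identity branch is a 3x3 convolution whose center tap is an identity matrix over channels. The sketch below demonstrates the idea on bare convolutions (real RepVGG also folds each branch's batch norm into its conv weights before merging); the branch and variable names are illustrative, not RepVGG's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
C = 4
conv3 = nn.Conv2d(C, C, 3, padding=1)  # 3x3 branch
conv1 = nn.Conv2d(C, C, 1)             # 1x1 branch

def multi_branch(x):
    # Training-time block: 3x3 + 1x1 + identity skip
    return conv3(x) + conv1(x) + x

# Reparameterize all three branches into one 3x3 conv for inference
fused = nn.Conv2d(C, C, 3, padding=1)
w = conv3.weight.data.clone()
w += F.pad(conv1.weight.data, [1, 1, 1, 1])  # place 1x1 at center of a 3x3
ident = torch.zeros_like(w)
for i in range(C):
    ident[i, i, 1, 1] = 1.0                  # identity as a 3x3 kernel
w += ident
fused.weight.data = w
fused.bias.data = conv3.bias.data + conv1.bias.data

x = torch.randn(1, C, 8, 8)
match = torch.allclose(multi_branch(x), fused(x), atol=1e-5)
```

The fused convolution produces numerically identical outputs to the three-branch block, so inference pays for only a single 3x3 conv per layer.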
Deformable Convolutions
Standard convolutions sample from a fixed grid. Deformable convolutions learn offsets that allow the sampling grid to deform, enabling the network to adapt its receptive field to the shape of objects. This is particularly useful for object detection where objects have varying shapes and sizes.
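As a rough sketch of the mechanism, the module below (a simplified stand-in, not the official implementation — production code would use `torchvision.ops.deform_conv2d`) predicts a 2D offset per kernel tap with a small conv, bilinearly samples the input at the shifted locations via `grid_sample`, and mixes the gathered samples with a 1x1 conv. Offsets are zero-initialized so it starts out behaving like a regular 3x3 conv:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformConv2d(nn.Module):
    """Simplified 3x3 deformable convolution (stride 1, same padding)."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.k = k
        # Predicts a (dy, dx) offset for each of the k*k kernel taps
        self.offset_conv = nn.Conv2d(in_ch, 2 * k * k, k, padding=k // 2)
        nn.init.zeros_(self.offset_conv.weight)  # start as a regular conv
        nn.init.zeros_(self.offset_conv.bias)
        # 1x1 conv mixes the k*k bilinearly-sampled values per position
        self.weight_conv = nn.Conv2d(in_ch * k * k, out_ch, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        k = self.k
        offsets = self.offset_conv(x).view(n, k * k, 2, h, w)
        # Base sampling grid in pixel coordinates
        ys, xs = torch.meshgrid(torch.arange(h, dtype=x.dtype),
                                torch.arange(w, dtype=x.dtype), indexing="ij")
        base = torch.stack([ys, xs])  # (2, H, W)
        samples, idx = [], 0
        for dy in range(-(k // 2), k // 2 + 1):
            for dx in range(-(k // 2), k // 2 + 1):
                # Fixed kernel offset plus the learned offset for this tap
                loc = base + torch.tensor([dy, dx], dtype=x.dtype).view(2, 1, 1)
                loc = loc.unsqueeze(0) + offsets[:, idx]  # (N, 2, H, W)
                # Normalize to [-1, 1]; grid_sample expects (x, y) order
                gx = 2 * loc[:, 1] / (w - 1) - 1
                gy = 2 * loc[:, 0] / (h - 1) - 1
                grid = torch.stack([gx, gy], dim=-1)  # (N, H, W, 2)
                samples.append(F.grid_sample(x, grid, align_corners=True))
                idx += 1
        gathered = torch.cat(samples, dim=1)  # (N, C*k*k, H, W)
        return self.weight_conv(gathered)

layer = SimpleDeformConv2d(3, 8)
out = layer(torch.randn(2, 3, 16, 16))  # shape (2, 8, 16, 16)
```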
Neural Architecture Search (NAS)
Instead of hand-designing CNN architectures, NAS uses automated search to find strong architectures. NASNet, MnasNet, and EfficientNet all used NAS to discover their base architectures. Early NAS consumed thousands of GPU-days; modern approaches based on weight sharing and differentiable search cut this to a few GPU-days or less. Every NAS method has three components:
- Search space — The set of possible architectures (kernel sizes, channel counts, block types, connections)
- Search strategy — How to explore the space efficiently (reinforcement learning, evolutionary algorithms, differentiable search)
- Performance estimation — How to evaluate candidates quickly (weight sharing, proxy tasks, early stopping)
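The three components above can be made concrete with a toy example. Here the search strategy is plain random search, and performance estimation is a made-up analytic proxy (real NAS would use weight sharing or short training runs); the space and scoring rule are illustrative inventions, not from any published method:

```python
import random

random.seed(0)

# Search space: the set of possible architecture choices
SEARCH_SPACE = {
    "kernel_size": [3, 5, 7],
    "width_mult": [0.5, 1.0, 2.0],
    "block": ["residual", "inverted_bottleneck"],
}

def sample_architecture():
    """Search strategy: random search, the simplest possible explorer."""
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def estimate_performance(arch):
    """Performance estimation: a cheap (fictional) proxy score instead of
    training each candidate to convergence."""
    score = arch["width_mult"] * 10 - arch["kernel_size"] * 0.5
    if arch["block"] == "inverted_bottleneck":
        score += 1.0
    return score

# Evaluate 20 random candidates and keep the best-scoring one
best = max((sample_architecture() for _ in range(20)), key=estimate_performance)
```

Real systems differ mainly in how they replace these two functions: reinforcement learning or evolution for `sample_architecture`, and supernet weight sharing or early stopping for `estimate_performance`.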
The final lesson provides a practical framework for choosing the right CNN architecture for your specific use case.
Lilly Tech Systems