ResNet and Skip Connections
A comprehensive guide to ResNet and skip connections in the context of CNN architectures.
The Degradation Problem
Before ResNet, a counterintuitive phenomenon puzzled researchers: adding more layers to a network beyond a certain depth made it perform worse, even on the training set. This was not overfitting (which would show good training performance but poor test performance). Instead, deeper networks had higher training error than shallower ones. This degradation problem limited practical network depth to roughly 20-30 layers.
The problem occurred because deeper networks had difficulty learning identity mappings. Even if additional layers should theoretically not hurt (they could just learn to pass data through unchanged), in practice, the optimization landscape made this difficult. Gradients became increasingly noisy as they propagated through dozens of layers, preventing effective learning.
The Residual Learning Framework
ResNet (He et al., 2015) introduced a deceptively simple solution: skip connections (also called residual connections or shortcut connections). Instead of learning the desired mapping H(x) directly, the network learns the residual F(x) = H(x) - x. The output becomes F(x) + x, where x is the input passed through the skip connection.
```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Basic two-layer residual block: output = ReLU(F(x) + x)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = x
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += residual  # skip connection: add the input back
        out = F.relu(out)
        return out
```
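The optimization argument from above can be made concrete. If the residual branch F outputs zero, F(x) + x reduces exactly to the identity, so "do nothing" just means driving F's weights toward zero, which is far easier than making a stack of convolutions approximate the identity. A minimal sketch (a standalone convolution standing in for the residual branch):

```python
import torch
import torch.nn as nn

# A single conv stands in for the residual branch F.
# With its weights zeroed, F(x) = 0 and the block is exactly the identity.
conv = nn.Conv2d(64, 64, 3, padding=1, bias=False)
nn.init.zeros_(conv.weight)

x = torch.randn(2, 64, 8, 8)
out = conv(x) + x          # F(x) + x
assert torch.equal(out, x)  # the block passes x through unchanged
```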
Why Skip Connections Work
Skip connections provide several benefits that enable training very deep networks:
- Gradient highways — During backpropagation, gradients can flow directly through the skip connections, avoiding vanishing gradients across many layers
- Easier optimization — The loss landscape of residual networks has fewer local minima, making optimization more reliable
- Implicit ensembles — A ResNet can be viewed as an ensemble of many shallow networks of different depths, since skip connections create exponentially many paths through the network
- Feature reuse — Early features remain accessible to later layers through the residual stream
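The "gradient highway" effect is easy to observe directly. The toy experiment below (an illustrative sketch, not from the original ResNet paper) stacks 50 small linear layers with and without identity skips and compares the gradient that reaches the input:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
depth, dim = 50, 16
layers = [nn.Linear(dim, dim) for _ in range(depth)]

x0 = torch.randn(8, dim, requires_grad=True)

# Plain stack: the gradient must pass through every layer's Jacobian
x = x0
for layer in layers:
    x = torch.tanh(layer(x))
x.sum().backward()
plain_grad = x0.grad.norm().item()

# Residual stack: identity paths let the gradient bypass each layer
x0.grad = None
x = x0
for layer in layers:
    x = x + torch.tanh(layer(x))
x.sum().backward()
residual_grad = x0.grad.norm().item()

# The residual path preserves a much larger gradient at the input
print(plain_grad, residual_grad)
```

With the same weights, the plain stack's gradient shrinks multiplicatively with depth, while the skip connections keep the signal usable — the mechanism behind the "gradient highways" bullet above.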
Bottleneck Architecture
For deeper networks (ResNet-50, 101, 152), the standard two-layer residual block is replaced with a three-layer bottleneck block that uses 1x1 convolutions to reduce and restore dimensionality:
```python
class BottleneckBlock(nn.Module):
    expansion = 4

    def __init__(self, in_channels, mid_channels):
        super().__init__()
        out_channels = mid_channels * self.expansion
        # 1x1 conv reduces width, 3x3 operates on the narrow representation,
        # final 1x1 restores (expands) the channel count
        self.conv1 = nn.Conv2d(in_channels, mid_channels, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid_channels)
        self.conv2 = nn.Conv2d(mid_channels, mid_channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(mid_channels)
        self.conv3 = nn.Conv2d(mid_channels, out_channels, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels)
        # projection shortcut when the channel count changes
        self.shortcut = nn.Sequential()
        if in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, bias=False),
                nn.BatchNorm2d(out_channels),
            )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        out += self.shortcut(x)
        return F.relu(out)
```
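To see why the bottleneck design is cheaper, compare convolution parameter counts. A back-of-the-envelope sketch (the numbers assume 256 input channels and a 64-channel bottleneck, as in ResNet-50's first stage; biases and batch-norm parameters are ignored):

```python
# Plain block: two 3x3 convs at the full 256-channel width
plain = 2 * (256 * 256 * 3 * 3)

# Bottleneck: 1x1 reduce to 64, 3x3 at 64, 1x1 expand back to 256
bottleneck = (256 * 64 * 1 * 1) + (64 * 64 * 3 * 3) + (64 * 256 * 1 * 1)

print(plain, bottleneck)  # 1179648 vs 69632: roughly 17x fewer parameters
```

The 3x3 convolution, the expensive part, runs at a quarter of the width, which is what makes 100+ layer networks affordable.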
ResNet Variants and Impact
ResNet won the 2015 ImageNet (ILSVRC) classification competition, with an ensemble of deep residual networks (including ResNet-152) achieving 3.57% top-5 error, surpassing estimated human-level performance (~5%). The architecture has been extended in many ways:
- ResNeXt — Adds grouped convolutions within residual blocks, introducing cardinality (the number of parallel transformation paths) as a third dimension alongside depth and width
- Wide ResNet — Uses wider (more channels) but shallower networks, often faster to train
- Pre-activation ResNet — Moves batch norm and ReLU before the convolution for better gradient flow
- ResNeSt — Adds split-attention within residual blocks for channel-wise attention
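The pre-activation variant is simple enough to sketch. The block below (a minimal illustration in the spirit of He et al.'s follow-up work, not their exact implementation) reorders each unit to BN → ReLU → conv, leaving the identity path completely untouched:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreActBlock(nn.Module):
    """Pre-activation residual block: BN -> ReLU -> conv, twice."""
    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)

    def forward(self, x):
        out = self.conv1(F.relu(self.bn1(x)))
        out = self.conv2(F.relu(self.bn2(out)))
        return out + x  # identity path: no BN or ReLU touches x itself
```

Because nothing is applied after the addition, gradients flow through the identity path without any intervening nonlinearity, which is the improvement in gradient flow the bullet above refers to.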
The next lesson covers EfficientNet, which brought compound scaling and neural architecture search to optimize the trade-off between accuracy and efficiency.
Lilly Tech Systems