ResNet and Skip Connections
A comprehensive guide to ResNet and skip connections in the context of CNN architectures.
The Degradation Problem
Before ResNet, a counterintuitive phenomenon puzzled researchers: adding more layers to a network beyond a certain depth made it perform worse, even on the training set. This was not overfitting (which would show good training performance but poor test performance). Instead, deeper networks had higher training error than shallower ones. This degradation problem limited practical network depth to roughly 20-30 layers.
The problem occurred because deeper networks had difficulty learning identity mappings. Even if additional layers should theoretically not hurt (they could just learn to pass data through unchanged), in practice, the optimization landscape made this difficult. Gradients became increasingly noisy as they propagated through dozens of layers, preventing effective learning.
The Residual Learning Framework
ResNet (He et al., 2015) introduced a deceptively simple solution: skip connections (also called residual connections or shortcut connections). Instead of learning the desired mapping H(x) directly, the network learns the residual F(x) = H(x) - x. The output becomes F(x) + x, where x is the input passed through the skip connection.
```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Basic two-layer residual block: output = ReLU(F(x) + x)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = x
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += residual  # skip connection: add the input back
        out = F.relu(out)
        return out
```
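The optimization argument from above can be made concrete. If the residual branch F outputs zero, F(x) + x reduces exactly to the identity, so "do nothing" just means driving F's weights toward zero, which is far easier than making a stack of convolutions approximate the identity. A minimal sketch (a standalone convolution standing in for the residual branch):

```python
import torch
import torch.nn as nn

# A single conv stands in for the residual branch F.
# With its weights zeroed, F(x) = 0 and the block is exactly the identity.
conv = nn.Conv2d(64, 64, 3, padding=1, bias=False)
nn.init.zeros_(conv.weight)

x = torch.randn(2, 64, 8, 8)
out = conv(x) + x          # F(x) + x
assert torch.equal(out, x)  # the block passes x through unchanged
```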
Why Skip Connections Work
Skip connections provide several benefits that enable training very deep networks:
- Gradient highways — During backpropagation, gradients can flow directly through the skip connections, avoiding vanishing gradients across many layers
- Easier optimization — The loss landscape of residual networks has fewer local minima, making optimization more reliable
- Implicit ensembles — A ResNet can be viewed as an ensemble of many shallow networks of different depths, since skip connections create exponentially many paths through the network
- Feature reuse — Early features remain accessible to later layers through the residual stream
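The "gradient highway" effect is easy to observe directly. The toy experiment below (an illustrative sketch, not from the original ResNet paper) stacks 50 small linear layers with and without identity skips and compares the gradient that reaches the input:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
depth, dim = 50, 16
layers = [nn.Linear(dim, dim) for _ in range(depth)]

x0 = torch.randn(8, dim, requires_grad=True)

# Plain stack: the gradient must pass through every layer's Jacobian
x = x0
for layer in layers:
    x = torch.tanh(layer(x))
x.sum().backward()
plain_grad = x0.grad.norm().item()

# Residual stack: identity paths let the gradient bypass each layer
x0.grad = None
x = x0
for layer in layers:
    x = x + torch.tanh(layer(x))
x.sum().backward()
residual_grad = x0.grad.norm().item()

# The residual path preserves a much larger gradient at the input
print(plain_grad, residual_grad)
```

With the same weights, the plain stack's gradient shrinks multiplicatively with depth, while the skip connections keep the signal usable — the mechanism behind the "gradient highways" bullet above.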
Bottleneck Architecture
For deeper networks (ResNet-50, 101, 152), the standard two-layer residual block is replaced with a three-layer bottleneck block that uses 1x1 convolutions to reduce and restore dimensionality:
```python
class BottleneckBlock(nn.Module):
    expansion = 4

    def __init__(self, in_channels, mid_channels):
        super().__init__()
        out_channels = mid_channels * self.expansion
        # 1x1 conv reduces width, 3x3 operates on the narrow representation,
        # final 1x1 restores (expands) the channel count
        self.conv1 = nn.Conv2d(in_channels, mid_channels, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid_channels)
        self.conv2 = nn.Conv2d(mid_channels, mid_channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(mid_channels)
        self.conv3 = nn.Conv2d(mid_channels, out_channels, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels)
        # projection shortcut when the channel count changes
        self.shortcut = nn.Sequential()
        if in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, bias=False),
                nn.BatchNorm2d(out_channels),
            )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        out += self.shortcut(x)
        return F.relu(out)
```
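To see why the bottleneck design is cheaper, compare convolution parameter counts. A back-of-the-envelope sketch (the numbers assume 256 input channels and a 64-channel bottleneck, as in ResNet-50's first stage; biases and batch-norm parameters are ignored):

```python
# Plain block: two 3x3 convs at the full 256-channel width
plain = 2 * (256 * 256 * 3 * 3)

# Bottleneck: 1x1 reduce to 64, 3x3 at 64, 1x1 expand back to 256
bottleneck = (256 * 64 * 1 * 1) + (64 * 64 * 3 * 3) + (64 * 256 * 1 * 1)

print(plain, bottleneck)  # 1179648 vs 69632: roughly 17x fewer parameters
```

The 3x3 convolution, the expensive part, runs at a quarter of the width, which is what makes 100+ layer networks affordable.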
ResNet Variants and Impact
ResNet won the 2015 ImageNet (ILSVRC) classification competition, with an ensemble of deep residual networks (including ResNet-152) achieving 3.57% top-5 error, surpassing estimated human-level performance (~5%). The architecture has been extended in many ways:
- ResNeXt — Adds grouped convolutions within residual blocks, introducing cardinality (the number of parallel transformation paths) as a third dimension alongside depth and width
- Wide ResNet — Uses wider (more channels) but shallower networks, often faster to train
- Pre-activation ResNet — Moves batch norm and ReLU before the convolution for better gradient flow
- ResNeSt — Adds split-attention within residual blocks for channel-wise attention
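The pre-activation variant is simple enough to sketch. The block below (a minimal illustration in the spirit of He et al.'s follow-up work, not their exact implementation) reorders each unit to BN → ReLU → conv, leaving the identity path completely untouched:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreActBlock(nn.Module):
    """Pre-activation residual block: BN -> ReLU -> conv, twice."""
    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)

    def forward(self, x):
        out = self.conv1(F.relu(self.bn1(x)))
        out = self.conv2(F.relu(self.bn2(out)))
        return out + x  # identity path: no BN or ReLU touches x itself
```

Because nothing is applied after the addition, gradients flow through the identity path without any intervening nonlinearity, which is the improvement in gradient flow the bullet above refers to.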
The next lesson covers EfficientNet, which brought compound scaling and neural architecture search to optimize the trade-off between accuracy and efficiency.
Lilly Tech Systems