
Encoder-Decoder Architecture

A comprehensive guide to the encoder-decoder architecture, part of a deep dive into the Transformer architecture.

The Encoder-Decoder Framework

The original Transformer uses an encoder-decoder architecture designed for sequence-to-sequence tasks where the input and output sequences can have different lengths. Machine translation is the canonical example: translating an English sentence to French requires reading the entire English input (encoder) and generating the French output one token at a time (decoder).

While modern LLMs have moved toward decoder-only architectures, understanding the full encoder-decoder design is essential because it illustrates fundamental architectural concepts used across all transformer variants.

The Encoder

The encoder processes the input sequence in parallel and produces a sequence of continuous representations. It consists of N identical layers (the original paper used N=6), each with two sublayers:

import torch.nn as nn

# MultiHeadAttention is the attention module defined in an earlier lesson
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Sublayer 1: Self-attention with residual connection
        attn_output = self.self_attn(x, mask=mask)
        x = self.norm1(x + self.dropout(attn_output))
        # Sublayer 2: Feed-forward with residual connection
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x

Key Encoder Properties

  • Bidirectional attention — Each token can attend to all other tokens in the input, including tokens that come after it
  • Parallel processing — All tokens are processed simultaneously
  • Residual connections — Enable gradient flow through deep networks
  • Layer normalization — Stabilizes training by normalizing activations
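The full encoder simply stacks N of these layers. A minimal sketch using PyTorch's built-in nn.TransformerEncoderLayer as a stand-in for the custom EncoderLayer above (the hyperparameters follow the original paper; the input shapes are illustrative):

```python
import torch
import torch.nn as nn

# d_model=512, 8 heads, d_ff=2048, N=6 are the original paper's settings
d_model, num_heads, d_ff, num_layers = 512, 8, 2048, 6

# batch_first=True -> input shape (batch, seq_len, d_model)
layer = nn.TransformerEncoderLayer(
    d_model, num_heads, dim_feedforward=d_ff, batch_first=True
)
encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

x = torch.randn(2, 10, d_model)  # a batch of 2 sequences, 10 tokens each
out = encoder(x)                 # all 10 positions processed in parallel
print(out.shape)                 # torch.Size([2, 10, 512])
```

Note that the output has the same shape as the input: each layer refines the per-token representations without changing the sequence length.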
💡 Pre-norm vs post-norm: The original Transformer applies layer normalization after the residual connection (post-norm). Modern transformers typically apply it before (pre-norm), which improves training stability and allows training without learning rate warmup.
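The difference is a one-line reordering per sublayer: normalize first, then let the residual bypass the normalization. A pre-norm sketch of the encoder layer, using PyTorch's nn.MultiheadAttention as a stand-in for the custom attention module:

```python
import torch
import torch.nn as nn

class PreNormEncoderLayer(nn.Module):
    """Encoder layer with pre-norm: LayerNorm is applied *before* each
    sublayer, and the residual path bypasses the normalization."""
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(
            d_model, num_heads, dropout=dropout, batch_first=True
        )
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Pre-norm: normalize, run the sublayer, then add the raw residual
        h = self.norm1(x)
        attn_out, _ = self.self_attn(h, h, h)
        x = x + self.dropout(attn_out)
        h = self.norm2(x)
        x = x + self.dropout(self.feed_forward(h))
        return x
```

Compare with the post-norm EncoderLayer above, where the normalization wraps the sum `x + sublayer(x)` instead.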

The Decoder

The decoder generates the output sequence one token at a time. Each decoder layer has three sublayers:

  1. Masked self-attention — The decoder attends to previously generated tokens but not future tokens (causal masking)
  2. Cross-attention — The decoder attends to the encoder's output, allowing it to focus on relevant parts of the input
  3. Feed-forward network — Same structure as in the encoder

class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.masked_self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, encoder_output, src_mask=None, tgt_mask=None):
        # Sublayer 1: Masked self-attention over previously generated tokens
        attn1 = self.masked_self_attn(x, mask=tgt_mask)
        x = self.norm1(x + self.dropout(attn1))
        # Sublayer 2: Cross-attention (queries from the decoder,
        # keys and values from the encoder output)
        attn2 = self.cross_attn(x, encoder_output, mask=src_mask)
        x = self.norm2(x + self.dropout(attn2))
        # Sublayer 3: Feed-forward
        ff = self.feed_forward(x)
        x = self.norm3(x + self.dropout(ff))
        return x
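The tgt_mask above is the causal mask that prevents attention to future tokens. A minimal sketch of how such a mask can be built, under the convention (an assumption here) that True marks positions a token may attend to:

```python
import torch

def causal_mask(seq_len):
    # Lower-triangular: token i may attend to tokens 0..i only
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

mask = causal_mask(4)
print(mask)
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```

During training this lets the decoder process all target positions in parallel while still behaving as if it generated them one at a time.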

Three Transformer Paradigms

The original encoder-decoder spawned three paradigms, each suited to different tasks:

Encoder-Only (BERT)

Uses only the encoder stack with bidirectional attention. Ideal for understanding tasks: classification, named entity recognition, sentiment analysis, and embedding generation. Cannot generate text autoregressively.

Decoder-Only (GPT)

Uses only the decoder stack with causal masking. Ideal for generation tasks: text completion, chat, code generation. The dominant architecture for modern LLMs because it can handle both understanding and generation.

Encoder-Decoder (T5, BART)

Uses both stacks. Ideal for sequence-to-sequence tasks: translation, summarization, question answering. Provides the richest modeling capability but with higher computational cost.

Architecture selection: For most new projects, decoder-only (GPT-style) is the default choice because it handles the widest range of tasks. Use encoder-decoder only when you have a clear sequence-to-sequence task with distinct input and output formats, such as translation or structured output generation.

The Feed-Forward Network

Each transformer layer includes a position-wise feed-forward network (FFN) that processes each position independently. The original paper uses a two-layer FFN with ReLU activation and an expansion factor of 4 (d_ff = 4 * d_model). Modern variants use SwiGLU or GeGLU activations and scale the hidden dimension by 2/3 (d_ff ≈ 8/3 * d_model), which keeps the parameter count comparable to the ReLU FFN because gated variants have three weight matrices instead of two.
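A sketch of a SwiGLU FFN with this scaling (the class name and the exact rounding of the hidden width are illustrative choices, not a fixed standard):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        # 2/3 * 4 * d_model keeps parameters comparable to the 4x ReLU FFN,
        # since this FFN uses three weight matrices instead of two
        d_ff = int(2 / 3 * 4 * d_model)
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        # SwiGLU: elementwise product of a SiLU-gated branch
        # and a plain linear branch, then projected back to d_model
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```

The gate (SiLU branch) modulates the linear branch elementwise, which in practice yields better quality per parameter than the plain ReLU FFN.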

In the next lesson, we will survey the major transformer variants that have emerged since the original paper.