Encoder-Decoder Architecture
A guide to the encoder-decoder architecture, part of a deep dive into the transformer architecture.
The Encoder-Decoder Framework
The original Transformer uses an encoder-decoder architecture designed for sequence-to-sequence tasks where the input and output sequences can have different lengths. Machine translation is the canonical example: translating an English sentence to French requires reading the entire English input (encoder) and generating the French output one token at a time (decoder).
While modern LLMs have moved toward decoder-only architectures, understanding the full encoder-decoder design is essential because it illustrates fundamental architectural concepts used across all transformer variants.
The Encoder
The encoder processes the input sequence in parallel and produces a sequence of continuous representations. It consists of N identical layers (the original paper used N=6), each with two sublayers:
```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        # MultiHeadAttention is assumed to be defined in an earlier lesson
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Sublayer 1: self-attention with residual connection and layer norm
        attn_output = self.self_attn(x, mask=mask)
        x = self.norm1(x + self.dropout(attn_output))
        # Sublayer 2: feed-forward with residual connection and layer norm
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x
```
Key Encoder Properties
- Bidirectional attention — Each token can attend to all other tokens in the input, including tokens that come after it
- Parallel processing — All tokens are processed simultaneously
- Residual connections — Enable gradient flow through deep networks
- Layer normalization — Stabilizes training by normalizing activations
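PyTorch ships equivalent building blocks, which makes it easy to see the stacking of N identical layers in action. A minimal sketch using the built-in `nn.TransformerEncoderLayer`, configured to match the original paper (N=6, d_model=512, 8 heads, d_ff=2048):

```python
import torch
import torch.nn as nn

# Stack of N=6 identical layers, matching the original paper's configuration
d_model, num_heads, d_ff = 512, 8, 2048
layer = nn.TransformerEncoderLayer(d_model, num_heads, d_ff, dropout=0.1,
                                   batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)

x = torch.randn(2, 10, d_model)   # (batch, seq_len, d_model)
out = encoder(x)                   # all 10 positions processed in parallel
print(out.shape)                   # torch.Size([2, 10, 512])
```

Note that the output has the same shape as the input: each layer refines the representation of every position without changing the sequence length.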
The Decoder
The decoder generates the output sequence one token at a time. Each decoder layer has three sublayers:
- Masked self-attention — The decoder attends to previously generated tokens but not future tokens (causal masking)
- Cross-attention — The decoder attends to the encoder's output, allowing it to focus on relevant parts of the input
- Feed-forward network — Same structure as in the encoder
```python
class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.masked_self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, encoder_output, src_mask=None, tgt_mask=None):
        # Sublayer 1: masked self-attention over previously generated tokens
        attn1 = self.masked_self_attn(x, mask=tgt_mask)
        x = self.norm1(x + self.dropout(attn1))
        # Sublayer 2: cross-attention; queries come from the decoder,
        # keys and values from the encoder output
        attn2 = self.cross_attn(x, encoder_output, mask=src_mask)
        x = self.norm2(x + self.dropout(attn2))
        # Sublayer 3: position-wise feed-forward network
        ff = self.feed_forward(x)
        x = self.norm3(x + self.dropout(ff))
        return x
```
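The causal mask passed as `tgt_mask` is what prevents the decoder from attending to future tokens. It can be built with a lower-triangular matrix; a minimal sketch in plain PyTorch:

```python
import torch

def causal_mask(seq_len):
    # True where attention is allowed: position i may attend to positions <= i
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

mask = causal_mask(4)
print(mask.int())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0],
#         [1, 1, 1, 1]])
```

Row i shows what token i may attend to: the first token sees only itself, while the last token sees the entire prefix. The masked-out positions are set to negative infinity before the softmax, so they receive zero attention weight.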
Three Transformer Paradigms
The original encoder-decoder spawned three paradigms, each suited to different tasks:
Encoder-Only (BERT)
Uses only the encoder stack with bidirectional attention. Ideal for understanding tasks: classification, named entity recognition, sentiment analysis, and embedding generation. Cannot generate text autoregressively.
Decoder-Only (GPT)
Uses only the decoder stack with causal masking. Ideal for generation tasks: text completion, chat, code generation. The dominant architecture for modern LLMs because it can handle both understanding and generation.
Encoder-Decoder (T5, BART)
Uses both stacks. Ideal for sequence-to-sequence tasks: translation, summarization, question answering. Provides the richest modeling capability but with higher computational cost.
The Feed-Forward Network
Each transformer layer includes a position-wise feed-forward network (FFN) that processes each position independently. The original paper uses a two-layer FFN with ReLU activation and an expansion factor of 4 (d_ff = 4 * d_model). Modern variants use SwiGLU or GeGLU activations with the hidden width scaled to roughly 2/3 of that (d_ff ≈ 8/3 * d_model); since the gated variants add a third weight matrix, this scaling keeps the parameter count comparable to the classic FFN.
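The gated variant can be written compactly; a sketch of a SwiGLU FFN (class name and structure are illustrative, not from a specific library), with the hidden width scaled by 2/3 relative to the classic 4x expansion:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, d_model, expansion=4):
        super().__init__()
        # 2/3 of the classic 4x width: three weight matrices instead of two,
        # so total parameters roughly match the ReLU FFN
        d_ff = int(2 * expansion * d_model / 3)
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        # SwiGLU: down( SiLU(gate(x)) * up(x) )
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

ffn = SwiGLUFFN(512)
y = ffn(torch.randn(2, 10, 512))
print(y.shape)  # torch.Size([2, 10, 512])
```

The SiLU-activated gate multiplies the second projection elementwise, letting the network modulate which features pass through, which is the source of the better parameter efficiency noted above.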
In the next lesson, we will survey the major transformer variants that have emerged since the original paper.
Lilly Tech Systems