Positional Encoding
A comprehensive guide to positional encoding in the transformer architecture.
The Position Problem
Self-attention is inherently position-agnostic. The attention computation between two tokens produces the same result regardless of where they appear in the sequence. Without positional information, a transformer would treat "The dog bit the man" and "The man bit the dog" identically, since both contain the same set of tokens.
Positional encoding solves this by injecting information about each token's position in the sequence into the model. The original Transformer paper proposed sinusoidal positional encodings, but modern models use several different approaches, each with distinct trade-offs.
Sinusoidal Positional Encoding
The original approach uses sine and cosine functions of different frequencies to create a unique encoding for each position:
import torch
import math

def sinusoidal_encoding(max_len, d_model):
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len).unsqueeze(1).float()
    div_term = torch.exp(
        torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)
    )
    pe[:, 0::2] = torch.sin(position * div_term)  # Even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # Odd dimensions
    return pe

# Each position gets a unique d_model-dimensional vector
pe = sinusoidal_encoding(max_len=512, d_model=512)
Properties of Sinusoidal Encoding
- Deterministic — No learned parameters, works out of the box
- Extrapolation — Can theoretically handle sequences longer than those seen during training
- Unique per position — No two positions have the same encoding vector
- Bounded magnitude — Values are always between -1 and 1
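The last two properties can be checked directly. The snippet below re-defines `sinusoidal_encoding` from above so it is self-contained (the smaller sizes are illustrative):

```python
import math
import torch

# Re-definition of sinusoidal_encoding from above, for a self-contained check
def sinusoidal_encoding(max_len, d_model):
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len).unsqueeze(1).float()
    div_term = torch.exp(
        torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)
    )
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe = sinusoidal_encoding(max_len=128, d_model=64)

# Bounded magnitude: sin/cos keep every entry in [-1, 1]
max_abs = pe.abs().max().item()

# Unique per position: the smallest pairwise distance between rows is nonzero
dists = torch.cdist(pe, pe) + torch.eye(128)  # mask out the zero diagonal
min_dist = dists.min().item()
```

Both checks pass: every value stays within [-1, 1], and no two position vectors coincide.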
Learned Positional Embeddings
BERT and GPT-2 use learned positional embeddings — a lookup table where each position has a trainable vector. This is simpler to implement and often works slightly better than sinusoidal encoding for fixed-length sequences.
import torch
import torch.nn as nn

class LearnedPositionalEncoding(nn.Module):
    def __init__(self, max_len, d_model):
        super().__init__()
        self.pe = nn.Embedding(max_len, d_model)

    def forward(self, x):
        # x shape: (batch, seq_len, d_model)
        positions = torch.arange(x.size(1), device=x.device)
        return x + self.pe(positions)
The main limitation is that the model cannot handle sequences longer than max_len. If trained with max_len=512, it has positional embeddings only for indices 0 through 511; there is no embedding for the 513th token (index 512).
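This failure mode is easy to reproduce with a bare nn.Embedding (a sketch; the sizes are illustrative):

```python
import torch
import torch.nn as nn

pe = nn.Embedding(512, 64)     # trained with max_len = 512: indices 0..511
_ = pe(torch.arange(512))      # all in-range positions work fine

try:
    pe(torch.tensor([512]))    # the 513th position has no row in the table
    out_of_range_failed = False
except IndexError:
    out_of_range_failed = True
```

Sinusoidal encoding, by contrast, can simply be evaluated at any position, which is one reason the original Transformer authors preferred it.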
Rotary Positional Encoding (RoPE)
RoPE, used in LLaMA, Mistral, and most modern LLMs, encodes position by rotating the query and key vectors in 2D subspaces. It naturally encodes relative positions (the attention between positions i and j depends only on the difference i-j) and has better length extrapolation properties than learned embeddings.
def apply_rope(x, freqs):
    """Apply Rotary Positional Encoding."""
    # x shape: (batch, seq_len, num_heads, head_dim)
    # Split head_dim into consecutive pairs for rotation
    x_pairs = x.reshape(*x.shape[:-1], -1, 2)
    x_complex = torch.view_as_complex(x_pairs.float())
    # Multiply by the complex rotation factors (one per position and pair)
    x_rotated = x_complex * freqs
    x_out = torch.view_as_real(x_rotated).reshape_as(x)
    return x_out.type_as(x)
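The freqs argument above is a table of unit-magnitude complex rotation factors, one per (position, dimension pair). A minimal way to build it (a sketch; the name rope_freqs is illustrative, and base=10000 mirrors the sinusoidal formulation):

```python
import torch

def rope_freqs(seq_len, head_dim, base=10000.0):
    # One rotation rate theta_k per 2D pair of head dimensions
    theta = base ** (-torch.arange(0, head_dim, 2).float() / head_dim)
    # Angle at position m for pair k is m * theta_k
    angles = torch.outer(torch.arange(seq_len).float(), theta)
    # Unit-magnitude complex numbers e^{i * m * theta_k}
    return torch.polar(torch.ones_like(angles), angles)  # (seq_len, head_dim // 2)

freqs = rope_freqs(seq_len=16, head_dim=64)
# To broadcast against x_complex of shape (batch, seq_len, num_heads, head_dim // 2),
# insert a heads axis: freqs.unsqueeze(1) has shape (seq_len, 1, head_dim // 2)
```

Because position 0 gets angle 0 (a rotation factor of exactly 1), queries and keys at the start of the sequence pass through unchanged.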
RoPE Advantages
- Relative position encoding — Attention scores naturally depend on relative distance
- Better extrapolation — Can extend to longer sequences with techniques like NTK-aware scaling or YaRN
- No additional parameters — Uses mathematical transformations, not learned weights
- Decays with distance — Attention between distant tokens naturally decreases, matching linguistic intuition
ALiBi: Attention with Linear Biases
ALiBi (Press et al., 2022) takes a completely different approach: instead of adding positional information to the input embeddings, it adds a linear bias to the attention scores based on the distance between tokens. Each head uses a different slope, allowing different heads to focus on different ranges.
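A sketch of the bias matrix (following the head-slope recipe from Press et al.; the closed-form slopes below assume num_heads is a power of two, and the function name is illustrative):

```python
import torch

def alibi_bias(num_heads, seq_len):
    # Head h gets slope 2^(-8(h+1)/num_heads): a geometric sequence
    # from 2^(-8/num_heads) down to 2^(-8)
    slopes = torch.tensor(
        [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]
    )
    pos = torch.arange(seq_len)
    rel = (pos[None, :] - pos[:, None]).float()  # rel[i, j] = j - i, <= 0 for past keys
    # Shape (num_heads, seq_len, seq_len); add to attention logits before softmax
    return slopes[:, None, None] * rel[None, :, :]

bias = alibi_bias(num_heads=8, seq_len=4)
```

With a causal mask only the lower triangle matters: the bias is 0 on the diagonal and grows more negative with distance, so each head down-weights distant keys at its own rate.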
Comparing Approaches
- Sinusoidal — Simple, no parameters, decent extrapolation. Used in original Transformer.
- Learned — Simple to implement, slightly better in-distribution. Used in BERT, GPT-2.
- RoPE — Best balance of quality and extrapolation. Used in LLaMA, Mistral, most modern LLMs.
- ALiBi — Excellent extrapolation, simpler than RoPE. Used in BLOOM, MPT.
In the next lesson, we will explore the full encoder-decoder architecture and how these components work together.