Positional Encoding

A comprehensive guide to positional encoding in the transformer architecture.

The Position Problem

Self-attention is inherently position-agnostic. The attention computation between two tokens produces the same result regardless of where they appear in the sequence. Without positional information, a transformer would treat "The dog bit the man" and "The man bit the dog" identically, since both contain the same set of tokens.
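
You can verify this directly: with no positional signal, permuting the input tokens merely permutes the attention output. The sketch below uses a toy dot-product attention with identity query/key/value projections (a deliberate simplification) to make the point:

import torch

torch.manual_seed(0)
x = torch.randn(1, 5, 16)  # (batch, seq_len, d_model)
perm = torch.randperm(5)

def toy_attention(x):
    scores = x @ x.transpose(-2, -1) / x.size(-1) ** 0.5
    return torch.softmax(scores, dim=-1) @ x

out = toy_attention(x)
out_permuted = toy_attention(x[:, perm])
print(torch.allclose(out[:, perm], out_permuted, atol=1e-6))  # True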

Positional encoding solves this by injecting information about each token's position in the sequence into the model. The original Transformer paper proposed sinusoidal positional encodings, but modern models use several different approaches, each with distinct trade-offs.

Sinusoidal Positional Encoding

The original approach uses sine and cosine functions of different frequencies to create a unique encoding for each position:

import torch
import math

def sinusoidal_encoding(max_len, d_model):
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len).unsqueeze(1).float()
    div_term = torch.exp(
        torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)
    )
    pe[:, 0::2] = torch.sin(position * div_term)  # Even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # Odd dimensions
    return pe

# Each position gets a unique d_model-dimensional vector
pe = sinusoidal_encoding(max_len=512, d_model=512)
Why sine and cosine? The sinusoidal encoding has a useful property: for any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos). This means the model can easily learn to attend to relative positions.
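
Concretely, each (sin, cos) pair of PE(pos+k) is the corresponding pair of PE(pos) rotated by a fixed, position-independent angle. A quick numeric check, reusing sinusoidal_encoding from above (variable names are ours):

pe = sinusoidal_encoding(max_len=512, d_model=512)
pos, k, d_model = 10, 7, 512

# The same per-pair frequencies used inside sinusoidal_encoding
inv_freq = torch.exp(
    torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)
)
angles = inv_freq * k  # rotation angle of each pair for offset k

# Rotate each (sin, cos) pair of PE(pos) by the offset angle
sin_p, cos_p = pe[pos, 0::2], pe[pos, 1::2]
sin_rot = sin_p * torch.cos(angles) + cos_p * torch.sin(angles)
cos_rot = cos_p * torch.cos(angles) - sin_p * torch.sin(angles)

print(torch.allclose(sin_rot, pe[pos + k, 0::2], atol=1e-5))  # True
print(torch.allclose(cos_rot, pe[pos + k, 1::2], atol=1e-5))  # True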

Properties of Sinusoidal Encoding

  • Deterministic — No learned parameters, works out of the box
  • Extrapolation — Can theoretically handle sequences longer than those seen during training
  • Unique per position — No two positions have the same encoding vector
  • Bounded magnitude — Values are always between -1 and 1

Learned Positional Embeddings

BERT and GPT-2 use learned positional embeddings — a lookup table where each position has a trainable vector. This is simpler to implement and often works slightly better than sinusoidal encoding for fixed-length sequences.

import torch
import torch.nn as nn

class LearnedPositionalEncoding(nn.Module):
    def __init__(self, max_len, d_model):
        super().__init__()
        self.pe = nn.Embedding(max_len, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model); the lookup broadcasts over the batch
        positions = torch.arange(x.size(1), device=x.device)
        return x + self.pe(positions)

The main limitation is that the model cannot handle sequences longer than max_len. If trained with max_len=512, it has no positional embedding for position 513.
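
A quick illustration of the failure mode (toy shapes, names ours):

enc = LearnedPositionalEncoding(max_len=512, d_model=64)
enc(torch.randn(1, 512, 64))  # fine: positions 0..511 all have rows
enc(torch.randn(1, 513, 64))  # IndexError: no embedding row for position 512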

Rotary Positional Encoding (RoPE)

RoPE, used in LLaMA, Mistral, and most modern LLMs, encodes position by rotating the query and key vectors in 2D subspaces. It naturally encodes relative positions (the attention between positions i and j depends only on the difference i-j) and has better length extrapolation properties than learned embeddings.

def apply_rope(x, freqs):
    """Apply Rotary Positional Encoding.

    x:     (batch, seq_len, num_heads, head_dim)
    freqs: complex rotations, shape (seq_len, head_dim // 2)
    """
    # Split head_dim into pairs and view each pair as one complex number
    x_pairs = x.reshape(*x.shape[:-1], -1, 2)
    x_complex = torch.view_as_complex(x_pairs.float())
    # Reshape freqs so it broadcasts over the batch and head dimensions
    freqs = freqs.view(1, x.shape[1], 1, -1)
    # Complex multiplication rotates each pair by its position-dependent angle
    x_rotated = x_complex * freqs
    x_out = torch.view_as_real(x_rotated).reshape_as(x)
    return x_out.type_as(x)
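
apply_rope expects the rotation frequencies to be precomputed as unit complex numbers. A minimal sketch of that step, modeled on the approach in LLaMA's reference code (the function name and the assumed (seq_len, head_dim // 2) shape are ours):

def precompute_freqs(seq_len, head_dim, base=10000.0):
    # One frequency per dimension pair, same schedule as sinusoidal encoding
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    # Angle for every (position, frequency) combination
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)
    # Unit complex numbers e^(i * angle), shape (seq_len, head_dim // 2)
    return torch.polar(torch.ones_like(angles), angles)

# Rotate queries (and, identically, keys) before computing attention scores
q = torch.randn(2, 128, 8, 64)  # (batch, seq_len, num_heads, head_dim)
q_rotated = apply_rope(q, precompute_freqs(seq_len=128, head_dim=64))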

RoPE Advantages

  • Relative position encoding — Attention scores naturally depend on relative distance
  • Better extrapolation — Can extend to longer sequences with techniques like NTK-aware scaling or YaRN
  • No additional parameters — Uses mathematical transformations, not learned weights
  • Decays with distance — Attention between distant tokens naturally decreases, matching linguistic intuition

ALiBi: Attention with Linear Biases

ALiBi (Press et al., 2022) takes a completely different approach: instead of adding positional information to the input embeddings, it adds a linear bias to the attention scores based on the distance between tokens. Each head uses a different slope, allowing different heads to focus on different ranges.
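
A minimal sketch of the bias computation (our own illustrative code, using a symmetric distance; the paper's causal version penalizes only past positions):

def alibi_bias(num_heads, seq_len):
    # Geometric sequence of per-head slopes from the paper,
    # exact when num_heads is a power of two
    slopes = torch.tensor(
        [2 ** (-8.0 * (i + 1) / num_heads) for i in range(num_heads)]
    )
    pos = torch.arange(seq_len)
    distance = (pos[None, :] - pos[:, None]).abs()  # |i - j|
    # Shape (num_heads, seq_len, seq_len); larger distance, larger penalty
    return -slopes[:, None, None] * distance

# Added to the raw attention logits before the softmax:
# scores = scores + alibi_bias(num_heads, seq_len)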

Length extrapolation matters: Many production applications need to handle documents much longer than the training data. Choosing a positional encoding that extrapolates well (RoPE with scaling, ALiBi) is a critical architecture decision for long-context applications.
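
As one concrete example, the simplest "NTK-aware" trick stretches RoPE's wavelengths by enlarging the frequency base, spreading the rotations the model learned over a longer context window. A sketch using the commonly cited scaling formula (reusing precompute_freqs from above; the numbers are illustrative):

def ntk_scaled_base(base, scale, head_dim):
    # Enlarge the base so rotation wavelengths grow roughly by `scale`
    return base * scale ** (head_dim / (head_dim - 2))

# Cover 4x the trained context, e.g. 4k -> 16k positions
freqs = precompute_freqs(
    seq_len=16384, head_dim=64, base=ntk_scaled_base(10000.0, 4.0, 64)
)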

Comparing Approaches

  1. Sinusoidal — Simple, no parameters; extrapolates in principle, though quality degrades well beyond training lengths. Used in the original Transformer.
  2. Learned — Simple to implement, slightly better in-distribution. Used in BERT, GPT-2.
  3. RoPE — Best balance of quality and extrapolation. Used in LLaMA, Mistral, most modern LLMs.
  4. ALiBi — Excellent extrapolation, simpler than RoPE. Used in BLOOM, MPT.

In the next lesson, we will explore the full encoder-decoder architecture and how these components work together.