Cross-Entropy Loss (Advanced)

Cross-entropy loss is the most widely used loss function in deep learning. It powers classification models, language models, and virtually every neural network that outputs a probability distribution, and its roots lie directly in information theory.

Cross-Entropy Defined

The cross-entropy between a true distribution P and a predicted distribution Q is:

Mathematics
H(P, Q) = -SUM[ P(x) * log(Q(x)) ]

# Relationship to KL divergence:
H(P, Q) = H(P) + D_KL(P || Q)

# Since H(P) is constant during training, minimizing
# cross-entropy = minimizing KL divergence

Why Cross-Entropy? Minimizing cross-entropy is equivalent to maximum likelihood estimation (MLE). It is also equivalent to minimizing the KL divergence between the true and predicted distributions. This is why it is the natural choice for classification.
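The identity above is easy to check numerically. A minimal sketch with two hypothetical distributions, verifying that H(P, Q) = H(P) + D_KL(P || Q):

```python
import numpy as np

# Hypothetical distributions over 3 outcomes
P = np.array([0.7, 0.2, 0.1])   # "true" distribution
Q = np.array([0.5, 0.3, 0.2])   # model's predicted distribution

cross_entropy = -np.sum(P * np.log(Q))      # H(P, Q)
entropy       = -np.sum(P * np.log(P))      # H(P)
kl_divergence =  np.sum(P * np.log(P / Q))  # D_KL(P || Q)

# The identity H(P, Q) = H(P) + D_KL(P || Q) holds exactly
print(cross_entropy)            # same value as the line below
print(entropy + kl_divergence)
```

Since H(P) is fixed by the data, the only part of the cross-entropy the model can reduce is the KL term.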

Binary Cross-Entropy

For binary classification (two classes, 0 and 1):

Python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-15):
    """Binary cross-entropy loss."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # Prevent log(0)
    return -np.mean(
        y_true * np.log(y_pred) +
        (1 - y_true) * np.log(1 - y_pred)
    )

# Perfect prediction
print(binary_cross_entropy(
    np.array([1, 0, 1]),
    np.array([0.99, 0.01, 0.99])
))  # ~0.01

# Bad prediction
print(binary_cross_entropy(
    np.array([1, 0, 1]),
    np.array([0.1, 0.9, 0.2])
))  # ~2.07
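Each per-example term of binary cross-entropy is just the negative log of the probability the model assigned to the correct label. A small sketch (with the same hypothetical values as above) making that explicit:

```python
import numpy as np

def bce_per_example(y_true, y_pred, eps=1e-15):
    """Per-example binary cross-entropy (no averaging)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1])
y_pred = np.array([0.1, 0.9, 0.2])

# Probability the model assigned to the *correct* label for each example
p_correct = np.where(y_true == 1, y_pred, 1 - y_pred)

# Each per-example loss is exactly -log of that probability
print(bce_per_example(y_true, y_pred))  # [2.3026 2.3026 1.6094]
print(-np.log(p_correct))               # identical
```

Averaging these three terms gives the mean loss reported above.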

Categorical Cross-Entropy

For multi-class classification with K classes:

Python
import torch
import torch.nn as nn

# PyTorch combines softmax + cross-entropy in one step
loss_fn = nn.CrossEntropyLoss()

# Logits (raw model output, before softmax)
logits = torch.tensor([[2.0, 1.0, 0.1]])  # 3 classes
target = torch.tensor([0])                 # True class = 0

loss = loss_fn(logits, target)
print(f"Loss: {loss.item():.4f}")  # 0.4170
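As a sanity check, the same number can be reproduced by hand in NumPy: apply softmax to the logits, then take the negative log-probability of the true class. This is a sketch of the math, not of PyTorch's internal implementation:

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.1])
target = 0

# Softmax converts logits to probabilities
probs = np.exp(logits - logits.max())  # subtract max for numerical stability
probs /= probs.sum()

# Cross-entropy = negative log-probability of the true class
loss = -np.log(probs[target])
print(f"Loss: {loss:.4f}")  # 0.4170
```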

Cross-Entropy in Language Models

Language models like GPT and Claude are trained using cross-entropy loss over the vocabulary. For each token position, the model predicts a probability distribution over all possible next tokens:

Conceptual
# For each position t in the sequence:
Loss_t = -log P_model(actual_next_token | previous_tokens)

# Total loss = average over all positions:
Loss = -(1/T) * SUM_t[ log P_model(x_t | x_1, ..., x_{t-1}) ]

# Perplexity = e^Loss with natural log (2^Loss if the loss is in bits, log base 2)
# Lower perplexity = better language model
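The steps above can be sketched with hypothetical per-token probabilities. Note that perplexity works out to the geometric mean of 1/p over the sequence:

```python
import numpy as np

# Hypothetical probabilities a model assigned to the actual next
# token at each of T = 4 positions
p_next = np.array([0.5, 0.25, 0.8, 0.1])

# Average negative log-likelihood over positions (natural log)
loss = -np.mean(np.log(p_next))

# Perplexity: e^Loss when using natural log
perplexity = np.exp(loss)

# Equivalent view: geometric mean of 1/p over the sequence
geo_mean = np.prod(1.0 / p_next) ** (1.0 / len(p_next))

print(loss, perplexity, geo_mean)
```

Intuitively, this model is as uncertain as if it were choosing uniformly among about perplexity-many tokens at each step.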

Perplexity

Perplexity is the standard evaluation metric for language models, and it is the exponentiation of cross-entropy loss:

Cross-Entropy Loss (bits) | Perplexity | Interpretation
1.0                       | 2.0        | Like choosing between 2 equally likely options
3.32                      | 10.0       | Like choosing between 10 equally likely options
6.64                      | 100.0      | Very uncertain predictions

Label Smoothing: In practice, using hard one-hot labels can cause overconfident predictions. Label smoothing replaces the target [1, 0, 0] with something like [0.9, 0.05, 0.05], which regularizes the model and often improves generalization.
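A minimal sketch of the label smoothing described above, assuming the convention in the example (a fraction alpha is moved off the true class and split evenly over the other K - 1 classes):

```python
import numpy as np

def smooth_labels(one_hot, alpha=0.1):
    """Move alpha of the probability mass off the true class,
    splitting it evenly over the other K - 1 classes."""
    k = one_hot.shape[-1]
    return one_hot * (1 - alpha) + (1 - one_hot) * alpha / (k - 1)

hard = np.array([1.0, 0.0, 0.0])
soft = smooth_labels(hard, alpha=0.1)
print(soft)  # [0.9  0.05 0.05]

# Cross-entropy against the smoothed target: the model is no longer
# rewarded for pushing the true-class probability all the way to 1
def cross_entropy_soft(target_dist, probs, eps=1e-15):
    return -np.sum(target_dist * np.log(np.clip(probs, eps, 1.0)))

overconfident = np.array([0.999, 0.0005, 0.0005])
calibrated    = np.array([0.90, 0.05, 0.05])
print(cross_entropy_soft(soft, overconfident))  # higher
print(cross_entropy_soft(soft, calibrated))     # lower (matches the soft target)
```

Recent versions of PyTorch expose this directly via nn.CrossEntropyLoss(label_smoothing=0.1), though with a slightly different convention (the smoothed mass is spread uniformly over all K classes, including the true one).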