Cross-Entropy Loss: Advanced
Cross-entropy loss is the most widely used loss function in deep learning. It powers classification models, language models, and virtually every neural network that outputs a probability distribution. Its roots lie directly in information theory.
Cross-Entropy Defined
The cross-entropy between a true distribution P and a predicted distribution Q is:
Mathematics

```
H(P, Q) = -SUM[ P(x) * log(Q(x)) ]

# Relationship to KL divergence:
H(P, Q) = H(P) + D_KL(P || Q)

# Since H(P) is constant during training, minimizing
# cross-entropy = minimizing KL divergence
```
Why Cross-Entropy?

Minimizing cross-entropy is equivalent to maximum likelihood estimation (MLE). It is also equivalent to minimizing the KL divergence between the true and predicted distributions. This is why it is the natural choice for classification.
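The decomposition H(P, Q) = H(P) + D_KL(P || Q) can be checked numerically. A quick sketch with NumPy, using arbitrary example distributions P and Q:

```python
import numpy as np

P = np.array([0.7, 0.2, 0.1])  # example "true" distribution
Q = np.array([0.5, 0.3, 0.2])  # example predicted distribution

cross_entropy = -np.sum(P * np.log(Q))   # H(P, Q)
entropy = -np.sum(P * np.log(P))         # H(P)
kl = np.sum(P * np.log(P / Q))           # D_KL(P || Q)

# The two sides agree: H(P, Q) = H(P) + D_KL(P || Q)
print(cross_entropy, entropy + kl)
```

Since H(P) does not depend on the model, any update that lowers cross-entropy lowers the KL divergence by exactly the same amount.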
Binary Cross-Entropy
For binary classification (two classes, 0 and 1):
```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-15):
    """Binary cross-entropy loss."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # Prevent log(0)
    return -np.mean(
        y_true * np.log(y_pred)
        + (1 - y_true) * np.log(1 - y_pred)
    )

# Perfect prediction
print(binary_cross_entropy(
    np.array([1, 0, 1]),
    np.array([0.99, 0.01, 0.99])
))  # ~0.01

# Bad prediction
print(binary_cross_entropy(
    np.array([1, 0, 1]),
    np.array([0.1, 0.9, 0.2])
))  # ~2.07
```
Categorical Cross-Entropy
For multi-class classification with K classes:
```python
import torch
import torch.nn as nn

# PyTorch combines softmax + cross-entropy in one step
loss_fn = nn.CrossEntropyLoss()

# Logits (raw model output, before softmax)
logits = torch.tensor([[2.0, 1.0, 0.1]])  # 3 classes
target = torch.tensor([0])                # True class = 0

loss = loss_fn(logits, target)
print(f"Loss: {loss.item():.4f}")  # 0.4170
```
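The combined softmax + log step can be reproduced by hand. A sketch in NumPy with the same logits as above (the max-subtraction is the standard numerical-stability trick, not part of the original example):

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.1])

# Numerically stable softmax: subtracting the max does not change the result
z = logits - logits.max()
probs = np.exp(z) / np.exp(z).sum()

# Negative log-likelihood of the true class (class 0)
loss = -np.log(probs[0])
print(f"{loss:.4f}")  # 0.4170, matching nn.CrossEntropyLoss
```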
Cross-Entropy in Language Models
Language models like GPT and Claude are trained using cross-entropy loss over the vocabulary. For each token position, the model predicts a probability distribution over all possible next tokens:
Conceptual

```
# For each position t in the sequence:
Loss_t = -log P_model(actual_next_token | previous_tokens)

# Total loss = average over all positions:
Loss = -(1/T) * SUM_t[ log P_model(x_t | x_1, ..., x_{t-1}) ]

# Perplexity = 2^Loss (or e^Loss if using natural log)
# Lower perplexity = better language model
```
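To make this concrete, here is a toy sketch; the per-token probabilities below are made-up values, not the output of any real model:

```python
import numpy as np

# Hypothetical probabilities a model assigned to each actual next token
token_probs = np.array([0.5, 0.25, 0.1, 0.8])

loss = -np.mean(np.log(token_probs))  # cross-entropy in nats (natural log)
perplexity = np.exp(loss)             # e^Loss, since we used the natural log

print(loss, perplexity)  # perplexity ~3.16
```

Perplexity is the inverse geometric mean of the assigned probabilities, so it reads as "the model is as uncertain as a uniform choice over ~3 tokens."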
Perplexity
Perplexity is the standard evaluation metric for language models; it is the exponential of the cross-entropy loss:

| Cross-Entropy Loss (bits) | Perplexity | Interpretation |
|---|---|---|
| 1.0 | 2.0 | Like choosing between 2 equally likely options |
| 3.32 | 10.0 | Like choosing between 10 equally likely options |
| 6.64 | 100.0 | Very uncertain predictions |
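The table rows follow directly from perplexity = 2^Loss when the loss is measured in bits (base-2 logs):

```python
# Reproduce the table rows: perplexity = 2 ** loss for base-2 cross-entropy
for loss_bits in [1.0, 3.32, 6.64]:
    print(f"loss {loss_bits} bits -> perplexity {2 ** loss_bits:.1f}")
```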
Label Smoothing: In practice, using hard one-hot labels can cause overconfident predictions. Label smoothing replaces the target [1, 0, 0] with something like [0.9, 0.05, 0.05], which regularizes the model and often improves generalization.
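One way to build such smoothed targets (a minimal sketch; the function name and eps value are illustrative, and it spreads eps evenly over the K-1 wrong classes, matching the [0.9, 0.05, 0.05] example above):

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Keep 1 - eps on the true class; split eps across the other classes."""
    k = one_hot.shape[-1]
    return one_hot * (1 - eps) + (1 - one_hot) * eps / (k - 1)

print(smooth_labels(np.array([1.0, 0.0, 0.0])))  # [0.9, 0.05, 0.05]
```

PyTorch's `nn.CrossEntropyLoss` also accepts a `label_smoothing` argument, so in practice no manual target construction is needed.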