Mutual Information (Intermediate)
Mutual information (MI) measures the amount of information that one random variable contains about another. Unlike correlation, MI captures any kind of dependency — linear or nonlinear — making it a powerful tool for AI and machine learning.
The Formula
```
I(X; Y) = Σ_x Σ_y P(x, y) * log( P(x, y) / (P(x) * P(y)) )

# Equivalently:
I(X; Y) = H(X) + H(Y) - H(X, Y)
I(X; Y) = H(X) - H(X | Y)
I(X; Y) = D_KL( P(X, Y) || P(X) * P(Y) )
```
With log base 2 the result is measured in bits; with the natural logarithm it is measured in nats (the convention scikit-learn uses).
Intuition: Mutual information measures how much knowing Y reduces your uncertainty about X (or vice versa). If X and Y are independent, MI = 0. If knowing Y completely determines X, MI = H(X).
Properties of Mutual Information
- Non-negative: I(X; Y) ≥ 0
- Symmetric: I(X; Y) = I(Y; X)
- Zero iff independent: I(X; Y) = 0 ⇔ X and Y are independent
- Upper bounded: I(X; Y) ≤ min(H(X), H(Y))
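These identities and properties can be checked numerically on a small joint distribution. The 2×2 joint table below is a made-up example; the script computes MI both from the definition and from the entropy decomposition and confirms they agree:

```python
import numpy as np

# Hypothetical joint distribution of two binary variables (rows: x, cols: y),
# chosen so that X and Y are dependent.
P_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
P_x = P_xy.sum(axis=1)   # marginal of X: [0.5, 0.5]
P_y = P_xy.sum(axis=0)   # marginal of Y: [0.5, 0.5]

def H(p):
    """Shannon entropy in bits, skipping zero-probability cells."""
    p = np.asarray(p).ravel()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# I(X; Y) straight from the definition
mi_def = sum(P_xy[i, j] * np.log2(P_xy[i, j] / (P_x[i] * P_y[j]))
             for i in range(2) for j in range(2))

# I(X; Y) = H(X) + H(Y) - H(X, Y)
mi_ent = H(P_x) + H(P_y) - H(P_xy)

print(mi_def, mi_ent)  # both ≈ 0.2781 bits
```

The same run also illustrates the bounds: the result is non-negative and does not exceed min(H(X), H(Y)) = 1 bit.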
Computing MI in Python
```python
from sklearn.metrics import mutual_info_score
from sklearn.feature_selection import mutual_info_classif
from sklearn.datasets import load_iris

# Discrete MI between two label sequences
labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [0, 0, 1, 2, 2, 2]
mi = mutual_info_score(labels_true, labels_pred)
print(f"MI = {mi:.4f} nats")

# Feature selection: MI between each feature and the target
X, y = load_iris(return_X_y=True)
mi_scores = mutual_info_classif(X, y, random_state=42)
for i, score in enumerate(mi_scores):
    print(f"Feature {i}: MI = {score:.4f}")
```
AI Applications
Feature Selection
Rank features by their MI with the target variable and keep the highest-scoring ones. Unlike correlation, MI also captures nonlinear relationships.
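As a sketch, scikit-learn's `SelectKBest` accepts `mutual_info_classif` as its scoring function; the choice of `k=2` below is arbitrary, purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest MI against the class label.
selector = SelectKBest(mutual_info_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)        # (150, 2)
print(selector.get_support())  # boolean mask over the 4 original features
```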
Information Bottleneck
The information bottleneck theory suggests deep networks learn by compressing inputs while preserving information relevant to the output.
Clustering Evaluation
Normalized mutual information (NMI) and adjusted MI (AMI) are standard metrics for evaluating clustering quality against ground truth.
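Both scores are available in scikit-learn; a minimal sketch, reusing the toy labels from the earlier example. Note that both metrics are invariant to how cluster ids are numbered:

```python
from sklearn.metrics import (normalized_mutual_info_score,
                             adjusted_mutual_info_score)

labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [0, 0, 1, 2, 2, 2]

nmi = normalized_mutual_info_score(labels_true, labels_pred)
ami = adjusted_mutual_info_score(labels_true, labels_pred)
print(f"NMI = {nmi:.4f}, AMI = {ami:.4f}")

# Permuting cluster ids (0->2, 1->0, 2->1) leaves the score unchanged:
relabeled = [2, 2, 0, 1, 1, 1]
assert abs(normalized_mutual_info_score(labels_true, relabeled) - nmi) < 1e-9
```

AMI additionally corrects for chance agreement, which matters when comparing clusterings with different numbers of clusters.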
Contrastive Learning
Methods like InfoNCE (used in CLIP, SimCLR) maximize a lower bound on MI between different views of the same data.
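A minimal NumPy sketch of the InfoNCE objective (the function name and batch setup here are illustrative, not taken from any particular library): each `z1[i]`/`z2[i]` pair is a positive, and every `z2[j]` with `j != i` serves as an in-batch negative.

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE loss for a batch of paired embeddings.

    log(N) - loss is a lower bound on the MI between the two views,
    so a lower loss corresponds to a tighter (higher) bound.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature             # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # positives on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))

# Matched views (small perturbation) give a low loss...
loss_aligned = info_nce(z, z + 0.05 * rng.normal(size=(8, 16)))
# ...while unrelated embeddings give a loss near log(8).
loss_random = info_nce(z, rng.normal(size=(8, 16)))
print(loss_aligned, loss_random)
```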
MI vs. Correlation
| Property | Correlation | Mutual Information |
|---|---|---|
| Captures linear relationships | Yes | Yes |
| Captures nonlinear relationships | No | Yes |
| Range | [-1, 1] | [0, ∞) |
| Works with categorical data | Limited | Yes |
| Computational cost | Low | Higher (needs density estimation) |
Estimation Challenge: Computing MI for continuous variables requires density estimation, which is notoriously difficult in high dimensions. Use k-nearest-neighbor estimators or binning for practical applications.
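scikit-learn's `mutual_info_regression` implements such a k-nearest-neighbor estimator. This sketch, on synthetic data with a quadratic relationship, shows it detecting a dependency that Pearson correlation misses:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(42)
x = rng.uniform(-2, 2, size=2000)
y = x**2 + 0.1 * rng.normal(size=2000)   # nonlinear dependence on x

# Pearson correlation is blind to the symmetric quadratic relationship...
corr = np.corrcoef(x, y)[0, 1]
print(corr)                              # close to 0

# ...while the kNN-based MI estimator detects it (result in nats).
mi = mutual_info_regression(x.reshape(-1, 1), y,
                            n_neighbors=3, random_state=0)[0]
print(mi)                                # well above 0
```

The `n_neighbors` parameter trades variance against bias; small values pick up fine-grained dependence but are noisier.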