Mutual Information (Intermediate)

Mutual information (MI) measures the amount of information that one random variable contains about another. Unlike correlation, MI captures any kind of dependency — linear or nonlinear — making it a powerful tool for AI and machine learning.

The Formula

Mathematics
I(X; Y) = SUM_x SUM_y P(x, y) * log2( P(x, y) / (P(x) * P(y)) )

# Equivalently:
I(X; Y) = H(X) + H(Y) - H(X, Y)
I(X; Y) = H(X) - H(X | Y)
I(X; Y) = D_KL( P(X, Y) || P(X) * P(Y) )
Intuition: Mutual information measures how much knowing Y reduces your uncertainty about X (or vice versa). If X and Y are independent, MI = 0; if knowing Y completely determines X, MI = H(X). The base of the logarithm sets the units: base 2 gives bits, while the natural log (used by scikit-learn) gives nats.
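
To make the formula concrete, here is a small worked example (the 2×2 joint distribution is made up for illustration). It evaluates the double sum directly and cross-checks the result against the entropy identity I(X; Y) = H(X) + H(Y) - H(X, Y):

```python
import numpy as np

# Hypothetical joint distribution of two binary variables (rows: x, cols: y)
P_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

P_x = P_xy.sum(axis=1)   # marginal P(x)
P_y = P_xy.sum(axis=0)   # marginal P(y)

# Direct evaluation of the double sum (base-2 log, so the result is in bits)
mi_bits = sum(
    P_xy[i, j] * np.log2(P_xy[i, j] / (P_x[i] * P_y[j]))
    for i in range(2) for j in range(2)
)
print(f"I(X; Y) = {mi_bits:.4f} bits")   # ≈ 0.2781 bits

# Cross-check with the entropy identity I(X; Y) = H(X) + H(Y) - H(X, Y)
def H(p):
    return -np.sum(p * np.log2(p))

print(f"H(X) + H(Y) - H(X, Y) = {H(P_x) + H(P_y) - H(P_xy.ravel()):.4f} bits")
```

Both routes give the same number, which is a useful sanity check when implementing MI by hand.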

Properties of Mutual Information

  • Non-negative: I(X; Y) ≥ 0
  • Symmetric: I(X; Y) = I(Y; X)
  • Zero iff independent: I(X; Y) = 0 ⇔ X and Y are independent
  • Upper bounded: I(X; Y) ≤ min(H(X), H(Y))
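
These properties can be checked empirically. A quick sketch with sklearn's mutual_info_score on synthetic labels (the variable names and data are illustrative); note that the plug-in estimate for truly independent variables comes out slightly above zero on finite samples:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
x = rng.integers(0, 3, size=1000)
y_dep = (x + rng.integers(0, 2, size=1000)) % 3   # partially determined by x
y_ind = rng.integers(0, 3, size=1000)             # independent of x

mi_dep = mutual_info_score(x, y_dep)
mi_ind = mutual_info_score(x, y_ind)
mi_sym = mutual_info_score(y_dep, x)              # equals mi_dep (symmetry)

print(f"dependent pair:   {mi_dep:.4f} nats")
print(f"independent pair: {mi_ind:.4f} nats (small but not exactly 0: finite-sample bias)")
```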

Computing MI in Python

Python
from sklearn.metrics import mutual_info_score
from sklearn.feature_selection import mutual_info_classif
import numpy as np

# Discrete MI using sklearn
labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [0, 0, 1, 2, 2, 2]
mi = mutual_info_score(labels_true, labels_pred)
print(f"MI = {mi:.4f} nats")

# Feature selection: MI between features and target
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
mi_scores = mutual_info_classif(X, y, random_state=42)
for i, score in enumerate(mi_scores):
    print(f"Feature {i}: MI = {score:.4f}")

AI Applications

🔬 Feature Selection

Select the features with the highest MI with the target variable. Unlike correlation, MI captures nonlinear relationships.

🔧 Information Bottleneck

The information bottleneck theory suggests deep networks learn by compressing inputs while preserving information relevant to the output.

🎯 Clustering Evaluation

Normalized mutual information (NMI) and adjusted MI (AMI) are standard metrics for evaluating clustering quality against ground truth.
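
Both metrics are available in scikit-learn. A minimal sketch (the label arrays below are made up for illustration, with a clustering that merges the last two ground-truth clusters):

```python
from sklearn.metrics import (adjusted_mutual_info_score,
                             normalized_mutual_info_score)

# Hypothetical ground truth vs. a clustering that merges clusters 2 and 3
truth    = [0, 0, 1, 1, 2, 2, 3, 3]
clusters = [0, 0, 1, 1, 2, 2, 2, 2]

nmi = normalized_mutual_info_score(truth, clusters)
ami = adjusted_mutual_info_score(truth, clusters)
print(f"NMI = {nmi:.4f}")
print(f"AMI = {ami:.4f}")
print(f"NMI of a perfect clustering = {normalized_mutual_info_score(truth, truth):.4f}")
```

AMI corrects for the MI expected between random labelings, so it is the safer choice when the number of clusters varies across the methods being compared.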

💡 Contrastive Learning

Methods like InfoNCE (used in CLIP, SimCLR) maximize a lower bound on MI between different views of the same data.
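
A toy NumPy sketch of the InfoNCE objective (the function name, batch size, and temperature are illustrative, not the actual CLIP or SimCLR implementations): the loss is a cross-entropy over pairwise similarities in which each embedding's positive is its paired view, and the classic result is that log N minus this loss lower-bounds the MI between views.

```python
import numpy as np

def info_nce_loss(z_a, z_b, temperature=0.1):
    """InfoNCE for paired embeddings: row i of z_a matches row i of z_b."""
    # L2-normalize so the dot product is cosine similarity
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature          # (N, N) similarity matrix
    # Cross-entropy with the diagonal entries as the positive pairs
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))

rng = np.random.default_rng(0)
anchors   = rng.normal(size=(8, 16))
positives = anchors + 0.05 * rng.normal(size=(8, 16))  # slightly perturbed views
unrelated = rng.normal(size=(8, 16))                   # no shared information

print(f"matched views:   {info_nce_loss(anchors, positives):.3f}")
print(f"unrelated views: {info_nce_loss(anchors, unrelated):.3f}  (near log 8)")
```

When the paired views share information, the loss drops well below log N; for unrelated views it stays near log N, reflecting an MI lower bound of roughly zero.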

MI vs. Correlation

Property                           Correlation   Mutual Information
Captures linear relationships      Yes           Yes
Captures nonlinear relationships   No            Yes
Range                              [-1, 1]       [0, ∞)
Works with categorical data        Limited       Yes
Computational cost                 Low           Higher (needs density estimation)

Estimation Challenge: Computing MI for continuous variables requires density estimation, which is notoriously difficult in high dimensions. Use k-nearest-neighbor estimators or binning for practical applications.
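
As a practical illustration, scikit-learn's mutual_info_regression uses a k-nearest-neighbor estimator under the hood. On synthetic data with a purely quadratic relationship (made up for this sketch), Pearson correlation is near zero while the MI estimate is clearly positive:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

# Synthetic data: y depends on x only through x**2, a purely nonlinear link
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=2000)
y = x**2 + 0.05 * rng.normal(size=2000)

r = np.corrcoef(x, y)[0, 1]                                          # Pearson correlation
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]  # kNN-based estimate

print(f"Pearson r ≈ {r:.3f}  (near zero: correlation misses the dependence)")
print(f"kNN-estimated MI ≈ {mi:.3f} nats  (clearly positive)")
```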