Clustering Algorithms (11)

Algorithms for grouping similar data points without labels

Clustering is an unsupervised learning technique that groups data points so that items in the same cluster are more similar to each other than to those in other clusters. These 11 algorithms cover centroid-based, density-based, hierarchical, and distribution-based approaches.

Quick Reference Table

AlgorithmTypeRequires k?Handles Noise?Cluster Shape
K-MeansCentroidYesNoSpherical
Mini-Batch K-MeansCentroidYesNoSpherical
HierarchicalHierarchicalYes/NoNoAny
AgglomerativeHierarchicalYesNoAny
DBSCANDensityNoYesArbitrary
OPTICSDensityNoYesArbitrary
Mean ShiftCentroidNoNoArbitrary
SpectralGraphYesNoNon-convex
GMMDistributionYesNoElliptical
BIRCHHierarchicalYesNoSpherical
Affinity PropagationGraphNoNoArbitrary

1. K-Means

Description: Partitions data into k clusters by iteratively assigning points to the nearest centroid, then recalculating centroids. Minimizes within-cluster sum of squares (inertia). Converges to a local optimum.

Use Cases: Customer segmentation, image compression, document clustering, quantization, general-purpose clustering baseline.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)

model = KMeans(n_clusters=4, init='k-means++', n_init=10, random_state=42)
labels = model.fit_predict(X)

print(f"Inertia: {model.inertia_:.2f}")
print(f"Silhouette Score: {silhouette_score(X, labels):.4f}")
print(f"Cluster centers shape: {model.cluster_centers_.shape}")

2. Mini-Batch K-Means

Description: A faster variant of K-Means that uses random mini-batches of the data in each iteration instead of the full dataset. Slightly less accurate but much faster for large datasets.

Use Cases: Large-scale clustering, online clustering, when standard K-Means is too slow.

from sklearn.cluster import MiniBatchKMeans

model = MiniBatchKMeans(
    n_clusters=4,
    batch_size=100,
    n_init=10,
    random_state=42
)
labels = model.fit_predict(X)

print(f"Inertia: {model.inertia_:.2f}")
print(f"Silhouette Score: {silhouette_score(X, labels):.4f}")

3. Hierarchical Clustering

Description: Builds a tree (dendrogram) of clusters by either merging small clusters (agglomerative / bottom-up) or splitting large clusters (divisive / top-down). The dendrogram can be cut at any level to produce the desired number of clusters.

Use Cases: Taxonomy construction, gene expression analysis, when you need to explore cluster structure at multiple granularities.

from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
import matplotlib.pyplot as plt

# Compute linkage matrix
Z = linkage(X, method='ward', metric='euclidean')

# Cut at desired number of clusters
labels = fcluster(Z, t=4, criterion='maxclust')
print(f"Number of clusters: {len(set(labels))}")
print(f"Silhouette Score: {silhouette_score(X, labels):.4f}")

4. Agglomerative Clustering

Description: Bottom-up hierarchical clustering in scikit-learn. Starts with each point as its own cluster and merges the closest pairs iteratively. Supports different linkage criteria: ward, complete, average, single.

Use Cases: Customer segmentation, image segmentation, document grouping.

from sklearn.cluster import AgglomerativeClustering

model = AgglomerativeClustering(
    n_clusters=4,
    linkage='ward',        # 'ward', 'complete', 'average', 'single'
    metric='euclidean'
)
labels = model.fit_predict(X)

print(f"Number of clusters: {model.n_clusters_}")
print(f"Silhouette Score: {silhouette_score(X, labels):.4f}")

5. DBSCAN

Description: Density-Based Spatial Clustering of Applications with Noise. Groups together points that are closely packed (high density) and marks points in low-density regions as outliers. Does not require specifying the number of clusters.

Use Cases: Anomaly detection, geospatial analysis, arbitrary-shaped clusters, when you do not know k in advance.

from sklearn.cluster import DBSCAN

model = DBSCAN(eps=0.5, min_samples=5, metric='euclidean')
labels = model.fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = (labels == -1).sum()

print(f"Clusters found: {n_clusters}")
print(f"Noise points: {n_noise}")
if n_clusters > 1:
    mask = labels != -1
    print(f"Silhouette Score: {silhouette_score(X[mask], labels[mask]):.4f}")

6. OPTICS

Description: Ordering Points To Identify the Clustering Structure. An extension of DBSCAN that does not require a fixed eps parameter. Instead, it produces a reachability plot from which clusters of varying density can be extracted.

Use Cases: Multi-density clusters, exploratory analysis, when DBSCAN's fixed eps is too restrictive.

from sklearn.cluster import OPTICS

model = OPTICS(
    min_samples=5,
    xi=0.05,                 # Steepness for cluster extraction
    min_cluster_size=0.05,   # Min fraction of points in a cluster
    metric='euclidean'
)
labels = model.fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"Clusters found: {n_clusters}")
print(f"Noise points: {(labels == -1).sum()}")

7. Mean Shift

Description: A centroid-based algorithm that works by shifting each data point towards the mode (peak) of the local density. Points that converge to the same mode form a cluster. Automatically determines the number of clusters.

Use Cases: Image segmentation, object tracking in video, when you do not want to specify k.

from sklearn.cluster import MeanShift, estimate_bandwidth

# Estimate bandwidth (kernel size)
bandwidth = estimate_bandwidth(X, quantile=0.2)

model = MeanShift(bandwidth=bandwidth, bin_seeding=True)
labels = model.fit_predict(X)

print(f"Clusters found: {len(set(labels))}")
print(f"Cluster centers: {model.cluster_centers_.shape}")
print(f"Silhouette Score: {silhouette_score(X, labels):.4f}")

8. Spectral Clustering

Description: Uses the eigenvalues of the similarity (affinity) matrix to perform dimensionality reduction before clustering in the reduced space with K-Means. Can find non-convex clusters that K-Means cannot.

Use Cases: Image segmentation, community detection in graphs, non-convex cluster shapes.

from sklearn.cluster import SpectralClustering

model = SpectralClustering(
    n_clusters=4,
    affinity='rbf',           # or 'nearest_neighbors'
    gamma=1.0,
    assign_labels='kmeans',
    random_state=42
)
labels = model.fit_predict(X)

print(f"Silhouette Score: {silhouette_score(X, labels):.4f}")

9. Gaussian Mixture Model (GMM)

Description: A probabilistic model that assumes data is generated from a mixture of k Gaussian distributions. Uses the EM (Expectation-Maximization) algorithm to estimate the mean, covariance, and weight of each Gaussian component. Provides soft (probabilistic) cluster assignments.

Use Cases: When clusters have different shapes/sizes, density estimation, when you need probability of cluster membership.

from sklearn.mixture import GaussianMixture

model = GaussianMixture(
    n_components=4,
    covariance_type='full',   # 'full', 'tied', 'diag', 'spherical'
    n_init=5,
    random_state=42
)
model.fit(X)
labels = model.predict(X)

# Soft assignments (probabilities)
proba = model.predict_proba(X)

print(f"BIC: {model.bic(X):.2f}")
print(f"AIC: {model.aic(X):.2f}")
print(f"Silhouette Score: {silhouette_score(X, labels):.4f}")
print(f"Converged: {model.converged_}")

10. BIRCH

Description: Balanced Iterative Reducing and Clustering using Hierarchies. A memory-efficient hierarchical clustering algorithm that incrementally builds a compact summary tree (CF tree) of the data, suitable for very large datasets.

Use Cases: Very large datasets that do not fit in memory, incremental clustering, preprocessing for other clustering methods.

from sklearn.cluster import Birch

model = Birch(
    n_clusters=4,
    threshold=0.5,            # Radius of subclusters
    branching_factor=50
)
labels = model.fit_predict(X)

print(f"Clusters: {len(set(labels))}")
print(f"Silhouette Score: {silhouette_score(X, labels):.4f}")

11. Affinity Propagation

Description: Identifies exemplars (representative points) by exchanging messages between data points. Each point votes on which other point should be its exemplar. Does not require specifying the number of clusters.

Use Cases: Finding representative examples, when k is unknown, small-to-medium datasets.

from sklearn.cluster import AffinityPropagation

model = AffinityPropagation(
    damping=0.9,              # Damping factor (0.5 to 1.0)
    preference=None,          # None = median similarity
    max_iter=200,
    random_state=42
)
labels = model.fit_predict(X)

n_clusters = len(model.cluster_centers_indices_)
print(f"Clusters found: {n_clusters}")
print(f"Exemplar indices: {model.cluster_centers_indices_}")
if n_clusters > 1:
    print(f"Silhouette Score: {silhouette_score(X, labels):.4f}")