Clustering Algorithms (11)
Algorithms for grouping similar data points without labels
Clustering is an unsupervised learning technique that groups data points so that items in the same cluster are more similar to each other than to those in other clusters. These 11 algorithms cover centroid-based, density-based, hierarchical, and distribution-based approaches.
Quick Reference Table
| Algorithm | Type | Requires k? | Handles Noise? | Cluster Shape |
|---|---|---|---|---|
| K-Means | Centroid | Yes | No | Spherical |
| Mini-Batch K-Means | Centroid | Yes | No | Spherical |
| Hierarchical | Hierarchical | Yes/No | No | Any |
| Agglomerative | Hierarchical | Yes | No | Any |
| DBSCAN | Density | No | Yes | Arbitrary |
| OPTICS | Density | No | Yes | Arbitrary |
| Mean Shift | Centroid | No | No | Arbitrary |
| Spectral | Graph | Yes | No | Non-convex |
| GMM | Distribution | Yes | No | Elliptical |
| BIRCH | Hierarchical | Yes | No | Spherical |
| Affinity Propagation | Graph | No | No | Arbitrary |
1. K-Means
Description: Partitions data into k clusters by iteratively assigning points to the nearest centroid, then recalculating centroids. Minimizes within-cluster sum of squares (inertia). Converges to a local optimum.
Use Cases: Customer segmentation, image compression, document clustering, quantization, general-purpose clustering baseline.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)
model = KMeans(n_clusters=4, init='k-means++', n_init=10, random_state=42)
labels = model.fit_predict(X)
print(f"Inertia: {model.inertia_:.2f}")
print(f"Silhouette Score: {silhouette_score(X, labels):.4f}")
print(f"Cluster centers shape: {model.cluster_centers_.shape}")
2. Mini-Batch K-Means
Description: A faster variant of K-Means that uses random mini-batches of the data in each iteration instead of the full dataset. Slightly less accurate but much faster for large datasets.
Use Cases: Large-scale clustering, online clustering, when standard K-Means is too slow.
from sklearn.cluster import MiniBatchKMeans
model = MiniBatchKMeans(
n_clusters=4,
batch_size=100,
n_init=10,
random_state=42
)
labels = model.fit_predict(X)
print(f"Inertia: {model.inertia_:.2f}")
print(f"Silhouette Score: {silhouette_score(X, labels):.4f}")
3. Hierarchical Clustering
Description: Builds a tree (dendrogram) of clusters by either merging small clusters (agglomerative / bottom-up) or splitting large clusters (divisive / top-down). The dendrogram can be cut at any level to produce the desired number of clusters.
Use Cases: Taxonomy construction, gene expression analysis, when you need to explore cluster structure at multiple granularities.
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
import matplotlib.pyplot as plt
# Compute linkage matrix
Z = linkage(X, method='ward', metric='euclidean')
# Cut at desired number of clusters
labels = fcluster(Z, t=4, criterion='maxclust')
print(f"Number of clusters: {len(set(labels))}")
print(f"Silhouette Score: {silhouette_score(X, labels):.4f}")
4. Agglomerative Clustering
Description: Bottom-up hierarchical clustering in scikit-learn. Starts with each point as its own cluster and merges the closest pairs iteratively. Supports different linkage criteria: ward, complete, average, single.
Use Cases: Customer segmentation, image segmentation, document grouping.
from sklearn.cluster import AgglomerativeClustering
model = AgglomerativeClustering(
n_clusters=4,
linkage='ward', # 'ward', 'complete', 'average', 'single'
metric='euclidean'
)
labels = model.fit_predict(X)
print(f"Number of clusters: {model.n_clusters_}")
print(f"Silhouette Score: {silhouette_score(X, labels):.4f}")
5. DBSCAN
Description: Density-Based Spatial Clustering of Applications with Noise. Groups together points that are closely packed (high density) and marks points in low-density regions as outliers. Does not require specifying the number of clusters.
Use Cases: Anomaly detection, geospatial analysis, arbitrary-shaped clusters, when you do not know k in advance.
from sklearn.cluster import DBSCAN
model = DBSCAN(eps=0.5, min_samples=5, metric='euclidean')
labels = model.fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = (labels == -1).sum()
print(f"Clusters found: {n_clusters}")
print(f"Noise points: {n_noise}")
if n_clusters > 1:
mask = labels != -1
print(f"Silhouette Score: {silhouette_score(X[mask], labels[mask]):.4f}")
6. OPTICS
Description: Ordering Points To Identify the Clustering Structure. An extension of DBSCAN that does not require a fixed eps parameter. Instead, it produces a reachability plot from which clusters of varying density can be extracted.
Use Cases: Multi-density clusters, exploratory analysis, when DBSCAN's fixed eps is too restrictive.
from sklearn.cluster import OPTICS
model = OPTICS(
min_samples=5,
xi=0.05, # Steepness for cluster extraction
min_cluster_size=0.05, # Min fraction of points in a cluster
metric='euclidean'
)
labels = model.fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"Clusters found: {n_clusters}")
print(f"Noise points: {(labels == -1).sum()}")
7. Mean Shift
Description: A centroid-based algorithm that works by shifting each data point towards the mode (peak) of the local density. Points that converge to the same mode form a cluster. Automatically determines the number of clusters.
Use Cases: Image segmentation, object tracking in video, when you do not want to specify k.
from sklearn.cluster import MeanShift, estimate_bandwidth
# Estimate bandwidth (kernel size)
bandwidth = estimate_bandwidth(X, quantile=0.2)
model = MeanShift(bandwidth=bandwidth, bin_seeding=True)
labels = model.fit_predict(X)
print(f"Clusters found: {len(set(labels))}")
print(f"Cluster centers: {model.cluster_centers_.shape}")
print(f"Silhouette Score: {silhouette_score(X, labels):.4f}")
8. Spectral Clustering
Description: Uses the eigenvalues of the similarity (affinity) matrix to perform dimensionality reduction before clustering in the reduced space with K-Means. Can find non-convex clusters that K-Means cannot.
Use Cases: Image segmentation, community detection in graphs, non-convex cluster shapes.
from sklearn.cluster import SpectralClustering
model = SpectralClustering(
n_clusters=4,
affinity='rbf', # or 'nearest_neighbors'
gamma=1.0,
assign_labels='kmeans',
random_state=42
)
labels = model.fit_predict(X)
print(f"Silhouette Score: {silhouette_score(X, labels):.4f}")
9. Gaussian Mixture Model (GMM)
Description: A probabilistic model that assumes data is generated from a mixture of k Gaussian distributions. Uses the EM (Expectation-Maximization) algorithm to estimate the mean, covariance, and weight of each Gaussian component. Provides soft (probabilistic) cluster assignments.
Use Cases: When clusters have different shapes/sizes, density estimation, when you need probability of cluster membership.
from sklearn.mixture import GaussianMixture
model = GaussianMixture(
n_components=4,
covariance_type='full', # 'full', 'tied', 'diag', 'spherical'
n_init=5,
random_state=42
)
model.fit(X)
labels = model.predict(X)
# Soft assignments (probabilities)
proba = model.predict_proba(X)
print(f"BIC: {model.bic(X):.2f}")
print(f"AIC: {model.aic(X):.2f}")
print(f"Silhouette Score: {silhouette_score(X, labels):.4f}")
print(f"Converged: {model.converged_}")
10. BIRCH
Description: Balanced Iterative Reducing and Clustering using Hierarchies. A memory-efficient hierarchical clustering algorithm that incrementally builds a compact summary tree (CF tree) of the data, suitable for very large datasets.
Use Cases: Very large datasets that do not fit in memory, incremental clustering, preprocessing for other clustering methods.
from sklearn.cluster import Birch
model = Birch(
n_clusters=4,
threshold=0.5, # Radius of subclusters
branching_factor=50
)
labels = model.fit_predict(X)
print(f"Clusters: {len(set(labels))}")
print(f"Silhouette Score: {silhouette_score(X, labels):.4f}")
11. Affinity Propagation
Description: Identifies exemplars (representative points) by exchanging messages between data points. Each point votes on which other point should be its exemplar. Does not require specifying the number of clusters.
Use Cases: Finding representative examples, when k is unknown, small-to-medium datasets.
from sklearn.cluster import AffinityPropagation
model = AffinityPropagation(
damping=0.9, # Damping factor (0.5 to 1.0)
preference=None, # None = median similarity
max_iter=200,
random_state=42
)
labels = model.fit_predict(X)
n_clusters = len(model.cluster_centers_indices_)
print(f"Clusters found: {n_clusters}")
print(f"Exemplar indices: {model.cluster_centers_indices_}")
if n_clusters > 1:
print(f"Silhouette Score: {silhouette_score(X, labels):.4f}")