DATA & STATISTICS

Cluster analysis

Family of unsupervised methods that groups observations by similarity. Classical algorithms: k-means (MacQueen, 1967), hierarchical clustering, DBSCAN. Validation via silhouette (Rousseeuw, 1987), stability, and interpretability.

Extended definition

Cluster analysis is the family of unsupervised methods that groups observations into sets (clusters) based on similarity, without predefined labels, in contrast to supervised classification. Three families dominate: k-means (MacQueen, 1967) partitions n observations into k clusters by minimizing the sum of squared distances from each observation to its cluster centroid; hierarchical clustering (agglomerative: iteratively merges the closest clusters; divisive: iteratively splits clusters) produces a dendrogram; DBSCAN (Ester et al., 1996) uses local density to identify arbitrarily shaped clusters and to flag outliers. Critical methodological decisions: the distance measure (Euclidean, Manhattan, cosine), variable normalization (essential for Euclidean distances), and the number of clusters k (elbow plot, gap statistic, silhouette). Rousseeuw (1987) proposed the silhouette coefficient, a standard quality metric for clustering: it ranges from -1 to 1, with values close to 1 indicating observations well fitted to their cluster. Modern methods include spectral clustering, Gaussian mixture models, and hierarchical clustering on embeddings (BERTopic etc.).
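To make the k-means objective and the silhouette coefficient concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not a reference implementation: the function names `kmeans` and `silhouette`, the random initialisation, and the toy data are all hypothetical choices made for this example. It also assumes Euclidean distance and at least two non-empty clusters.

```python
import math
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate an assignment step and a centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k observations drawn without replacement
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


def silhouette(points, labels):
    """Mean of s = (b - a) / max(a, b) over all points (needs >= 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:  # convention: a singleton cluster scores 0
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Toy data (made up for illustration): two well-separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, k=2)
score = silhouette(points, labels)
print(labels, round(score, 2))
```

On data this cleanly separated the mean silhouette lands close to 1. On real data, the result of k-means depends on initialisation, so several restarts with different seeds are a common precaution.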

When it applies

Cluster analysis applies in exploratory research where the goal is to discover latent structure without labels: patient segmentation in epidemiological studies, consumer profiles in marketing, behavioral group identification in education, document grouping in NLP, unsupervised classification in genomics. It is useful in conceptual dimensionality reduction: a cluster replaces multiple correlated variables with an interpretable category. In ML, clustering is a frequent step in feature engineering and in pseudo-label generation for semi-supervised learning.

When it does not apply

It does not apply when labels exist or can be obtained at reasonable cost: supervised classification is more informative. It does not apply as proof that groups exist: clustering algorithms always produce clusters, even in random data, so genuine structure must be validated by stability across resamples (bootstrap clustering) and by interpretability. Euclidean-distance k-means does not apply to categorical variables; use k-modes, k-prototypes, or Gower distance instead. It does not apply to datasets with heterogeneous scales unless the variables are normalized first: the variable with the largest magnitude dominates the distance. In high dimensionality (p > n, more variables than observations), distances concentrate (curse of dimensionality) and standard clustering can fail, so prior reduction (PCA, embeddings) is generally needed.
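As an illustration of the Gower alternative for mixed data, the sketch below computes a simplified Gower dissimilarity in plain Python. It assumes equal feature weights and no missing values. The helper names (`numeric_ranges`, `gower_distance`) and the three records are hypothetical, invented for this example.

```python
def numeric_ranges(records, categorical):
    """Range (max - min) of each numeric column; None marks a categorical column."""
    columns = list(zip(*records))
    return [
        None if i in categorical else max(col) - min(col)
        for i, col in enumerate(columns)
    ]


def gower_distance(a, b, ranges):
    """Simplified Gower dissimilarity: average of per-feature dissimilarities in [0, 1]."""
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if r is None:
            total += 0.0 if x == y else 1.0  # categorical: simple mismatch
        elif r > 0:
            total += abs(x - y) / r  # numeric: range-normalised difference
        # a numeric feature with zero range is constant and contributes 0
    return total / len(ranges)


# Hypothetical records: (age in years, income, region)
records = [(25, 30000, "north"), (45, 90000, "north"), (65, 30000, "south")]
ranges = numeric_ranges(records, categorical={2})

d01 = gower_distance(records[0], records[1], ranges)
d02 = gower_distance(records[0], records[2], ranges)
print(d01, d02)
```

Because every feature contributes a value between 0 and 1, no single variable can dominate through its scale alone. The resulting dissimilarity matrix can then feed a method that accepts arbitrary dissimilarities, such as agglomerative hierarchical clustering.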

Applications by field

Genomics: hierarchical clustering of gene expression data to identify disease subtypes; transcriptomics.
Marketing: k-means for customer segmentation; behavioral profiles in CRM.
NLP: clustering of embeddings to group documents semantically; BERTopic in modern topic modeling.
Education: identification of learning profiles; pattern analysis in learning analytics.

Common pitfalls

The first pitfall is treating clustering results as proof of real structure: k-means with k = 3 always produces 3 groups, even in structureless data. Stability checks (bootstrap), comparison against a null reference (gap statistic), and silhouette are essential. The second is failing to normalize variables: with age in years (0–100) and income in currency units (0–500,000) in the same Euclidean analysis, income dominates the distance almost entirely. The third is confusing centroid interpretation with typology: a centroid is a mean, and no real subject needs to be near it. The fourth is choosing k by elbow plot without confirming with gap statistic or silhouette: the elbow is a heuristic and often ambiguous. The fifth is over-interpreting differences between clusters: post-hoc comparisons between clusters on the variables used in the clustering itself are circular, because the difference is constructed, not discovered.
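The normalization pitfall can be checked with a few lines of arithmetic. The sketch below uses made-up (age, income) rows and hypothetical helper names. It measures what fraction of the squared Euclidean distance between two people comes from income, before and after z-scoring each column. It assumes no column is constant, since a zero standard deviation would cause a division by zero.

```python
import statistics


def zscore_columns(rows):
    """Standardise each column to mean 0 and (population) standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.pstdev(col) for col in columns]
    return [
        tuple((x - m) / s for x, m, s in zip(row, means, sds))
        for row in rows
    ]


def income_share(p, q):
    """Fraction of the squared Euclidean distance contributed by column 1 (income)."""
    squared = [(x - y) ** 2 for x, y in zip(p, q)]
    return squared[1] / sum(squared)


# Made-up rows: (age in years, income in currency units)
rows = [(20, 30000), (60, 32000), (25, 90000), (65, 95000)]

# Rows 0 and 1 differ by 40 years of age but only 2,000 in income.
raw_share = income_share(rows[0], rows[1])
scaled = zscore_columns(rows)
scaled_share = income_share(scaled[0], scaled[1])
print(round(raw_share, 4), round(scaled_share, 4))
```

On the raw values, income accounts for nearly all of the distance even though the two people are far apart in age and close in income. After standardisation the picture reverses and age carries almost all of the distance.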
