Graph connections

Draft

HDBSCAN

Build a density hierarchy and select clusters that remain stable across density levels.

algorithm advanced machine-learningclusteringalgorithms

Hook problem: density changes by region

DBSCAN asks for one density scale. OPTICS exposes many possible scales. HDBSCAN asks which clusters persist as the density requirement changes.

The key move is to stop treating every cut as equal.

Stable branches surviveHDBSCAN condenses a density hierarchy and favors clusters that persist.
lambda = 0.25

wide-density: p1, p2, p3, p4, p5, p6, p7, p8, p9 | stability 0.80

lambda = 0.90

A: p1, p2, p3, p4 | stability 2.90

B: p5, p6, p7, p8 | stability 3.10

lambda = 1.60

A-core: p1, p2, p3 | stability 1.40

B-core: p5, p6, p7 | stability 1.50

First naive idea: choose the prettiest density cut

Picking one cut by eye is fragile. A short-lived branch might look real for one scale and vanish immediately at the next.

HDBSCAN repairs this by scoring stability across a hierarchy.

Core invention: condensed density tree

HDBSCAN transforms distances into mutual-reachability distances, builds a hierarchy, condenses tiny branches, and selects stable clusters.

The intuition is:

stable clustermany points staying together across many density levels\text{stable cluster} \approx \text{many points staying together across many density levels}

Short-lived branches are treated as noise or substructure.

Interactive trace lab

Clustering algorithm trace lab

Step 1/2: Clusters split as the density requirement becomes stricter.

sweep density levels

Keep clusters that stay stable across density levels.

stable branches

A, B

Implementation sketch

compute core distances;
build mutual-reachability graph;
build a minimum spanning tree;
condense the hierarchy by minimum cluster size;
select stable branches as clusters;

Correctness intuition and cost

The hierarchy keeps a record of how clusters split as density becomes stricter. Stability prevents a cluster from winning merely because it exists at one lucky threshold.

Practical implementations depend heavily on nearest-neighbor and minimum-spanning-tree routines. A simple dense implementation can be around O(n^2), while optimized versions can do better on favorable data.

Common confusions

  • HDBSCAN is not just DBSCAN with automatic epsilon.
  • It can label points as noise.
  • It has parameters such as minimum cluster size; it is not parameter-free.
  • The hierarchy is part of the model, not just a plotting trick.
Which pain does each algorithm repair?The family splits by center assumptions, density assumptions, hierarchy, and soft probability.
K-Means

Move centroids until squared-distance clusters stop changing.

K-Medoids

Use real data points as centers so outliers have less leverage.

DBSCAN

Grow dense neighborhoods and leave isolated points as noise.

OPTICS

Order points by density reachability instead of choosing one fixed radius.

HDBSCAN

Keep clusters that stay stable across density levels.

EM for GMM

Alternate soft membership estimates with Gaussian parameter updates.

Exercises

  1. Why is persistence across density levels useful?
  2. What kind of branch should be treated as noise?
  3. Why does HDBSCAN still need a minimum cluster-size choice?

Graph connections : HDBSCAN