Draft
HDBSCAN
Build a density hierarchy and select clusters that remain stable across density levels.
Hook problem: density changes by region
DBSCAN asks for one density scale. OPTICS exposes many possible scales. HDBSCAN asks which clusters persist as the density requirement changes.
The key move is to stop treating every cut as equal.
wide-density: p1, p2, p3, p4, p5, p6, p7, p8, p9 | stability 0.80
A: p1, p2, p3, p4 | stability 2.90
B: p5, p6, p7, p8 | stability 3.10
A-core: p1, p2, p3 | stability 1.40
B-core: p5, p6, p7 | stability 1.50
First naive idea: choose the prettiest density cut
Picking one cut by eye is fragile. A short-lived branch might look real for one scale and vanish immediately at the next.
HDBSCAN repairs this by scoring stability across a hierarchy.
Core invention: condensed density tree
HDBSCAN transforms distances into mutual-reachability distances, builds a hierarchy, condenses tiny branches, and selects stable clusters.
The intuition is:
Short-lived branches are treated as noise or substructure.
Interactive trace lab
Clustering algorithm trace lab
Step 1/2: Clusters split as the density requirement becomes stricter.
Keep clusters that stay stable across density levels.
A, B
Implementation sketch
compute core distances;
build mutual-reachability graph;
build a minimum spanning tree;
condense the hierarchy by minimum cluster size;
select stable branches as clusters;
Correctness intuition and cost
The hierarchy keeps a record of how clusters split as density becomes stricter. Stability prevents a cluster from winning merely because it exists at one lucky threshold.
Practical implementations depend heavily on nearest-neighbor and minimum-spanning-tree routines. A simple dense implementation can be around O(n^2), while optimized versions can do better on favorable data.
Common confusions
- HDBSCAN is not just DBSCAN with automatic
epsilon. - It can label points as noise.
- It has parameters such as minimum cluster size; it is not parameter-free.
- The hierarchy is part of the model, not just a plotting trick.
Move centroids until squared-distance clusters stop changing.
Use real data points as centers so outliers have less leverage.
Grow dense neighborhoods and leave isolated points as noise.
Order points by density reachability instead of choosing one fixed radius.
Keep clusters that stay stable across density levels.
Alternate soft membership estimates with Gaussian parameter updates.
Exercises
- Why is persistence across density levels useful?
- What kind of branch should be treated as noise?
- Why does HDBSCAN still need a minimum cluster-size choice?
Graph connections : HDBSCAN