Draft
DBSCAN
Find clusters as connected dense neighborhoods while marking sparse points as noise.
Hook problem: not every cluster is a round blob
K-Means and K-Medoids ask points to gather around centers. But some data forms curved regions, uneven islands, or scattered noise.
DBSCAN starts from a different question: where is the data dense enough to keep walking?
core point
First naive idea: force every point into a cluster
Forcing every point into a cluster hides outliers. A point far from all dense regions should be allowed to stay suspicious.
DBSCAN repairs that by giving sparse points a name: noise.
Core invention: core, border, noise
With radius epsilon and threshold minPts:
- A core point has at least
minPtspoints in itsepsilon-neighborhood, counting itself. - A border point is not core, but it is reachable from a core point.
- A noise point is not density-reachable from any core point.
Clusters are connected components of density-reachable points.
Interactive trace lab
Clustering algorithm trace lab
Step 1/2: A point is core when its epsilon-neighborhood has at least minPts points.
Grow dense neighborhoods and leave isolated points as noise.
0.9
3
Implementation sketch
for each unvisited point:
find neighbors within epsilon;
if the point is not core, mark it noise for now;
otherwise expand a cluster through neighboring core points;
Correctness intuition and cost
The invariant is reachability: once expansion starts from a core point, every point added is connected through a chain of dense neighborhoods.
Without a spatial index, checking neighborhoods is O(n^2). With an index such as a grid, k-d tree, or R-tree, neighborhood queries can be much faster on suitable data.
Common confusions
- DBSCAN does not require
k. epsilonis a distance scale, not a probability threshold.- Border points can be ambiguous when they touch multiple core regions.
- Varying densities are hard for one fixed
epsilon.
Move centroids until squared-distance clusters stop changing.
Use real data points as centers so outliers have less leverage.
Grow dense neighborhoods and leave isolated points as noise.
Order points by density reachability instead of choosing one fixed radius.
Keep clusters that stay stable across density levels.
Alternate soft membership estimates with Gaussian parameter updates.
Exercises
- Why does DBSCAN need both
epsilonandminPts? - Why is noise not the same as a tiny cluster?
- What happens when two dense regions are connected by a thin bridge?
Graph connections : DBSCAN