Graph connections

Draft

DBSCAN

Find clusters as connected dense neighborhoods while marking sparse points as noise.

algorithm intermediate machine-learningclusteringalgorithms

Hook problem: not every cluster is a round blob

K-Means and K-Medoids ask points to gather around centers. But some data forms curved regions, uneven islands, or scattered noise.

DBSCAN starts from a different question: where is the data dense enough to keep walking?

Dense neighborhoods grow clustersDBSCAN starts from core points, absorbs border points, and leaves sparse points as noise.
p1p2p3p4p5p6p7p8p9p10
epsilon0.90
minPts3
p1 neighborsp1, p2, p3

core point

p9 rolenoise

First naive idea: force every point into a cluster

Forcing every point into a cluster hides outliers. A point far from all dense regions should be allowed to stay suspicious.

DBSCAN repairs that by giving sparse points a name: noise.

Core invention: core, border, noise

With radius epsilon and threshold minPts:

  • A core point has at least minPts points in its epsilon-neighborhood, counting itself.
  • A border point is not core, but it is reachable from a core point.
  • A noise point is not density-reachable from any core point.

Clusters are connected components of density-reachable points.

Interactive trace lab

Clustering algorithm trace lab

Step 1/2: A point is core when its epsilon-neighborhood has at least minPts points.

find core points

Grow dense neighborhoods and leave isolated points as noise.

epsilon

0.9

minPts

3

Implementation sketch

for each unvisited point:
  find neighbors within epsilon;
  if the point is not core, mark it noise for now;
  otherwise expand a cluster through neighboring core points;

Correctness intuition and cost

The invariant is reachability: once expansion starts from a core point, every point added is connected through a chain of dense neighborhoods.

Without a spatial index, checking neighborhoods is O(n^2). With an index such as a grid, k-d tree, or R-tree, neighborhood queries can be much faster on suitable data.

Common confusions

  • DBSCAN does not require k.
  • epsilon is a distance scale, not a probability threshold.
  • Border points can be ambiguous when they touch multiple core regions.
  • Varying densities are hard for one fixed epsilon.
Which pain does each algorithm repair?The family splits by center assumptions, density assumptions, hierarchy, and soft probability.
K-Means

Move centroids until squared-distance clusters stop changing.

K-Medoids

Use real data points as centers so outliers have less leverage.

DBSCAN

Grow dense neighborhoods and leave isolated points as noise.

OPTICS

Order points by density reachability instead of choosing one fixed radius.

HDBSCAN

Keep clusters that stay stable across density levels.

EM for GMM

Alternate soft membership estimates with Gaussian parameter updates.

Exercises

  1. Why does DBSCAN need both epsilon and minPts?
  2. Why is noise not the same as a tiny cluster?
  3. What happens when two dense regions are connected by a thin bridge?

Graph connections : DBSCAN