Graph connections

Draft

K-Means Clustering

Partition points by repeatedly assigning them to nearest centroids and moving each centroid to its assigned mean.

algorithm intermediate machine-learningclusteringalgorithms

Hook problem: make piles without labels

Suppose you have points on a map but no answer key. You only know you want k = 3 compact groups.

A naive idea is to draw three circles by eye. K-Means turns that drawing habit into a repeatable loop.

Centroids chase assignmentsK-Means alternates nearest-center assignment with mean recomputation.
a1a2a3b1b2b3c1c2c3outC1C2C3
1. assignassign-1

Assign every point to the nearest current centroid by squared distance.

2. updateupdate-1

Move each centroid to the mean of the points assigned to it.

3. assignassign-2

Reassign with the moved centroids; only points near boundaries can change.

4. donedone

Assignments and centroids agree for this fixture, so the local optimum is stable.

First naive idea: choose centers once

If the first centers are a little wrong, many points are assigned to the wrong pile. Keeping those centers fixed bakes in the mistake.

The repair is small: after assigning points, let the centers move.

Core invention: assignment then update

K-Means alternates two moves:

  1. Assign every point to its nearest centroid.
  2. Move each centroid to the mean of the points assigned to it.

The objective it tries to reduce is the within-cluster squared error:

i=1nxiμci2\sum_{i=1}^{n}\lVert x_i-\mu_{c_i}\rVert^2

In words: each point pays the squared distance to the centroid of its assigned cluster.

Interactive trace lab

Clustering algorithm trace lab

Step 1/4: Assign every point to the nearest current centroid by squared distance.

assign

Move centroids until squared-distance clusters stop changing.

centers

C1=(1, 1); C2=(6.20, 1); C3=(3.40, 5.60)

changed points

a1->C1, a2->C1, a3->C1, b1->C2, b2->C2, b3->C2, c1->C3, c2->C3, c3->C3, out->C3

Implementation sketch

while (assignmentsChanged) {
  assign each point to the nearest centroid;
  replace each centroid with the mean of its assigned points;
}

The algorithm stops at a local optimum, not necessarily the best possible clustering.

Correctness intuition and cost

The assignment step cannot increase the objective because each point chooses the cheapest current centroid. The update step cannot increase it because the mean minimizes squared distance inside one fixed cluster.

For n points, k clusters, d dimensions, and t iterations, the common implementation costs O(t n k d).

Common confusions

  • k is chosen before the algorithm runs.
  • K-Means assumes roughly compact, centroid-shaped clusters.
  • Different initial centroids can produce different local optima.
  • The mean may be a point in empty space, not an actual data example.
Which pain does each algorithm repair?The family splits by center assumptions, density assumptions, hierarchy, and soft probability.
K-Means

Move centroids until squared-distance clusters stop changing.

K-Medoids

Use real data points as centers so outliers have less leverage.

DBSCAN

Grow dense neighborhoods and leave isolated points as noise.

OPTICS

Order points by density reachability instead of choosing one fixed radius.

HDBSCAN

Keep clusters that stay stable across density levels.

EM for GMM

Alternate soft membership estimates with Gaussian parameter updates.

Exercises

  1. Why does K-Means use squared distance rather than cluster labels?
  2. What can go wrong if one cluster is long and curved?
  3. Why might two runs with different random starts disagree?

Graph connections : K-Means Clustering