Draft
K-Means Clustering
Partition points by repeatedly assigning them to nearest centroids and moving each centroid to its assigned mean.
Hook problem: make piles without labels
Suppose you have points on a map but no answer key. You only know you want k = 3 compact groups.
A naive idea is to draw three circles by eye. K-Means turns that drawing habit into a repeatable loop.
Assign every point to the nearest current centroid by squared distance.
Move each centroid to the mean of the points assigned to it.
Reassign with the moved centroids; only points near boundaries can change.
Assignments and centroids agree for this fixture, so the local optimum is stable.
First naive idea: choose centers once
If the first centers are a little wrong, many points are assigned to the wrong pile. Keeping those centers fixed bakes in the mistake.
The repair is small: after assigning points, let the centers move.
Core invention: assignment then update
K-Means alternates two moves:
- Assign every point to its nearest centroid.
- Move each centroid to the mean of the points assigned to it.
The objective it tries to reduce is the within-cluster squared error:
In words: each point pays the squared distance to the centroid of its assigned cluster.
Interactive trace lab
Clustering algorithm trace lab
Step 1/4: Assign every point to the nearest current centroid by squared distance.
Move centroids until squared-distance clusters stop changing.
C1=(1, 1); C2=(6.20, 1); C3=(3.40, 5.60)
a1->C1, a2->C1, a3->C1, b1->C2, b2->C2, b3->C2, c1->C3, c2->C3, c3->C3, out->C3
Implementation sketch
while (assignmentsChanged) {
assign each point to the nearest centroid;
replace each centroid with the mean of its assigned points;
}
The algorithm stops at a local optimum, not necessarily the best possible clustering.
Correctness intuition and cost
The assignment step cannot increase the objective because each point chooses the cheapest current centroid. The update step cannot increase it because the mean minimizes squared distance inside one fixed cluster.
For n points, k clusters, d dimensions, and t iterations, the common implementation costs O(t n k d).
Common confusions
kis chosen before the algorithm runs.- K-Means assumes roughly compact, centroid-shaped clusters.
- Different initial centroids can produce different local optima.
- The mean may be a point in empty space, not an actual data example.
Move centroids until squared-distance clusters stop changing.
Use real data points as centers so outliers have less leverage.
Grow dense neighborhoods and leave isolated points as noise.
Order points by density reachability instead of choosing one fixed radius.
Keep clusters that stay stable across density levels.
Alternate soft membership estimates with Gaussian parameter updates.
Exercises
- Why does K-Means use squared distance rather than cluster labels?
- What can go wrong if one cluster is long and curved?
- Why might two runs with different random starts disagree?
Graph connections : K-Means Clustering