Graph connections

Draft

K-Medoids Clustering

Cluster around representative data points instead of averaged centroids, making center choice more robust to outliers.

algorithm intermediate machine-learningclusteringalgorithms

Hook problem: the mean can drift into empty space

K-Means uses an average point as a center. That is efficient, but a single faraway point can pull the mean toward a place that is not representative of any actual example.

K-Medoids asks for a stricter kind of center: choose real data points.

Centers can be real pointsK-Medoids pays distance to chosen examples, so an outlier center is visibly expensive.
a1a2a3b1b2b3c1c2c3out
representative medoids9.38

medoids: a2, b2, c2

outlier as medoid15.88

medoids: a2, b2, out

bad swap candidate10.11

medoids: a1, b1, c1

First naive idea: keep K-Means and ignore the outlier

Ignoring the outlier by hand is not an algorithm. If the data changes, the rule changes too.

The smallest repair is to change what a center is allowed to be.

Core invention: medoids are examples

A medoid is a selected data point whose cluster members are close to it. K-Medoids usually minimizes a sum of distances:

i=1nd(xi,mci)\sum_{i=1}^{n} d(x_i, m_{c_i})

Here m_c is the medoid for cluster c. Because medoids must be real examples, the chosen center is inspectable and less easily dragged by an extreme coordinate.

Interactive trace lab

Clustering algorithm trace lab

Step 1/2: Centers must be existing examples, not averaged coordinates.

pick medoids

Use real data points as centers so outliers have less leverage.

candidate

a2, b2, c2

Implementation sketch

choose k initial medoids;
repeat {
  assign each point to its nearest medoid;
  try swapping a medoid with a non-medoid point;
  keep the swap if total distance decreases;
}

Correctness intuition and cost

The swap version is a local search: every accepted swap lowers the chosen distance objective, so the process eventually stops when no tested swap helps.

It is usually more expensive than K-Means because swap candidates require many distance evaluations. The benefit is that it works with any distance function and keeps centers as real examples.

Common confusions

  • A medoid is not the same as a centroid.
  • K-Medoids still needs k in advance.
  • It is more robust to outliers, but not immune to bad k or bad distance choices.
Clustering algorithm pathStart with centroid partitions, repair center robustness, then move to density and probability.
1. K-Means

k-means

2. K-Medoids

k-medoids

3. DBSCAN

dbscan

4. OPTICS

optics

5. HDBSCAN

hdbscan

6. EM for GMM

em-for-gmm

Exercises

  1. Why might a real example be easier to explain than a mean vector?
  2. Which step makes K-Medoids slower than K-Means?
  3. When would a non-Euclidean distance make K-Medoids attractive?

Graph connections : K-Medoids Clustering