Graph connections

Draft

Purity

Score a clustering by giving each predicted cluster credit for its largest reference-label majority.

concept beginner machine-learningmetricsclustering

Hook problem: clusters have no built-in names

A clustering algorithm groups eight learning cards. A teacher also has an answer key: each card is really about graph, tree, or hash.

The cluster names A, B, C, and D are arbitrary. Cluster A does not mean “graph” just because the algorithm printed A.

Reference labels versus clustersCluster names are arbitrary; only membership should be compared.
Ada

reference: graph

cluster: A

Ben

reference: graph

cluster: A

Cy

reference: graph

cluster: B

Di

reference: tree

cluster: B

Eli

reference: tree

cluster: C

Fay

reference: tree

cluster: C

Gus

reference: hash

cluster: D

Han

reference: hash

cluster: D

First naive idea: compare cluster names directly

If we treated A, B, C, and D as class labels, the score would depend on random names chosen by the clustering algorithm.

That is not a stable metric. Renaming cluster A to Z should not change the evaluation.

Core invention: each cluster gets its best label

Purity asks each predicted cluster one local question:

Which reference label is the majority inside this cluster?

Then it adds the majority counts and divides by the number of items.

Cluster-majority ledgerEach cluster contributes only its largest reference-label count.
Cluster A2/2

majority label: graph

Cluster B1/2

majority label: graph

Cluster C2/2

majority label: tree

Cluster D2/2

majority label: hash

Purity7/8 = 0.875

Contingency table: 4 clusters by 3 reference labels.

Formal version

Let C_k be predicted cluster k, L_j be reference label j, and n be the number of items.

Purity=1nkmaxjCkLj\text{Purity}=\frac{1}{n}\sum_k \max_j |C_k \cap L_j|

Plainly: for each cluster, keep only its largest answer-key pile.

For the fixture:

2+1+2+28=78=0.875\frac{2 + 1 + 2 + 2}{8}=\frac{7}{8}=0.875

Interactive preset lab

Use the presets to compare the same clustering result under Purity, Rand Index, Adjusted Rand Index, and Fowlkes-Mallows Index.

Clustering metric preset lab

Explanation: One mixed cluster and two split true classes make Purity look high while pair metrics reveal damage.

Purity0.875

7/8

Rand Index0.821

23/28

Adjusted Rand Index0.444

S=3, E=1

Fowlkes-Mallows Index0.567

pair P=0.75, pair R=0.429

TP3
FP1
FN4
TN20

Static no-JS fallback:

Purity cluster contributions
ClusterMajority labelContribution
Agraph2/2
Bgraph1/2
Ctree2/2
Dhash2/2

Where Purity breaks

Purity rewards clean clusters, but it does not penalize splitting enough.

If every item becomes its own singleton cluster, every cluster is perfectly pure, so Purity becomes 1.0. That does not mean the clustering recovered the useful groups.

Preset contrastThe same four metrics react differently to over-splitting and merging.
Fixture

Purity: 0.875

RI: 0.821

ARI: 0.444

FMI: 0.567

Perfect match

Purity: 1

RI: 1

ARI: 1

FMI: 1

Singleton over-split

Purity: 1

RI: 0.75

ARI: 0

FMI: not available

All merged

Purity: 0.375

RI: 0.25

ARI: 0

FMI: 0.5

Implementation sketch

function purity(items: { cluster: string; label: string }[]) {
  const counts = new Map<string, Map<string, number>>();

  for (const item of items) {
    const row = counts.get(item.cluster) ?? new Map<string, number>();
    row.set(item.label, (row.get(item.label) ?? 0) + 1);
    counts.set(item.cluster, row);
  }

  let kept = 0;
  for (const row of counts.values()) {
    kept += Math.max(...row.values());
  }

  return items.length === 0 ? null : kept / items.length;
}

Complexity

Building the cluster-by-label table takes O(n) time for n items. The extra space is O(rc) for r predicted clusters and c reference labels that actually appear.

Common confusions

  • Purity is an external clustering metric because it needs reference labels.
  • Purity is not accuracy: cluster names are arbitrary, so clusters are first matched locally to majority labels.
  • High Purity does not prove a good clustering when the algorithm over-splits.
Clustering metric pathStart with cluster majorities, then move to pair counts and pair-positive balance.
1. Purity

purity

2. Rand Index

rand-index

3. Adjusted Rand Index

adjusted-rand-index

4. Fowlkes-Mallows

fowlkes-mallows-index

Exercises

  1. Which fixture cluster causes the 1/2 contribution?
  2. Why does singleton over-splitting get Purity 1.0?
  3. What would happen to Purity if clusters C and D were merged?

Graph connections : Purity