Graph connections

Draft

Calinski-Harabasz Index

Score clustering by comparing between-cluster centroid scatter with within-cluster scatter.

concept intermediate machine-learningmetricsclustering

Hook problem: summarize the whole geometry

Silhouette score looks point by point. Calinski-Harabasz Index asks for a coarser summary:

Are cluster centers far apart compared with how spread out their own points are?

It is still an internal clustering metric. It uses geometry and cluster assignments, not reference labels.

Between scatter versus within scatterCH rewards centroids that are far from the global center while points stay close to their own centroid.
p1p2p3p4p5p6p7p8p9
B_k86.004

between-cluster scatter

W_k1.027

within-cluster scatter

CH251.312

higher is better

First naive idea: use only within-cluster spread

Small within-cluster spread is good, but it is not enough.

Three tiny clusters sitting almost on top of each other should not receive the same judgment as three tiny clusters separated across the plane.

Core invention: between scatter divided by within scatter

CH uses two sums:

  • W_k: within-cluster scatter, the sum of squared distances from each point to its own cluster centroid.
  • B_k: between-cluster scatter, the cluster-size-weighted squared distance from each cluster centroid to the global centroid.

Formal version

For n points and k clusters:

CH=Bk/(k1)Wk/(nk)CH=\frac{B_k/(k-1)}{W_k/(n-k)}

Plainly: the numerator rewards cluster centers that are far apart; the denominator penalizes points that wander far from their own centers.

Higher CH is better. The score is unavailable when k < 2, when n <= k, or when within-cluster scatter is zero.

Interactive preset lab

Internal clustering metric preset lab

Explanation: Three compact groups are far apart, so cohesion and separation agree.

Silhouette0.889

higher is better

Calinski-Harabasz251.312

higher is better

Davies-Bouldin0.126

lower is better

Dunn5.358

higher is better

B_k86.004
W_k1.027
Min gap4.418
Max diameter0.825

Implementation sketch

function calinskiHarabasz(between: number, within: number, n: number, k: number) {
  if (k < 2 || n <= k || within === 0) return null;
  return (between / (k - 1)) / (within / (n - k));
}

Complexity

After assignments are known, computing centroids and squared-distance sums is O(n) for fixed-dimensional points. If dimensions are variable, include the dimension factor.

Common confusions

  • CH is higher-better, unlike Davies-Bouldin.
  • CH is not an external score; it does not know the true labels.
  • Squared distances make far-off points matter more than small local noise.
Internal metric contrastHigher is better for Silhouette, CH, and Dunn; lower is better for DB.
Compact islands

Silhouette: 0.889

CH: 251.312

DB: 0.126

Dunn: 5.358

Stretched clusters

Silhouette: 0.551

CH: 17.953

DB: 0.474

Dunn: 0.903

Bridge point

Silhouette: 0.62

CH: 28.707

DB: 0.446

Dunn: 0.327

Bad split

Silhouette: -0.314

CH: 0.035

DB: 13.062

Dunn: 0.08

Exercises

  1. Which part of the formula grows when cluster centroids move farther apart?
  2. Which part grows when each cluster gets stretched?
  3. Why is n <= k a degenerate case?

Graph connections : Calinski-Harabasz Index