Draft
Calinski-Harabasz Index
Score clustering by comparing between-cluster centroid scatter with within-cluster scatter.
Hook problem: summarize the whole geometry
Silhouette score looks point by point. Calinski-Harabasz Index asks for a coarser summary:
Are cluster centers far apart compared with how spread out their own points are?
It is still an internal clustering metric. It uses geometry and cluster assignments, not reference labels.
between-cluster scatter
within-cluster scatter
higher is better
First naive idea: use only within-cluster spread
Small within-cluster spread is good, but it is not enough.
Three tiny clusters sitting almost on top of each other should not receive the same judgment as three tiny clusters separated across the plane.
Core invention: between scatter divided by within scatter
CH uses two sums:
W_k: within-cluster scatter, the sum of squared distances from each point to its own cluster centroid.B_k: between-cluster scatter, the cluster-size-weighted squared distance from each cluster centroid to the global centroid.
Formal version
For n points and k clusters:
Plainly: the numerator rewards cluster centers that are far apart; the denominator penalizes points that wander far from their own centers.
Higher CH is better. The score is unavailable when k < 2, when n <= k, or when within-cluster scatter is zero.
Interactive preset lab
Internal clustering metric preset lab
Explanation: Three compact groups are far apart, so cohesion and separation agree.
higher is better
higher is better
lower is better
higher is better
Implementation sketch
function calinskiHarabasz(between: number, within: number, n: number, k: number) {
if (k < 2 || n <= k || within === 0) return null;
return (between / (k - 1)) / (within / (n - k));
}
Complexity
After assignments are known, computing centroids and squared-distance sums is O(n) for fixed-dimensional points. If dimensions are variable, include the dimension factor.
Common confusions
- CH is higher-better, unlike Davies-Bouldin.
- CH is not an external score; it does not know the true labels.
- Squared distances make far-off points matter more than small local noise.
Silhouette: 0.889
CH: 251.312
DB: 0.126
Dunn: 5.358
Silhouette: 0.551
CH: 17.953
DB: 0.474
Dunn: 0.903
Silhouette: 0.62
CH: 28.707
DB: 0.446
Dunn: 0.327
Silhouette: -0.314
CH: 0.035
DB: 13.062
Dunn: 0.08
Exercises
- Which part of the formula grows when cluster centroids move farther apart?
- Which part grows when each cluster gets stretched?
- Why is
n <= ka degenerate case?
Graph connections : Calinski-Harabasz Index