Graph connections

Draft

Davies-Bouldin Index

Score clustering by averaging each cluster's worst spread-versus-separation rival.

concept intermediate machine-learningmetricsclustering

Hook problem: one bad neighbor can matter

A clustering can look acceptable on average while one cluster has a dangerously similar neighbor.

Davies-Bouldin Index asks each cluster: “Which other cluster is my worst rival?”

Worst neighboring clusterDB asks each cluster which other cluster is most similar after combining spread and centroid distance.
Cluster A0.127

worst rival: B

Cluster B0.127

worst rival: A

Cluster C0.123

worst rival: A

DB0.126

lower is better

First naive idea: inspect only cluster spread

Average distance to the centroid tells you whether a cluster is compact.

But a spread of 0.8 is fine if the nearest centroid is 8 units away, and risky if the nearest centroid is 1 unit away.

Core invention: spread relative to centroid separation

For cluster i:

  • S_i is the average distance from its points to centroid i.
  • M_ij is the distance between centroids i and j.
  • R_ij=(S_i+S_j)/M_ij is the similarity between two clusters.

DB keeps the worst R_ij for each cluster, then averages those worst cases.

Formal version

DB=1kimaxjiSi+SjMijDB=\frac{1}{k}\sum_i \max_{j\ne i} \frac{S_i+S_j}{M_{ij}}

Lower DB is better. Small values mean every cluster’s worst rival is still separated compared with their combined spread.

Interactive preset lab

Internal clustering metric preset lab

Explanation: Three compact groups are far apart, so cohesion and separation agree.

Silhouette0.889

higher is better

Calinski-Harabasz251.312

higher is better

Davies-Bouldin0.126

lower is better

Dunn5.358

higher is better

B_k86.004
W_k1.027
Min gap4.418
Max diameter0.825

Implementation sketch

function dbClusterScore(scatterI: number, scatterJ: number, centroidDistance: number) {
  return centroidDistance === 0 ? null : (scatterI + scatterJ) / centroidDistance;
}

Complexity

Computing scatter is O(n) after centroids are known. Comparing all centroid pairs is O(k^2), where k is the number of clusters.

Common confusions

  • Lower DB is better.
  • DB uses centroids and average scatter, so it may miss non-convex cluster shapes.
  • The max is deliberate: each cluster is judged by its worst neighbor.
Internal metric contrastHigher is better for Silhouette, CH, and Dunn; lower is better for DB.
Compact islands

Silhouette: 0.889

CH: 251.312

DB: 0.126

Dunn: 5.358

Stretched clusters

Silhouette: 0.551

CH: 17.953

DB: 0.474

Dunn: 0.903

Bridge point

Silhouette: 0.62

CH: 28.707

DB: 0.446

Dunn: 0.327

Bad split

Silhouette: -0.314

CH: 0.035

DB: 13.062

Dunn: 0.08

Exercises

  1. Why does the formula add S_i + S_j?
  2. Why does a small centroid distance increase DB?
  3. What does the max operation protect against hiding?

Graph connections : Davies-Bouldin Index