Quality Measures for Clustering#

Evaluating the quality of a clustering is not as simple as it looks at first glance. Almost always we do not have a ground truth at hand (external evaluation). Instead we can only look at size and shape of clusters themselves (internal evaluation). There exist lots of internal evaluation metrics. Below we only consider two examples. There is no best metric, because cluster evaluation heavily depends on the intended application. The only reliable evaluation metric is human inspection. But human inspection is restricted to visualizations, which are not available for high dimensional data sets. Dimensionality reduction techniques may help.

In the following we represent clusters as subsets of our data set. In other words, a cluster \(C\) is a subset of \(\{x_1,\ldots,x_n\}\).

Silhouette Score#

The silhouette score relates intra-cluster distances to inter-cluster distances. It is defined for each sample of a data set. The silhouette score of a whole data set then is the mean score of all samples.

Given some distance measure \(d:X\times X\rightarrow[0,\infty)\) (usually Euclidean distance) the intra-cluster distance of a sample \(x\) and the cluster \(C\) the sample belongs to is the mean distance of the sample to all other samples in the cluster:

\[\begin{equation*} \text{intra}(x, C):=\frac{1}{|C|-1}\,\sum_{\tilde{x}\in C}d(x,\tilde{x}). \end{equation*}\]

The inter-cluster distance of a sample \(x\) to a cluster \(\tilde{C}\) the sample does not belong to is the mean distance of the sample to all samples in the cluster under consideration:

\[\begin{equation*} \text{inter}(x, \tilde{C}):=\frac{1}{|\tilde{C}|}\,\sum_{\tilde{x}\in\tilde{C}}d(x,\tilde{x}). \end{equation*}\]

Now the silhouette score of a sample \(x\) is the ratio of the intra-cluster distance and the smallest inter-cluster-distance:

\[\begin{equation*} a:=\text{intra}(x,C),\quad b:=\min_{\tilde{C}\neq C}\text{inter}(x,\tilde{C}),\quad \text{silhouette}(x):=\frac{b-a}{\max\{a,b\}}. \end{equation*}\]

If \(x\) is the only element in \(C\), then this formula does not work and one sets the silhouette score to zero.

Silhouette score always lie in \([-1,1]\). It is the higher the lower the intra-cluster distance is and the higher the inter-cluster distance is. Thus, high silhouette score for a sample indicates that it belongs to a cluster well separated from all other clusters. Score close to 0 indicates that the sample belongs to overlapping clusters. A score close to -1 indicates a missclustering (sample is closer to other clusters than to its own cluster).

Silhouette score of a data set represents the average clustering quality. Many missclustered samples result in negative silhouette score and so on. Note, that a silhouette score close to zero may indicate that half the samples have been missclustered as well as that there is no clustering (all clusters heavily overlap).

different silhouette scores

Fig. 58 Silhouette scores for different clustering results.#

Important

Silhouette score depends on the chosen distance and, thus, on scaling of each feature. If some feature has much higher numerical values than other features, then that feature will dominate the distances.

Another noteworthy point is that silhouette scores are more reliable if clusters are convex. Else inter-cluster distances might be smaller than intra-cluster distances although clusters were correctly identified.

nonconvex clusters

Fig. 59 For non-convex clusters silhouette score may indicate bad clustering results although clusters have been identified correctly.#

Davies-Bouldin Index#

The Davies-Bouldin index relates cluster diameters to distances between clusters. It is defined for each pair of clusters. The Davies-Bouldin index of a single cluster is the worst Davies-Bouldin index of each pair containing the cluster under consideration. The Davies-Bouldin index of a whole clustering is the mean Davies-Bouldin index of all clusters.

To define the Davies-Bouldin index we need the centroid of a cluster \(C\). It’s the coordinatewise arithmetic mean of all samples in the cluster:

\[\begin{equation*} \text{cent}(C):=\frac{1}{|C|}\,\sum_{x\in C}x. \end{equation*}\]

The cluster radius can be defined as the mean distance of samples to the cluster’s centroid:

\[\begin{equation*} r(C):=\frac{1}{|C|},\sum_{x\in C}d\bigl(x,\text{cent}(C)\bigr) \end{equation*}\]

with some distance measure \(d\). Usually \(d\) is the Euclidean distance, because the notion of centroid is based on considerations involving Euclidean distances. For other distance measures introducing a sensible notion of centroids is difficult. Given two clusters \(C_1\) and \(C_2\) we may define the cluster distance as the distance of their centroids:

\[\begin{equation*} \text{dist}(C_1,C_2):=d\bigl(\text{cent}(C_1),\text{cent}(C_2)\bigr). \end{equation*}\]

The Davies-Bouldin index of two clusters \(C_1\) and \(C_2\) is

\[\begin{equation*} \text{DB}(C_1,C_2):=\frac{r(C_1)+r(C_2)}{\text{dist}(C_1,C_2)}. \end{equation*}\]

It takes values in \([0,\infty)\) and is the closer to zero the smaller the clusters are and the higher the distance between clusters is.

Davies-Bouldin index of two clusters

Fig. 60 Davies-Bouldin index relates distance between clusters to cluster diameters.#

A Davies-Bouldin index above 1 indicates overlapping clusters (at least if the clusters’ shapes are close to spheres). If clusters are not sphere shaped the Davies-Boulding index does not yield useful information.

Davies-Bouldin index for clusters not sphere shaped

Fig. 61 If clusters are not sphere shaped Davies-Bouldin index may indicate bad clustering although clustering is correct.#

Note that the Davies-Bouldin index of two clusters is symmetric, that is, does not depend on the ordering of the clusters. If there are more than two clusters, the Davies-Bouldin index of each cluster is

\[\begin{equation*} \text{DB}(C):=\max_{\tilde{C}\neq C}\text{DB}(C,\tilde{C}) \end{equation*}\]

and the Davies-Bouldin index of the whole clustering is mean Davies-Bouldin index of all clusters.