
Selecting the optimum number of clusters

Clustering algorithms come with many challenges. For centroid-based clustering algorithms like K-Means, the primary ones are:

  • Initialising the cluster centroids.
  • Choosing the optimum number of clusters.
  • Evaluating clustering quality in the absence of labels.
  • Reducing the dimensionality of the data.

In this post we will focus on different ways of choosing the optimum number of clusters. The basic idea is to minimize the sum of the within cluster sum of squares for all the clusters.

Within Cluster Sum of Squares (WSS) for Cluster k = \sum_{i=1}^{|C_k|}||x_i-{\mu}_k||^2

where x_i are the instances in cluster k, {\mu}_k is the centroid of cluster k, and |C_k| is the cardinality of cluster k.

Compute the sum of the WSS for all clusters and minimize it. If you continue to increase the number of clusters, the sum of WSS will decrease. But with too many clusters, the model will overfit (think of the scenario where each cluster corresponds to one instance).
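As a quick sanity check, here is a minimal R sketch that recomputes the total WSS by hand from a kmeans fit and compares it against the tot.withinss value that kmeans itself reports (the choice of K=3 here is arbitrary, purely for illustration):

data(mtcars)
mydata <- scale(na.omit(mtcars))

fit <- kmeans(mydata, centers = 3)

# For every point, take the squared Euclidean distance to the centroid
# of its assigned cluster, and sum over all points.
total_wss <- sum(sapply(1:nrow(mydata), function(i) {
  sum((mydata[i, ] - fit$centers[fit$cluster[i], ])^2)
}))

all.equal(total_wss, fit$tot.withinss)  # TRUE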

The number of clusters can be thought of as a regularization term in the loss function:

L=\sum_{k=1}^{|C|}\sum_{i=1}^{|C_k|}||x_i-{\mu}_k||^2+\frac{1}{2}{\lambda}|C|^2
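To see the trade-off concretely, here is a minimal sketch that evaluates this regularized loss over a range of K, continuing with the scaled mtcars data from above. The value of \lambda below is an arbitrary illustrative choice, not something prescribed here; in practice it has to be tuned:

lambda <- 1  # illustrative regularization strength; must be tuned in practice
loss <- sapply(1:15, function(k) {
  # total within-cluster sum of squares plus the complexity penalty
  kmeans(mydata, centers = k)$tot.withinss + 0.5 * lambda * k^2
})
plot(1:15, loss, type = "b", xlab = "Number of Clusters", ylab = "Regularized loss")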

Following are some of the methods commonly deployed to choose the optimum number of clusters.

Within Cluster Sum of Squares

In this method, we iterate over a range of probable values of K and, for each value, compute the WSS of every cluster. We then plot the total WSS against K and look for a "sharp bend" or "elbow" in the curve. Here is the R code to do that:

data(mtcars)
mydata <- na.omit(mtcars)  # drop rows with missing values
mydata <- scale(mydata)    # standardize each variable to zero mean, unit variance

# For K=1 the WSS is just the total sum of squares of the scaled data.
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))

# For K=2..15, fit K-means and record the total within-cluster sum of squares.
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers=i)$withinss)

plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")

Variation of the sum of within cluster sum of squares vs. the number of clusters.

As we see from the curve above, the sum of within cluster sum of squares decreases sharply from K=2 to K=4. If we assume that there were at least K=2 clusters to begin with, then K=4 is the optimal number of clusters in this case.
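Reading the elbow off a plot is subjective. One simple heuristic (an assumption of this sketch, not something prescribed above) is to pick the K where the curve bends most sharply, i.e. where the second difference of the WSS values is largest:

# Second differences of the wss vector computed above; the curve bends
# most sharply where wss[k-1] - 2*wss[k] + wss[k+1] is largest.
second_diff <- diff(diff(wss))
which.max(second_diff) + 1  # the corresponding value of K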

Information Criteria: Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC)

AIC is used to select the best statistical model for the data from a set of candidate models. It assumes that the data is generated by some statistical model; for example, in EM clustering we assumed that the examples were generated from a Gaussian Mixture Model. The AIC value is calculated as:

AIC=2k-2ln(L)

where k is the number of free parameters for the model and L is the maximum value of the likelihood of the data. Our goal is to minimize the AIC value.

The AIC method rewards a model whose likelihood of generating the data is high, but at the same time penalizes models that need a very large number of free parameters to attain that likelihood (since a large number of parameters overfits the model to the data).

In the case of the K-Means algorithm, if we assume that the clusters are generated from a Gaussian model, then the likelihood of the data is:

L=(\sqrt{2\pi}\sigma)^{-N}*e^{-\frac{\sum_{i=1}^{N}(X_i-\mu)^2}{2{\sigma}^2}}

ln(L)=-Nln(\sqrt{2\pi}\sigma)-\frac{\sum_{i=1}^{N}(X_i-\mu)^2}{2{\sigma}^2}

where \sigma is the standard deviation of the observed data and can be treated as a constant for the model, since it is independent of the cluster assignments. The above likelihood is maximized for the K-Means algorithm when the cluster means are chosen so that the sum of the within cluster sum of squares is minimized; in other words, the maximum value of the likelihood function above is:

ln(L)_{\text{max}}=-Nln(\sqrt{2\pi}\sigma)-\frac{\sum_{j=1}^{K}\sum_{X_i{\in}S_j}(X_i-\mu_j)^2}{2{\sigma}^2}

The number of free parameters for the model is K*d, where d is the dimensionality of the data: each of the K cluster centroids contributes d coordinates.
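Substituting k = K*d and the maximized log-likelihood into AIC=2k-2ln(L) gives:

AIC=2*K*d+2Nln(\sqrt{2\pi}\sigma)+\frac{\sum_{j=1}^{K}\sum_{X_i{\in}S_j}(X_i-\mu_j)^2}{{\sigma}^2}

The middle term does not depend on K, and since the data has been scaled to unit variance we can take \sigma=1. Omitting these constant terms from the above equation: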

AIC=2*K*d+\sum_{j=1}^{K}\sum_{X_i{\in}S_j}(X_i-\mu_j)^2

The second term in the above equation is the sum of the within cluster sum of squares. In R code:

data(mtcars)
mydata <- na.omit(mtcars)
mydata <- scale(mydata)

aic <- sapply(1:15, function(k) {
  fit <- kmeans(mydata, centers=k)
  # Total within-cluster sum of squares; for K=1 this is the total sum of squares.
  wss <- if (k==1) (nrow(mydata)-1)*sum(apply(mydata,2,var)) else sum(fit$withinss)
  d <- ncol(fit$centers)  # dimensionality of the data
  2*k*d + wss             # AIC with the constant terms omitted
})

plot(1:15, aic, type="b", xlab="Number of Clusters", ylab="AIC")

Variation of the AIC values with number of clusters.

It is pretty clear from the above curve that the ideal number of clusters is K=4.

Closely related to AIC is the BIC (Bayesian Information Criterion), in which the penalty on model complexity (k) is stronger: since ln(N) > 2 whenever N > e^2 ≈ 7.4, the BIC penalty exceeds the AIC penalty for all but the smallest datasets. It is formulated as:

BIC=k*ln(N)-2ln(L)

where, as before, k denotes the number of free parameters in the model, L is the maximum likelihood of the data under the model, and N is the number of examples. For the K-Means algorithm, we have:

BIC=ln(N)*K*d+\sum_{j=1}^{K}\sum_{X_i{\in}S_j}(X_i-\mu_j)^2

data(mtcars)
mydata <- na.omit(mtcars)
mydata <- scale(mydata)

bic <- sapply(1:15, function(k) {
  fit <- kmeans(mydata, centers=k)
  # Total within-cluster sum of squares; for K=1 this is the total sum of squares.
  wss <- if (k==1) (nrow(mydata)-1)*sum(apply(mydata,2,var)) else sum(fit$withinss)
  d <- ncol(fit$centers)        # dimensionality of the data
  log(nrow(mydata))*k*d + wss   # BIC with the constant terms omitted
})

plot(1:15, bic, type="b", xlab="Number of Clusters", ylab="BIC")

Variation of the BIC values with the number of clusters.

From the above curve, it is also apparent that the optimal value is K=4.

If the dataset is small, we can iterate over a larger range of K, but if the dataset is huge, we have to use some heuristics to choose the range for K.
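For example, one common rule of thumb (an assumption here, not something mandated above) is to cap the search at roughly sqrt(N/2):

# Rule-of-thumb upper bound on the number of clusters to try
k_max <- ceiling(sqrt(nrow(mydata) / 2))

For the scaled mtcars data (N = 32), this gives k_max = 4.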

One important thing to note is that the clusters formed by K-Means depend on the initialization of the cluster centers: the final set of clusters may vary across multiple runs, since the K-Means algorithm only converges to a local minimum. Hence, to estimate the optimal number of clusters K, run the AIC and BIC computations multiple times and take the value of K that comes out optimal in the largest number of runs.
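A minimal sketch of this procedure, reusing the BIC computation from above and taking the value of K selected most often across 20 random restarts:

best_ks <- sapply(1:20, function(run) {
  bic <- sapply(1:15, function(k) {
    fit <- kmeans(mydata, centers = k)
    # BIC with the constant terms omitted, as derived above
    log(nrow(mydata)) * k * ncol(fit$centers) + fit$tot.withinss
  })
  which.min(bic)  # K that minimizes BIC in this run
})

# The modal K across all runs is our final estimate
as.integer(names(which.max(table(best_ks))))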
