
Initializing cluster centers with K-Means++

In the K-Means algorithm, we are not guaranteed to reach the global minimum, since the algorithm only converges to a local minimum. The local minimum reached, and the number of iterations required to reach it, depend on the selection of the initial set of random centroids.

Many methods have been proposed for selecting the initial set of centroids for K-Means clustering, such as the Scatter/Gather approach, where instead of starting with K random centroids, we start with \sqrt{KN} random data points and run agglomerative clustering on them to reduce them to K clusters. The centroids of these K clusters then become the initial centroids for the K-Means algorithm.

These approaches to K-Means clustering do not have a provable upper bound on the objective function \phi:

\phi = \sum_{C}\sum_{x \in C}\left\|x-\mu_C\right\|^2

where C ranges over the set of clusters \left\{C_1, C_2, ..., C_K\right\}, and \mu_C denotes the centroid of cluster C.
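
For reference, with dense data this objective is just a couple of lines of R; the names 'X', 'cl' and 'centers' below are purely illustrative:

# K-Means objective 'phi': the sum of squared distances of every data point to its
# assigned cluster centroid. 'X' is an N x d matrix, 'cl' a length-N vector of
# cluster indices and 'centers' a K x d matrix of centroids.
kmeans.objective <- function(X, cl, centers) {
  sum(rowSums((X - centers[cl, , drop = FALSE])^2))
}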

There is an improved version of the K-Means algorithm, termed the K-Means++ algorithm, which does have a provable upper bound on the objective function.

The following is the K-Means++ algorithm for selecting the initial centroids:

Step 1: Randomly pick one of the data points and add it to the set of initial centroids C.

Step 2: For each data point X_i, let D(X_i) be the distance from X_i to the nearest centroid chosen so far, i.e.

D(X_i) = \min_{j=1}^{|C|} \left\|X_i-C_j\right\| .

Step 3: Assign the probability of picking the data point X_i as the next centroid to be:

\frac{D(X_i)^2}{D(X_1)^2+D(X_2)^2+...+D(X_N)^2}.

This implies that the farther a data point is from its nearest existing centroid, the greater its probability of being picked as the next centroid.

Step 4: To choose one of the X_i probabilistically as in Step 3, generate a random number y between

D(X_1)^2 and D(X_1)^2+D(X_2)^2+...+D(X_N)^2,

then find the index 'm' such that

\sum_{j=1}^{m-1}D(X_j)^2\;<\;y\;\le\;\sum_{j=1}^{m}D(X_j)^2.

Then add X_m to the set of centroids C.

Step 5: Repeat Steps 2 to 4 until we have found K centroids.
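
Before moving to the C++ implementation used later in this post, here is a compact sketch of the same steps in R on a dense numeric matrix, using the squared Euclidean distance (the C++ version below works with cosine distances on sparse data instead):

# Minimal K-Means++ seeding sketch: returns the row indices of the K chosen centroids.
kmeanspp.seeds <- function(X, K) {
  centers <- sample(nrow(X), 1)                        # Step 1: first centroid at random
  for (k in 2:K) {
    # Step 2: squared distance of each point to its nearest chosen centroid
    D2 <- apply(X, 1, function(x)
      min(colSums((t(X[centers, , drop = FALSE]) - x)^2)))
    # Steps 3-4: sample the next centroid with probability proportional to D(X_i)^2,
    # via the cumulative sums of D2 and a single uniform draw
    y <- runif(1, 0, sum(D2))
    centers <- c(centers, which(cumsum(D2) >= y)[1])
  }
  centers                                              # Step 5: stop once K centroids are found
}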

After we have found the initial set of centroids using the K-Means++ algorithm, we can proceed with the standard K-Means algorithm to optimize the clusters.

According to the original paper on the K-Means++ algorithm by Arthur and Vassilvitskii, the expected value of the objective function is bounded above by:

E[\phi]\;\le\;8(\ln K + 2)\,\phi_{opt}

where K is the number of clusters and \phi_{opt} is the global minimum of the objective.
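
For example, with K = 6 clusters (the setting used later in this post), the guarantee works out to roughly a factor of 30:

E[\phi]\;\le\;8(\ln 6 + 2)\,\phi_{opt}\;\approx\;30.3\,\phi_{opt}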

The R implementation of the K-Means algorithm is in the 'stats' package. The drawbacks of this implementation are that it is slow and that it does not choose the initial set of centroids using the K-Means++ algorithm.

Our implementation in C++ returns the set of initial cluster centroids chosen by the K-Means++ algorithm, and these initial centroids can be passed as the 'centers' parameter of the 'kmeans' method in R.
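
For example, the seeds returned by 'cpp__kmeanspp' could be fed to R's 'kmeans' roughly as follows. This is only a sketch: it assumes 'dtm' is the sparse simple triplet matrix, 'mat <- as.matrix(dtm)' its dense form, 'k' the number of clusters, and that the C++ code has already been compiled with Rcpp::sourceCpp:

out <- cpp__kmeanspp(dtm$i, dtm$j, dtm$v, k)

# 'clusterCenters' holds the row indices of the seed documents chosen by K-Means++;
# the corresponding rows of the dense matrix become the initial centers for stats::kmeans
init.centers <- mat[out$clusterCenters, , drop = FALSE]
res <- kmeans(mat, centers = init.centers, iter.max = 10)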

We have also written a full K-Means implementation in C++ (exposed to R through Rcpp) that works directly with the sparse data format and chooses the initial set of centroids using the K-Means++ algorithm. We will show that this C++ implementation is faster and produces better clusterings on a labelled data set than the 'kmeans' method in R.

#include <Rcpp.h>
#include <unordered_map>
#include <cmath>
#include <ctime>
#include <limits>
using namespace Rcpp;

struct kmeansOutput {
  std::map<int, int> clusters;                         // data point index -> assigned cluster
  std::map<int, double> distances;                     // data point index -> distance to its cluster center
  std::map<int, std::map<int, double>> distanceHist;   // #centroids / iteration -> snapshot of distances
  std::map<int, std::map<int, int>> clustHist;         // #centroids / iteration -> snapshot of assignments
  std::vector<int> clusterCenters;                     // row indices of the chosen cluster centers
};

We declare a 'struct' for the output format. It holds the cluster assignments as a map (data point index => cluster index) and the distance of each data point to its corresponding cluster center, also as a map.

The 'distanceHist' variable tracks how the distance of each data point to its cluster center evolves with the number of chosen centroids (for K-Means++) and with the number of iterations (for K-Means).

Similarly, the 'clustHist' variable tracks the cluster assignments as they evolve with the number of chosen centroids (for K-Means++) and with the number of iterations (for K-Means).

// Cosine distance between two sparse rows, each represented as a map of (column index -> value).
// Returns |1 - cosine similarity| between the two vectors.
double rowDistances(const std::unordered_map<int, double> &colValsMap1, const std::unordered_map<int, double> &colValsMap2) {
  double sum1 = 0, sum2 = 0;
  
  // Squared norms of the two vectors.
  for (auto it = colValsMap1.begin(); it != colValsMap1.end(); ++it) sum1 += (it->second)*(it->second);
  for (auto it = colValsMap2.begin(); it != colValsMap2.end(); ++it) sum2 += (it->second)*(it->second);
  
  double sum = 0;
  
  // Dot product over the columns common to both vectors.
  for (auto it = colValsMap1.begin(); it != colValsMap1.end(); ++it) {
    auto it2 = colValsMap2.find(it->first);
    if (it2 != colValsMap2.end()) sum += (it->second)*(it2->second);
  }
  
  return fabs(1-(sum/(sqrt(sum1)*sqrt(sum2))));
}

The above function computes the cosine distance between two data points. Since we are working with text data, which is generally sparse in nature, i.e. each document contains only a small subset of the entire set of features, the cosine distance is a preferred choice over the Euclidean distance.
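
As a quick sanity check, the same distance can be computed in R for a toy pair of term-frequency vectors:

x <- c(apple = 1, ball = 2, cat = 0)
y <- c(apple = 1, ball = 0, cat = 3)

# cosine distance = 1 - cosine similarity, approximately 0.859 here
1 - sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))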

kmeansOutput kmeanspp(std::vector<int> rows, std::vector<int> cols, std::vector<double> vals, 
                      int numCentroids) {
  kmeansOutput out;
  
  std::map<int, std::unordered_map<int, double>> rowColsValsMap;
  std::set<int> rowsSet;
  
  for (size_t i = 0; i < rows.size(); i++) {
    rowColsValsMap[rows[i]][cols[i]] = vals[i];
    rowsSet.insert(rows[i]);
  }
  
  std::vector<int> rowsVec(rowsSet.begin(), rowsSet.end());
  
  // Cache of already computed distances (data point -> centroid -> distance) so that the
  // distance between the same pair is never computed more than once across iterations.
  std::map<int, std::map<int, double>> distanceCache;
  
  // Step 1: pick the first centroid uniformly at random among the data points.
  srand((unsigned)time(NULL));
  int randomDoc = rowsVec[rand()%(int)rowsVec.size()];
  
  out.clusterCenters.push_back(randomDoc);
  
  // Keep adding centroids until we have 'numCentroids' of them; one extra pass is made so
  // that the final distances and assignments reflect all the chosen centroids.
  while((int)out.clusterCenters.size() <= numCentroids) {
    
    // Step 2: recompute, for every data point, the distance to its nearest centroid chosen so
    // far (cosine distances on non-negative data are at most 1, hence the initialization).
    for (size_t i = 0; i < rowsVec.size(); i++) out.distances[rowsVec[i]] = 1.0;
    
    for (size_t i = 0; i < rowsVec.size(); i++) {
      for (size_t j = 0; j < out.clusterCenters.size(); j++) {
        
        double d;
        int center = out.clusterCenters[j];
        
        if (distanceCache.find(rowsVec[i]) != distanceCache.end() && distanceCache[rowsVec[i]].find(center) != distanceCache[rowsVec[i]].end()) {
          d = distanceCache[rowsVec[i]][center];
        } else {
          
          d = rowDistances(rowColsValsMap[rowsVec[i]], rowColsValsMap[center]);
          distanceCache[rowsVec[i]][center] = d;
        }
        
        if (d <= out.distances[rowsVec[i]]) {
          out.distances[rowsVec[i]] = d;
          out.clusters[rowsVec[i]] = center;
        }
      }
    }
    
    for (size_t i = 0; i < rowsVec.size(); i++) {
      out.distanceHist[(int)out.clusterCenters.size()][rowsVec[i]] = out.distances[rowsVec[i]];
      out.clustHist[(int)out.clusterCenters.size()][rowsVec[i]] = out.clusters[rowsVec[i]];
    }
    
    // Steps 3-4: build the running (prefix) sums of the squared distances, so that sampling a
    // data point with probability proportional to D(X_i)^2 reduces to drawing a uniform number
    // and locating the prefix-sum interval it falls into.
    std::vector<double> sums;
    
    for (size_t i = 0; i < rowsVec.size(); i++) {
      if (i == 0) sums.push_back(out.distances[rowsVec[i]]*out.distances[rowsVec[i]]);
      else sums.push_back(sums.back() + out.distances[rowsVec[i]]*out.distances[rowsVec[i]]);
    }
    
    // If every data point coincides with one of the centroids, there is nothing left to sample.
    if (sums.back() == 0) break;
    
    bool flag = false;
    
    // Draw y uniformly between sums[0] and sums.back().
    double y = sums[0] + (double)rand()/(double)(RAND_MAX/(sums.back()-sums[0]));
    
    // Find the index m such that sums[m-1] < y <= sums[m]; the corresponding data point
    // becomes the next centroid.
    for (int i = 1; i < (int)sums.size(); i++) {
      if (y <= sums[i] && y > sums[i-1]) {
        out.clusterCenters.push_back(rowsVec[i]);
        flag = true;
        break;
      }
    }
    
    // Fallback for floating point edge cases: take the last data point.
    if (!flag) out.clusterCenters.push_back(rowsVec.back());
  }
  
  // The loop above adds one centroid more than required (the extra pass recomputes the final
  // distances and assignments against all K centroids), so drop the last one.
  out.clusterCenters.pop_back();
  
  return out;
}

// [[Rcpp::export]]
List cpp__kmeanspp(std::vector<int> rows, std::vector<int> cols, std::vector<double> vals, int numCentroids) {
  
  kmeansOutput out = kmeanspp(rows, cols, vals, numCentroids);
  
  return List::create(_["clusters"]=out.clusters, _["clusterCenters"]=out.clusterCenters, 
                      _["distances"]=out.distances, 
                      _["distanceHistory"]=out.distanceHist,
                      _["clusterHistory"]=out.clustHist);
}

The two functions above implement the K-Means++ seeding. Note that since we are working with the sparse matrix format in R, each entry of the sparse matrix is stored as a triplet (i, j, v), where 'i' denotes the row number, 'j' the column number and 'v' the value at index (i, j). The input vector 'rows' holds the 'i' values, 'cols' the 'j' values and 'vals' the 'v' values; hence all three vectors have the same length.

In the 'kmeanspp' function, we have defined a variable 'distanceCache' that stores the computed distance between two data points. The idea is to avoid recomputing the distance between the same pair of data points in multiple iterations, which gives a noticeable speedup. Note that throughout these functions we make extensive use of the C++ 'map' and 'unordered_map' data structures, which can make the code look clumsy at times, but they give fast lookups (amortized O(1) for 'unordered_map' and O(log n) for 'map').

// [[Rcpp::export]]
List cpp__kmeans(std::vector<int> rows, std::vector<int> cols, std::vector<double> vals, int numCentroids, int maxNumIter=10) {
  
  // Choose the initial centroids with K-Means++, then reset the history so that it tracks
  // the K-Means iterations instead of the seeding phase.
  kmeansOutput out = kmeanspp(rows, cols, vals, numCentroids);
  out.distanceHist.clear();
  out.clustHist.clear();
  
  std::map<int, int> clusters = out.clusters;
  std::vector<int> clusterCenters = out.clusterCenters;
  
  std::map<int, std::unordered_map<int, double>> rowColsValsMap;
  std::set<int> rowsSet;
  
  for (size_t i = 0; i < rows.size(); i++) {
    rowColsValsMap[rows[i]][cols[i]] = vals[i];
    rowsSet.insert(rows[i]);
  }
  
  std::vector<int> rowsVec(rowsSet.begin(), rowsSet.end());
  
  // 'clusts' maps each cluster index (1..K) to the data points currently assigned to it,
  // and 'centers' maps each cluster index to its centroid vector.
  std::map<int, std::vector<int>> clusts0, clusts;
  std::map<int, std::unordered_map<int, double>> centers;
  
  // Initialize the centroids with the data points chosen by K-Means++.
  for (size_t i = 0; i < clusterCenters.size(); i++) centers[i+1] = rowColsValsMap[clusterCenters[i]];

  // The K-Means++ assignments are keyed by the centroid's row index; re-key them by cluster index 1..K.
  for (auto it = clusters.begin(); it != clusters.end(); ++it) clusts0[it->second].push_back(it->first);
  
  for (size_t i = 0; i < clusterCenters.size(); i++) 
    clusts[i+1].insert(clusts[i+1].end(), clusts0[clusterCenters[i]].begin(), clusts0[clusterCenters[i]].end());
  
  
  double lastSumDistance = std::numeric_limits<double>::max();
  int numIter = 0;
  
  while(numIter < maxNumIter) {
    
    // Recompute the centroid of every cluster as the (sparse) mean of its member rows.
    for (auto it = clusts.begin(); it != clusts.end(); ++it) {
      const std::vector<int> &r = it->second;
      std::unordered_map<int, double> centroid;

      if ((int)r.size() > 1) {
        for (size_t i = 0; i < r.size(); i++) {
          const std::unordered_map<int, double> &p1 = rowColsValsMap[r[i]];
          for (auto it2 = p1.begin(); it2 != p1.end(); ++it2) centroid[it2->first] += it2->second;
        }
        for (auto it2 = centroid.begin(); it2 != centroid.end(); ++it2) centroid[it2->first] = (it2->second)/(double)r.size();
        
      } else centroid = rowColsValsMap[r.front()];
      
      centers[it->first] = centroid;
    }
    
    // Reassign every data point to the nearest of the recomputed centroids.
    for (size_t i = 0; i < rowsVec.size(); i++) out.distances[rowsVec[i]] = 1.0;
    
    for (size_t i = 0; i < rowsVec.size(); i++) {
      for (auto it = centers.begin(); it != centers.end(); ++it) {
        
        const std::unordered_map<int, double> &center = it->second;
        
        double d = rowDistances(rowColsValsMap[rowsVec[i]], center);

        if (d <= out.distances[rowsVec[i]]) {
          out.distances[rowsVec[i]] = d;
          out.clusters[rowsVec[i]] = it->first;
        }
      }
    }
    
    for (size_t i = 0; i < rowsVec.size(); i++) {
      out.distanceHist[numIter][rowsVec[i]] = out.distances[rowsVec[i]];
      out.clustHist[numIter][rowsVec[i]] = out.clusters[rowsVec[i]];
    }
    
    // Rebuild the cluster membership lists from the new assignments.
    clusts.clear();
    for (auto it = out.clusters.begin(); it != out.clusters.end(); ++it) clusts[it->second].push_back(it->first);
    
    // Stop iterating once the total distance improves by less than 1.0 over the previous iteration.
    double sumDistances = 0.0;
    for (size_t i = 0; i < rowsVec.size(); i++) sumDistances += out.distances[rowsVec[i]];
    
    if (lastSumDistance - sumDistances < 1.0) break;
    lastSumDistance = sumDistances;
    
    numIter++;
  }
  
  return List::create(_["clusters"]=out.clusters, 
                      _["distances"]=out.distances, 
                      _["distanceHistory"]=out.distanceHist,
                      _["clusterHistory"]=out.clustHist);
}

Given a sparse (simple triplet) matrix 'dtm' in R, we obtain the clustering results by calling:

cpp__kmeans(dtm$i, dtm$j, dtm$v, k, 5)

where the last argument specifies that the maximum number of iterations is 5. In the 'cpp__kmeans' function we have defined that if the sum of the distances of the data points to their cluster centroids decreases by at most 1 from the previous iteration, no further iterations are performed.
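
As a rough end-to-end sketch, the call might look like this when the document-term matrix is built with the 'tm' package; the file name 'kmeans.cpp', the character vector 'texts' and the tf-idf weighting are assumptions made only for illustration:

library(tm)
library(Rcpp)

sourceCpp("kmeans.cpp")    # compiles cpp__kmeanspp and cpp__kmeans

corpus <- VCorpus(VectorSource(texts))
dtm    <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))

k   <- 6
res <- cpp__kmeans(dtm$i, dtm$j, dtm$v, k, 5)

table(res$clusters)        # number of documents assigned to each cluster
head(res$distances)        # distance of each document to its cluster centroid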

To measure the effectiveness of the clustering algorithm, we use a labelled data set with 6 classes, each class containing 20 examples, for a total of 120 examples. The total number of features is around 3600. After clustering these 120 examples into K=6 clusters, we compute, for each cluster, the entropy of its class distribution. For example, if a cluster contains 5 examples from class 1, 8 from class 2 and 7 from class 3, then the class probabilities are 0.25, 0.40 and 0.35 for classes 1, 2 and 3 respectively, hence the entropy of the cluster is:

-0.25*log2(0.25)-0.40*log2(0.40)-0.35*log2(0.35) = 1.56

whereas if a cluster contains 18 examples from class 1, 1 from class 2 and 1 from class 3, then the entropy is 0.57. If a cluster contains examples from only one class, then its entropy is 0. Our goal is to minimize the weighted entropy of the clusters, i.e.

\frac{\sum_{i=1}^{|C|}|C_i|\,H(C_i)}{N}

where |C| denotes the number of clusters, |C_i| the number of examples in cluster C_i, H(C_i) the entropy of cluster C_i as defined above, and N the total number of examples.
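
The weighted entropy itself is straightforward to compute in R; the sketch below assumes 'clusters' holds the cluster assigned to each example and 'labels' its true class, in the same order:

weighted.entropy <- function(clusters, labels) {
  N <- length(labels)
  sum(sapply(split(labels, clusters), function(lab) {
    p <- table(lab) / length(lab)             # class proportions within the cluster
    p <- p[p > 0]                             # guard against empty factor levels
    length(lab) * (-sum(p * log2(p)))         # |C_i| * H(C_i)
  })) / N
}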

With our data set, when we call the 'cpp__kmeans' function with the number of clusters set to 6 and the maximum number of iterations set to 10, the algorithm converges by the 4th iteration.

The weighted entropy of the 6 clusters after these 4 iterations is 0.60.

But since the selection of the initial cluster centroids is random, different runs of the K-Means algorithm give different values of the weighted entropy. Hence, to get a reliable estimate of the weighted entropy, we run the K-Means algorithm 20 times and take the average.

To compare the 'cpp__kmeans' function with the 'kmeans' function in R, we benchmark the time taken to create the final clusters as well as the weighted entropy, averaged over 20 runs.
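
One possible way to do this averaging is sketched below; it reuses the 'weighted.entropy' function from above and assumes that 'labels' is aligned with the document indices of 'dtm':

n.runs <- 20

# weighted entropy of each of the 20 runs
cpp.entropy <- replicate(n.runs, {
  res <- cpp__kmeans(dtm$i, dtm$j, dtm$v, k, 10)
  weighted.entropy(res$clusters, labels)
})

# average elapsed time of a single run
cpp.time <- system.time(
  replicate(n.runs, cpp__kmeans(dtm$i, dtm$j, dtm$v, k, 10))
)["elapsed"] / n.runs

c(mean.entropy = mean(cpp.entropy), mean.time = unname(cpp.time))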

We varied the number of classes from 5 to 20, keeping the number of examples per class fixed at 20. The following results were obtained:

  K   Time cpp__kmeans   Time R kmeans   Wtd. Entropy cpp__kmeans   Wtd. Entropy R kmeans
  5   0.2966             1.0066          1.058723                   0.9374297
  6   0.5495             1.11055         0.9301231                  1.189407
  7   0.8173             1.6226          1.132335                   2.007462
  8   1.07955            2.65975         1.247047                   1.341089
  9   1.39065            2.3786          0.9297324                  1.37014
 10   1.88095            3.15245         1.014562                   1.194692
 11   1.8723             3.88365         0.8298964                  1.142815
 12   2.55065            5.27285         1.173092                   1.761781
 13   3.141              6.1843          1.200879                   1.852063
 14   3.54175            7.1408          1.278477                   1.836246
 15   3.4614             7.0622          0.9113705                  1.42197
 16   5.45685            12.9538         1.170585                   1.812935
 17   4.89925            11.2324         1.039075                   1.713827
 18   5.80825            14.41885        1.280135                   1.909693
 19   8.46155            15.976          1.380371                   2.206767
 20   8.2514             16.89865        1.332589                   1.984053

The 'Time cpp__kmeans' column gives the time taken to create the clusters using the cpp__kmeans function for K = 5 to 20 clusters (100 to 400 data points), and the 'Time R kmeans' column gives the time taken by the kmeans method in R.

Clearly for each value of K, the cpp__kmeans method outperforms the kmeans method in R in terms of speed.

The 'Wtd. Entropy cpp__kmeans' column gives the weighted entropy for each value of K with the cpp__kmeans algorithm, whereas the 'Wtd. Entropy R kmeans' column gives the weighted entropy with R's kmeans method.

Except for the first row, K=5 (where the two are still quite comparable), the cpp__kmeans algorithm also outperforms the kmeans method in R in terms of clustering quality for all other values of K. We can improve this performance further with careful feature selection and with dimensionality reduction techniques such as PCA, LDA or NMF, which we shall explore in future posts.

Categories: MACHINE LEARNING, PROBLEM SOLVING
