Machine Learning, AI and Programming

Feature Selection with Mutual Information

Given two random variables X and Y, mutual information measures how much knowing one of these variables reduces uncertainty about the other. For example, if X and Y are independent, then knowing X does not give any information about Y and vice-verse, so their mutual information is zero.

At the other extreme, if X is completely correlated with Y then all information conveyed by X is also conveyed by Y, i.e. knowing X determines the value of Y and vice-verse. Mathematically, mutual information between two random discrete variables X and Y is formulated as :

 MI(X,Y) = \sum_{i,j} P(X_i, Y_j) * log_2 \frac{P(X_i, Y_j)}{P(X_i)P(Y_j)}

If the above random variables are continuous then the sum is replaced with double integral. Mutual Information is related to entropy of the random variables as :

 MI(X,Y) = H(Y) - H(Y|X) = H(X) - H(X|Y)

Where H(X) and H(Y) are the individual entropies of the random variable X and Y. H(X|Y) denotes the entropy or the uncertainty of the random variable X given that we know Y and similarly H(Y|X). The quantity H(X) - H(X|Y) represents the entropy or uncertainty remaining in X even after Y is known. This is exactly what we have defined Mutual Information to be in the first paragraph. If X and Y are independent, then H(X|Y)=H(X), thus MI is 0 and if X and Y are perfectly correlated, then H(X|Y)=0, since if we know Y then no entropy remains in X, and thus MI is H(X) in this case.

Consider a news classification example. Suppose we are given to classify a news article based on its contents into either Sports or Entertainment category. Assuming that we have a lot of training examples belonging to each of the categories, features (words, phrases etc.) found in those training examples are not equally informative.

For example the word "coach" might be important for Sports articles compared to Entertainment and thus is very informative whereas the word "audience" might be  equally important for both categories and thus not informative. Using large number of features from the training examples have several disadvantages like large memory requirements, dilution of important features and so on. So instead we would look to keep only informative features and discard features which occur equally across both categories (less informative).

For example, out of 100 training samples provided, 50 belong to Sports category and remaining 50 to Entertainment. The word "coach" occurs in 40 out of 50 Sports examples, whereas it occurs in only 5 out of 50 examples in Entertainment. To compute the mutual information between the word "coach" and the category Sports, we compute :

 MI = P(X_1, Y_1) * log_2 \frac{P(X_1, Y_1)}{P(X_1)P(Y_1)}+P(X_1, Y_2) * log_2 \frac{P(X_1, Y_2)}{P(X_1)P(Y_2)} +P(X_2, Y_1) * log_2 \frac{P(X_2, Y_1)}{P(X_2)P(Y_1)}+P(X_2, Y_2) * log_2 \frac{P(X_2, Y_2)}{P(X_2)P(Y_2)}

Where X1 is the event that the word "coach" occurs in a document, X2 is the event that the word "coach" do not occur in a document, Y1 is the event that the document belongs to Sports category and Y2 is the event that the document do not belong to the Sports category. From the given data, we have the following :

 P(X_1) = 0.45,P(X_2) = 0.55,P(Y_1) = 0.50,P(Y_2) = 0.50, P(X_1, Y_1) = 0.40, P(X_1, Y_2) = 0.05, P(X_2, Y_1) = 0.10, P(X_2, Y_2) = 0.45

Thus the mutual information = 0.397.

Similarly, the word "audience" occurs in 31 of 50 Sports articles and 35 out of 50 Entertainment articles.

 P(X_1) = 0.66,P(X_2) = 0.34,P(Y_1) = 0.50,P(Y_2) = 0.50, P(X_1, Y_1) = 0.31, P(X_1, Y_2) = 0.35, P(X_2, Y_1) = 0.19, P(X_2, Y_2) = 0.15

Substituting the values back, we get the mutual information to be 0.005. Much less than the mutual information of the word "coach" with Sports category.

To decide whether a feature F should be used in the classifier, we could take either the maximum of the mutual information of that feature with every category or we could compute the mutual information weighted by the probability of each class and then based on some threshold decide to keep it or discard it. In our implementation, we choose the top K features with the maximum MI value for each class and then take their union.

Below is a C++ code written to compute the mutual information of each feature with each class and return the top K (K=150 in this case) features for each class with the highest MI values. Assuming a sparse document feature matrix and a vector of class labels for each document, the input is a triplet of rows, cols and values for a document feature matrix, with a vector of class labels for each document and a constant K i.e. number of features to select per class :

// [[Rcpp::export]]
std::set<int> cpp__mutualInformation(std::vector<int> rows, std::vector<int> cols,  
                                     std::vector<int> classes, int maxFeaturesPerClass) {
  return entropyMeasures(rows, cols, classes,
                         [](const double & classWithFeatureCount, const double & classCount, const double & featureCount, const double & totalDocs)
                           double classProb = classCount/totalDocs;
                           double featureProb = featureCount/totalDocs;
                           double classWithFeatureProb = classWithFeatureCount/totalDocs;
                           double noClassWithFeatureProb = (featureCount - classWithFeatureCount)/totalDocs;
                           double classWithoutFeatureProb = (classCount - classWithFeatureCount)/totalDocs;
                           double noClassWithoutFeatureProb = (totalDocs - featureCount - classCount + classWithFeatureCount)/totalDocs;
                           return classWithFeatureProb*log2(classWithFeatureProb/(classProb*featureProb))
                             + ((noClassWithFeatureProb != 0) ? noClassWithFeatureProb*log2(noClassWithFeatureProb/((1-classProb)*featureProb)) : 0)  +
                               + ((classWithoutFeatureProb != 0) ? classWithoutFeatureProb*log2(classWithoutFeatureProb/(classProb*(1-featureProb))) : 0) +
                               + ((noClassWithoutFeatureProb != 0) ? noClassWithoutFeatureProb*log2(noClassWithoutFeatureProb/((1-classProb)*(1-featureProb))) : 0);
                         }, maxFeaturesPerClass);

We have internally defined a generic feature selection method "entropyMeasures", that takes as input the rows, columns and values of the document feature matrix and also a lambda function to compute and return the desired output with the chosen feature selection technique. The advantage with this approach is that, since the inputs and outputs for feature selection remains the same and only the computation logic might change, thus reusing the function "entropyMeasures", we could just call this function with a different algorithm (such as chi-square).

Note that we have omitted the document term matrix values in the function definition, since we are assuming that we are working with a binary valued feature matrix.

std::set<int> entropyMeasures(std::vector<int> rows, std::vector<int> cols, std::vector<int> classes,
                              double (*func) (const double &, const double &, const double &, const double &),
                              const int &maxFeaturesPerClass) {
  std::set<int> myOutput;
  std::map<int, std::map<int, int>> fcCounts = featureClassCounts(rows, cols, classes);
  std::map<int, int> clCounts = classCounts(classes);
  std::map<int, int> fCounts = classCounts(cols);
  int totalDocs = 0;
  std::for_each(clCounts.begin(), clCounts.end(), [&totalDocs](std::pair<const int, int>& m){totalDocs += m.second;});
  std::map<int, std::priority_queue<std::pair<int, double>, std::vector<std::pair<int, double>>, comparator>> classMinHeaps;
  for (auto it = fCounts.begin(); it != fCounts.end(); ++it) {
    int featureIndex = it->first;
    int featureCount = it->second;
    std::map<int, int> p = fcCounts[featureIndex];
    for (auto it2 = p.begin(); it2 != p.end(); ++it2) {
      int cl = it2->first;
      int classWithFeatureCount = it2->second;
      int classCount = clCounts[cl];
      double value = func((double)classWithFeatureCount, (double)classCount, (double)featureCount, (double)totalDocs);
      if (classMinHeaps[cl].empty() || (int)classMinHeaps[cl].size() < maxFeaturesPerClass) classMinHeaps[cl].push(std::make_pair(featureIndex, value));
      else if (value > classMinHeaps[cl].top().second) {
        classMinHeaps[cl].push(std::make_pair(featureIndex, value));
  for (auto it = classMinHeaps.begin(); it != classMinHeaps.end(); ++it) {
    std::priority_queue<std::pair<int, double>, std::vector<std::pair<int, double>>, comparator> q = it->second;
    while(!q.empty()) {
      std::pair<int, double> tops =;
  return myOutput;

We have used min heaps per class to compute the top K most informative features for that class. Although this approach of selecting a constant number of features per class is un-supervised, but it helps to reduce the dimension of the document feature matrix to a great extent without much loss in feature information, for further processing with a classifier.


Tags: , ,