Stokastik

Machine Learning, AI and Programming

Building a Classification Tree from scratch

In this post I am going to demonstrate an implementation of a classification tree from scratch for multi-class classification. Since most of my work involves text classification, the classification tree demonstrated here has been built with text classification problems in mind. The theoretical explanation of the various components and modules of a classification tree can be found in this paper. This implementation is not ready for use in production systems; it is meant for demo purposes only.

Classification Tree for Good/Bad Credit

The above diagram of a classification tree, taken from the same paper referenced above, depicts how the decision is made on whether a loan application is good or bad.

The algorithm is implemented in C++11 and interfaced with R using the Rcpp library. The core ideas behind the implementation are :

  • Construct the basic structure of the classification tree using C++ node struct pointers.
  • Split the nodes using the Gini impurity measure.
  • Prune the fully grown tree with cost-complexity pruning to avoid overfitting.
  • Build the modules for testing the tree model with unseen data.
  • Handle certain other details, such as converting the tree model into an R data-frame for export to R functions and handling missing values, which we will look into as we develop the code.

To begin with, I am using lots of different hash-maps for caching intermediate data, input metadata etc. for faster computation. In order to avoid writing a lot of nested hash-map types in the code, I am using the following typedef aliases, which let me work with much cleaner type names.

typedef std::unordered_map<int, std::unordered_map<double, std::set<int>>> Int_Double_SetInt;
typedef std::unordered_map<int, std::set<double>> Int_SetDouble;
typedef std::unordered_map<int, std::set<int>> Int_SetInt;
typedef std::unordered_map<int, std::vector<int>> Int_VecInt;
typedef std::unordered_map<int, std::unordered_map<int, double>> Int_Int_Double;
typedef std::unordered_map<int, double> Int_Double;
typedef std::unordered_map<int, int> Int_Int;
typedef std::unordered_map<int, std::vector<std::pair<int, double>>> Int_VecPairIntDouble;
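
As a small illustration (hypothetical usage, not part of the implementation), the aliases read much more cleanly than the underlying nested types :

Int_Int_Double rowColValMap;   // instead of std::unordered_map<int, std::unordered_map<int, double>>
rowColValMap[5][17] = 0.042;   // value of feature 17 for training instance 5

Int_Double classProbs;         // instead of std::unordered_map<int, double>
classProbs[2] = 0.75;          // probability of class label 2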

Next we define the basic building block of the tree, i.e. the node. For the Node data structure, I am using C++ struct pointers. If a node is a non-leaf node, it stores the feature and the value of that feature on which a split is made for expanding the tree below it, e.g. X <= b.

Each node has the following additional set of attributes :

  • Indices of the training instances used for splitting the node.
  • Probability distribution of the class labels for the above training instances.
  • Pointers to the left and right subtrees.
  • Some other attributes assigned to each node for efficient computation, such as the depth of the node in the tree and the number of leaves in the branch rooted at the node. These are not explicitly used for prediction but are useful for pruning the tree.
struct Split {
  int featureIndex;
  double featureDecisionVal;
  
  std::set<int> leftRows, rightRows;
};
struct Node {
  
  Split bestSplit;
  
  Int_Double classProbs;
  
  int depth, numLeavesBranch;
  double resubErrorBranch;
  
  std::set<int> rows;
  
  Node* left;
  Node* right;
};

Each split also stores the set of rows that are sent to the left subtree and the set of rows sent to the right subtree. For aggregating the training metadata, such as the class label of each training instance, the weights of the training instances, the set of unique values for each feature, etc., I am defining another struct.

Note that I am passing the training data to the C++ module in the form of a sparse matrix (document term matrix), i.e. triplets (i, j, v) where i is the row index of a training instance, j is the feature index and v is the corresponding value at position (i, j) in the matrix. For example, the triplet (1, 3, 0.42) means that instance 1 has feature 3 with value 0.42.

struct InputMetaData {
  Int_SetInt rowCols;
  Int_Int_Double rowColValMap;
  Int_Double instanceWeights;
  Int_Int rowClassMap;
  Int_SetInt colSparseRows;
  Int_Int_Double rowClassCounts;
  Int_VecInt colSortedRows;
  
  std::set<int> classLabels;
};

The purpose of each of the above variables is as follows :

  • "rowCols" - Map from each training instance to the set of feature indices.
  • "rowColValMap" - Map from each training instance, and feature pair (R, F) to the corresponding value in the matrix.
  • "instanceWeights" - Map from each training instance to its corresponding case weights.
  • "rowClassMap" - Map from each training instance to its class labels.
  • "colSparseRows" - Map from each feature index to the set of training instances for which the feature is missing from the input sparse matrix (because the count of the feature in the instance is 0).
  • "rowClassCounts" - Map from each training instance to the map of class probabilities for that instance.
  • "colSortedRows" - Map from each feature index to a vector of training instances containing the feature, sorted according to increasing feature values.
  • "classLabels" - Set of all unique class labels for training.

The following function is used for replicating or copying a Node object in memory. Note that it performs a "deep copy" in order to copy the left and right subtrees. This function is useful when generating the sequence of trees during cost-complexity pruning; instead of creating new trees in every round, I re-use the already built sequence of trees.

Node* copyNode(Node* a) {
  if (a == NULL) return NULL;
  
  Node* b = new Node();
  
  b->bestSplit = a->bestSplit;
  b->classProbs = a->classProbs;
  b->depth = a->depth;
  b->numLeavesBranch = a->numLeavesBranch;
  b->resubErrorBranch = a->resubErrorBranch;
  b->rows = a->rows;
  
  b->left = copyNode(a->left);
  b->right = copyNode(a->right);
  
  return b;
}

The following function is used to compute the class probabilities for a leaf node given its training instances. Note that the class probabilities for leaf nodes take the instance weights into account.

Int_Double computeLeafClassProbs(const std::set<int> &rows, InputMetaData &metaData) {
  
  Int_Double classProbs;
  
  for (auto p = metaData.classLabels.begin(); p != metaData.classLabels.end(); ++p) classProbs[*p] = 0;
  
  double weightSum = std::accumulate(rows.begin(), rows.end(), 0.0, 
                                     [&metaData](double weightSum, const int &row){return weightSum + metaData.instanceWeights[row];});
  
  for (auto p = rows.begin(); p != rows.end(); ++p) classProbs[metaData.rowClassMap[*p]] += metaData.instanceWeights[*p]/weightSum;
  
  return classProbs;
}

The following function is used to compute the counts of all possible classes for a given set of training instances :

Int_Double classCountsForRows(const std::set<int> &rows, InputMetaData &metaData) {
  Int_Double clCounts;
  
  for (auto p = metaData.classLabels.begin(); p != metaData.classLabels.end(); ++p) clCounts[*p] = 0;
  
  for (auto p = rows.begin(); p != rows.end(); ++p) clCounts[metaData.rowClassMap[*p]]++;
  
  return clCounts;
}

The following function computes the number of rows from the class counts by summing over all the counts :

double classCountSum(Int_Double &clCounts) {
  double sum = std::accumulate(clCounts.begin(), clCounts.end(), 0.0, 
                               [](double sum, const std::pair<int, double> &a){return sum + a.second;});
  
  return sum;
}

The following function is used to compute the class probabilities for internal nodes.

Int_Double computeClassProbs(const std::set<int> &rows, InputMetaData &metaData) {
  
  Int_Double classCounts = classCountsForRows(rows, metaData);
  
  double numRows = classCountSum(classCounts);
  
  for (auto p = classCounts.begin(); p != classCounts.end(); ++p) classCounts[p->first] = (p->second)/(double)numRows;
  
  return classCounts;
}

Either of the following two functions can be used for computing the impurity of a split. The first one is the Gini impurity and the second one is the entropy measure.

double giniImpurity(Int_Double &classCounts) {
  
  Int_Double classProbs;
  
  double numRows = classCountSum(classCounts);
  
  for (auto p = classCounts.begin(); p != classCounts.end(); ++p) classProbs[p->first] = (p->second)/(double)numRows;
  
  double impurity = std::accumulate(classProbs.begin(), classProbs.end(), 0.0, 
                                    [](double impurity, const std::pair<int, double> &a){return impurity + a.second*(1-a.second);});
  
  return impurity;
}
double entropy(Int_Double &classCounts) {
  
  Int_Double classProbs;
  
  double numRows = classCountSum(classCounts);
  
  for (auto p = classCounts.begin(); p != classCounts.end(); ++p) classProbs[p->first] = (p->second)/(double)numRows;
  
  double impurity = std::accumulate(classProbs.begin(), classProbs.end(), 0.0, 
                                    [](double impurity, const std::pair<int, double> &a){return (a.second > 0) ? impurity - a.second*log2(a.second) : impurity;});
  
  return impurity;
}
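
As a quick worked example, suppose a node contains 8 instances with class counts {6, 2}, i.e. class probabilities 0.75 and 0.25. The Gini impurity is 0.75*(1-0.75) + 0.25*(1-0.25) = 0.375 and the entropy is -0.75*log2(0.75) - 0.25*log2(0.25) ≈ 0.811, whereas a pure node (all instances of one class) has impurity 0 under both measures.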

The below method is used for selecting the best class from the posterior class probabilities at a node.

std::pair<int, double> bestClass(Int_Double &classProbs) {
  
  auto it = std::max_element(classProbs.begin(), classProbs.end(), 
                             [](const std::pair<int, double> &a, const std::pair<int, double> &b){return a.second < b.second;});
  
  return *it;
}

The below method is called for creating a leaf node with the desired properties :

Node* getLeafNode(const std::set<int> &rows, InputMetaData &metaData) {
  
  Node *node = new Node();
  
  node->bestSplit = {-1, -1, std::set<int>(), std::set<int>()};
  
  node->classProbs = computeLeafClassProbs(rows, metaData);
  
  node->rows = rows;
  
  node->left = NULL;
  node->right = NULL;
  
  return node;
}

The next two functions are used for summing two class count maps and for subtracting one class count map from another, respectively. These functions are helpful when iterating over the sorted rows of a node and incrementally computing the class counts of the left and right subsets of rows with respect to a split point.

Int_Double addClassCounts(Int_Double &clCounts1, Int_Double &clCounts2, InputMetaData &metaData) {
  Int_Double clCounts;
  
  for (auto p = metaData.classLabels.begin(); p != metaData.classLabels.end(); ++p) clCounts[*p] = clCounts1[*p] + clCounts2[*p];
  
  return clCounts;
}
Int_Double subClassCounts(Int_Double &clCounts1, Int_Double &clCounts2, InputMetaData &metaData) {
  Int_Double clCounts;
  
  for (auto p = metaData.classLabels.begin(); p != metaData.classLabels.end(); ++p) clCounts[*p] = clCounts1[*p] - clCounts2[*p];
  
  return clCounts;
}

Next we move on to create a Node in the tree :

Node* createNode(const std::set<int> &currRows,
                 InputMetaData &metaData, double (*costFuntion) (Int_Double &),
                 const bool &maxDepthReached) {
  
  Node *node = new Node();
  
  Int_Double classProbs = computeClassProbs(currRows, metaData);
  std::pair<int, double> best = bestClass(classProbs);

  if (best.second >= 0.95 || maxDepthReached) node = getLeafNode(currRows, metaData);
  
  else {
    double maxFeatureGain = 0;
    double featureDecisionVal = -1;
    int featureIdx = -1;
    
    std::set<int> leftRows, rightRows, currCols;
    
    for (auto p = currRows.begin(); p != currRows.end(); ++p) currCols.insert(metaData.rowCols[*p].begin(), metaData.rowCols[*p].end());
    
    for (auto p = currCols.begin(); p != currCols.end(); ++p) {
      int feature = *p;
      
      std::set<int> sparseRows;
      
      std::set_intersection(currRows.begin(), currRows.end(), 
                            metaData.colSparseRows[feature].begin(), metaData.colSparseRows[feature].end(),
                            std::inserter(sparseRows, sparseRows.end()));
      
      Int_Double currRowsClassCounts = classCountsForRows(currRows, metaData);
      Int_Double sparseRowsClassCounts = classCountsForRows(sparseRows, metaData);
      
      double totalImpurity = costFuntion(currRowsClassCounts);
      
      std::vector<int> allSortedRows = metaData.colSortedRows[feature];
      std::vector<int> currSortedRows;
      
      for (size_t i = 0; i < allSortedRows.size(); i++) if (currRows.find(allSortedRows[i]) != currRows.end()) currSortedRows.push_back(allSortedRows[i]);
      
      double maxGain = 0;
      double decisionVal = -1;
      
      double maxVal = std::numeric_limits<double>::lowest();
      
      std::vector<int> runningRows, bestLeftRows;
      
      Int_Double lastLeftClassCounts;
      
      int start = (sparseRows.size() > 0) ? -1 : 0;
      int i = start;
      
      while(i < (int)currSortedRows.size()) {
        
        Int_Double ltClassCounts, rtClassCounts;
        double currVal;
        
        if (i == -1) {
          ltClassCounts = sparseRowsClassCounts;
          runningRows.insert(runningRows.end(), sparseRows.begin(), sparseRows.end());
          currVal = 0.0;
        } 
        else {
          ltClassCounts = addClassCounts(lastLeftClassCounts, metaData.rowClassCounts[currSortedRows[i]], metaData);
          runningRows.push_back(currSortedRows[i]);
          currVal = metaData.rowColValMap[currSortedRows[i]][feature];
        }
        
        rtClassCounts = subClassCounts(currRowsClassCounts, ltClassCounts, metaData);
        
        int leftNumRows = (int)runningRows.size();
        int rightNumRows = (int)currRows.size()-leftNumRows;
        
        double leftCost = costFuntion(ltClassCounts);
        double rightCost = costFuntion(rtClassCounts);
        
        double w1 = (double)leftNumRows/(double)currRows.size();
        double w2 = (double)rightNumRows/(double)currRows.size();
        
        double gain = totalImpurity-(w1*leftCost+w2*rightCost);
        
        if (gain > maxGain) {
          maxGain = gain;
          decisionVal = currVal;
          bestLeftRows = runningRows;
        }
        
        lastLeftClassCounts = ltClassCounts;
        
        if (maxVal < currVal) maxVal = currVal;
        i++;
      }
      
      if (maxGain > maxFeatureGain && decisionVal < maxVal) {
        maxFeatureGain = maxGain;
        featureDecisionVal = decisionVal;
        featureIdx = feature;
        
        std::set<int> l(bestLeftRows.begin(), bestLeftRows.end());
        leftRows = l;
      }
    }
    
    if (featureIdx != -1) {
      
      std::set_difference(currRows.begin(), currRows.end(), leftRows.begin(), leftRows.end(),
                          std::inserter(rightRows, rightRows.end()));
      
      node->bestSplit = {featureIdx, featureDecisionVal, leftRows, rightRows};
      
      node->rows = currRows;
      
      node->classProbs = computeClassProbs(currRows, metaData);
    }
    
    else node = getLeafNode(currRows, metaData);
  }
  
  return node;
}

The above function accepts the set of training instances that are to be used for deciding the split at the current node. If the best class probability at the current node is at least 0.95, or the maximum depth has been reached, the node is made a leaf. Otherwise we iterate over the possible features and their possible values for the current set of training instances (to reduce the search space of features and feature values) and choose the split for which the reduction in impurity is maximum.

Note that in the input sparse document term matrix with TF-IDF weighting, features that are absent from an instance have a TF-IDF score of 0 and therefore do not appear in the sparse matrix. Such values look "missing", although they are not truly missing. To handle all instance-feature pairs with a TF-IDF score of 0, we explicitly include them in the above calculation through the "sparseRows" variable. All instances with a feature value of 0 for the current split are sent to the left subtree by default.

The algorithm is pretty simple to understand :

  • Iterate over the possible features for the current set of instances in this node.
  • For each feature, iterate over the current instances sorted in increasing order of their feature values.
  • For each feature value, extract the class distribution of the left subset of rows and the right subset of rows. To avoid re-computing these values over and over, we keep a running class distribution.
  • Based on the class distributions of the left and right subsets of rows for a feature value, compute the Gini impurity of the split and the reduction in Gini impurity relative to the impurity of the current set of instances (a toy example follows this list).
  • Store the feature index and the feature value giving the maximum reduction in Gini impurity.
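
As a toy example of the gain computation, suppose a node has 10 instances with class counts {5, 5}, so its Gini impurity is 0.5. A candidate split sends 6 instances with counts {5, 1} to the left (impurity 2*(5/6)*(1/6) ≈ 0.278) and 4 instances with counts {0, 4} to the right (impurity 0). The weighted impurity after the split is 0.6*0.278 + 0.4*0 ≈ 0.167, so the reduction in impurity (gain) is 0.5 - 0.167 ≈ 0.333.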

Construct the tree "Node" by "Node". There are multiple ways to construct the tree, the simplest of them being recursively calling the "createNode" method for the left and the right subtrees. But given that each recursion call uses its own copy of the function space and might quickly lead to stack overflow, if the tree becomes huge. We use a non-recursive approach by using Queue data structures to construct the tree as shown below :

Node* constructTree(const std::set<int> &rows,
                    InputMetaData &metaData, 
                    double (*costFuntion) (Int_Double &),
                    const int &maxDepth) {
  
  std::queue<Node*> nodeQ;
  
  Node *node = createNode(rows, metaData, costFuntion, false);
  node->depth = 0;
  
  Node* root = node;
  
  nodeQ.push(node);
  
  while(!nodeQ.empty()) {
    node = nodeQ.front();
    nodeQ.pop();
    
    if (node->bestSplit.featureDecisionVal != -1) {
      bool maxDepthReached = (node->depth == maxDepth-1);
      
      node->left = createNode(node->bestSplit.leftRows, metaData, costFuntion, maxDepthReached);
      node->right = createNode(node->bestSplit.rightRows, metaData, costFuntion, maxDepthReached);
      
      node->left->depth = node->depth + 1;
      node->right->depth = node->depth + 1;
      
      nodeQ.push(node->left);
      nodeQ.push(node->right);
    }
  }
  
  return root;
}

The next function defines how to convert the tree data structure into an R DataFrame for exporting the tree object to and from R.

DataFrame transformTreeIntoDF(Node* node, DataFrame &leafNodeClassProbs) {
  
  std::vector<int> nodeIndex, leftNodeIndex, rightNodeIndex, featureIndex, leafLabels, leafIndices;
  std::vector<double> featureDecisionVal, leafLabelProbs;
  
  std::queue<Node*> nodeQ;
  nodeQ.push(node);
  
  int index = 0;
  int lastIndex = 0;
  
  while (!nodeQ.empty()) {
    Node* n = nodeQ.front();
    
    nodeIndex.push_back(index);
    
    featureIndex.push_back(n->bestSplit.featureIndex);
    featureDecisionVal.push_back(n->bestSplit.featureDecisionVal);
    
    if (n->left != NULL) {
      leftNodeIndex.push_back(++lastIndex);
      nodeQ.push(n->left);
      
      rightNodeIndex.push_back(++lastIndex);
      nodeQ.push(n->right);
    }
    
    else {
      leftNodeIndex.push_back(-1);
      rightNodeIndex.push_back(-1);
      
      std::unordered_map<int, double> predProbs = n->classProbs;
      
      for (auto p = predProbs.begin(); p != predProbs.end(); ++p) {
        leafLabels.push_back(p->first);
        leafIndices.push_back(index);
        leafLabelProbs.push_back(p->second);
      }
    }
    
    nodeQ.pop();
    index++;
  }
  
  leafNodeClassProbs = DataFrame::create(_["LeafIndex"]=leafIndices, _["LeafLabel"]=leafLabels, _["LeafLabelProb"]=leafLabelProbs);
  
  return DataFrame::create(_["NodeIndex"]=nodeIndex, _["LeftNodeIndex"]=leftNodeIndex, 
                           _["RightNodeIndex"]=rightNodeIndex, 
                           _["FeatureIndex"]=featureIndex, 
                           _["FeatureDecisionVal"]=featureDecisionVal);
}

Note that I am not storing all the attributes of the Node structure in the output DataFrame. Basically we are "flattening" the tree, using an index for each Node pointer, with each node index referring to its left and right node indices. The most important attributes are those of the splits, i.e. the feature and the feature value on which a split is made at the current node index.

Below is a sample tree model converted to R data-frame :

   NodeIndex LeftNodeIndex RightNodeIndex FeatureIndex FeatureDecisionVal
1          0             1              2           11             0.1680
2          1             3              4            5             0.0794
3          2             5              6           17             0.5386
4          3            -1             -1           -1            -1.0000
5          4            -1             -1           -1            -1.0000
6          5            -1             -1           -1            -1.0000
7          6             7              8           20             0.6433
8          7            -1             -1           -1            -1.0000
9          8             9             10           43             0.1713
10         9            11             12           31             0.3698
11        10            -1             -1           -1            -1.0000
12        11            -1             -1           -1            -1.0000
13        12            -1             -1           -1            -1.0000

Next we define the function for doing prediction on a test instance. The following function recursively traverses the tree and decides the path (left or right) at each node based on the value of the splitting feature. For example, if the splitting condition at a node is X <= 0.75 for a feature X and the instance has X = 0.69, then the instance takes the left subtree; if instead X = 0.97, it takes the right subtree, and so on at all levels of the tree. For a reasonably balanced tree with N nodes, an instance requires about log_2(N) comparisons.

If the feature that was used for splitting the current node is "missing" from the test instance, then there are techniques based on surrogate splits (which we have not implemented here). The idea behind surrogate splits is to consider all splits that are very similar to the best split at the node, where similarity is based on the set of common training instances sent to the left and right subtrees. If the feature of the best split is missing, we look at the surrogate splits in decreasing order of their similarity to the best split, and the first surrogate split whose feature is present in the test instance is used in place of the best split.

Alternative split on gender for loan approval
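
Surrogate splits are not implemented in this post, but the following minimal sketch shows how the similarity of a candidate split to the best split could be scored, assuming both splits have been evaluated on the same set of rows of the node :

// A sketch (not part of the implementation above) of scoring a surrogate split :
// the similarity is the fraction of the node's rows that both splits send to the same side.
double surrogateSimilarity(const Split &bestSplit, const Split &candidateSplit) {
  
  std::set<int> commonLeft, commonRight;
  
  std::set_intersection(bestSplit.leftRows.begin(), bestSplit.leftRows.end(),
                        candidateSplit.leftRows.begin(), candidateSplit.leftRows.end(),
                        std::inserter(commonLeft, commonLeft.end()));
  
  std::set_intersection(bestSplit.rightRows.begin(), bestSplit.rightRows.end(),
                        candidateSplit.rightRows.begin(), candidateSplit.rightRows.end(),
                        std::inserter(commonRight, commonRight.end()));
  
  double totalRows = bestSplit.leftRows.size() + bestSplit.rightRows.size();
  
  return (commonLeft.size() + commonRight.size())/totalRows;
}

At training time one would store the few highest-scoring surrogate splits at each node, and at prediction time fall back to the first surrogate whose feature is present in the test instance.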

Int_Double treePredict(Node* node, Int_Double &colValMap) {
  
  if (node->bestSplit.featureIndex == -1) return node->classProbs;
  
  else {
    int decisionFeature = node->bestSplit.featureIndex;
    double decisionVal = node->bestSplit.featureDecisionVal;
    
    if ((colValMap.find(decisionFeature) ==  colValMap.end()) 
          || colValMap[decisionFeature] <= decisionVal) return treePredict(node->left, colValMap);
    
    else return treePredict(node->right, colValMap);
  }
}

Basically, up to this step I have defined all the modules for constructing a fully grown classification tree. But there is a high probability that a fully grown classification tree overfits the training instances and does not generalize well to test instances. For example, if there is a leaf node for each training instance, then the tree is highly overfitted and will perform poorly on unseen test instances. In order for the tree to generalize well to test instances, we prune the leaf nodes. There are multiple ways to prune a tree.

I will be using the cost-complexity pruning algorithm to prune this particular tree. In cost-complexity pruning, we associate a cost with the subtree rooted at a node. For a subtree T_t rooted at a node t, the associated cost is :

R(T_t)+{\alpha}|f(T_t)|

where R(T_t) is the re-substitution error of the subtree, which is equal to the sum of the re-substitution errors of its leaf nodes. We define the re-substitution error of a node t as :

R(t)=1-\text{max}P(C_k|t)

where \text{max}P(C_k|t) is the probability of the best class at node t. For example, if at an internal node t of a binary classifier the probability of class 0 is 0.85, then the re-substitution error at this node is 0.15. In other words, it is the error rate we would incur if we converted node t into a leaf node.

R(T_t)=\sum_{t'{\epsilon}f(T_t)}R(t')

where f(T_t) is the set of leaf nodes of the subtree rooted at node t. Thus |f(T_t)| is the number of leaf nodes of that subtree. The parameter \alpha is the cost associated with each leaf node. As we said earlier, a tree with a large number of leaf nodes is probably overfitting, and the parameter \alpha penalizes exactly that. The critical value of \alpha for node t is the one at which the cost of keeping the subtree rooted at t equals the cost of converting t into a leaf node, i.e.

R(T_t)+{\alpha}|f(T_t)|=R(t)+\alpha

solving for \alpha, we have:

\alpha=\frac{R(t)-R(T_t)}{|f(T_t)|-1}
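
For example (with hypothetical numbers), if R(t) = 0.10, R(T_t) = 0.04 and the subtree has |f(T_t)| = 4 leaves, then \alpha = (0.10-0.04)/(4-1) = 0.02. At each pruning step, the node with the smallest such \alpha (the "weakest link") is the one collapsed into a leaf.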

double resubstitutionErrorNode(Node* node, 
                               InputMetaData &metaData, int totalRows) {

  Int_Double classProbs = computeClassProbs(node->rows, metaData);
  std::pair<int, double> best = bestClass(classProbs);

  return (1-best.second)*(double)node->rows.size()/(double)totalRows;
}
Node* resubstitutionErrorBranches(Node* node, InputMetaData &metaData, int totalRows) {

  if (node->left == NULL) node->resubErrorBranch = resubstitutionErrorNode(node, metaData, totalRows);

  else {
    Node* n1 = resubstitutionErrorBranches(node->left, metaData, totalRows);
    Node* n2 = resubstitutionErrorBranches(node->right, metaData, totalRows);

    node->resubErrorBranch = n1->resubErrorBranch + n2->resubErrorBranch;
  }

  return node;
}

The above functions compute the re-substitution error at a node as well as for the subtree rooted at a node.

Next we define the function to compute the number of leaves of the subtree rooted at each node t. This is pretty simple with recursion.

Node* nodeNumLeaves(Node* node) {

  if (node->left == NULL) node->numLeavesBranch = 1;
  else node->numLeavesBranch = nodeNumLeaves(node->left)->numLeavesBranch + nodeNumLeaves(node->right)->numLeavesBranch;

  return node;
}

Then I define the following functions for pruning the original tree using cross-validation on the cost-complexity pruned trees, as specified in the paper mentioned at the beginning of the post.

std::pair<Node*, double> getMinComplexityNodes(Node* node, InputMetaData &metaData) {
  
  Node* n = node;
  std::queue<Node*> nodeQ;
  
  nodeQ.push(n);
  
  int totalRows = node->rows.size();
  
  double minComplexity = std::numeric_limits<double>::max();
  Node* minComplexityNode = NULL;
  
  while(!nodeQ.empty()) {
    Node* n = nodeQ.front();
    nodeQ.pop();
    
    double a = resubstitutionErrorNode(n, metaData, totalRows);
    double b = n->resubErrorBranch;
    int c = n->numLeavesBranch;
    
    double g = (a-b)/((double)c-1);
    
    if (g < minComplexity || (g == minComplexity && minComplexityNode != NULL && n->depth < minComplexityNode->depth)) {
      minComplexity = g;
      minComplexityNode = n;
    }
    
    if (n->left != NULL && n->left->left != NULL) nodeQ.push(n->left);
    if (n->right != NULL && n->right->left != NULL) nodeQ.push(n->right);
  }
  
  return std::make_pair(minComplexityNode, minComplexity);
}

The below function is used to search for a node in the entire tree and prune it. Two nodes are compared based on their split attributes, since the same split attributes will not be used more than once. When we find the node to prune, we make it a leaf node by calling the getLeafNode() method.

Node* pruneNode(Node* root, Node* node, InputMetaData &metaData) {
  if (node == NULL || root->left == NULL) return root;
  else if (root->bestSplit.featureIndex == node->bestSplit.featureIndex 
             && root->bestSplit.featureDecisionVal == node->bestSplit.featureDecisionVal) 
    return getLeafNode(root->rows, metaData);
  else {
    root->left = pruneNode(root->left, node, metaData);
    root->right = pruneNode(root->right, node, metaData);
    
    return root;
  }
}

The below function is used to generate the sequence of nodes to prune from the original tree (training data). Note that I am not actually storing the pruned subtrees in the output of the function; this is done in order not to consume too much memory, as there could be a lot of redundant data between consecutive trees in the sequence, which differ by only one node.

std::vector<std::pair<Node*, double>> complexityPruneTreeSeq(Node* root, InputMetaData &metaData) {
  std::vector<std::pair<Node*, double>> output;
  
  root = resubstitutionErrorBranches(root, metaData, root->rows.size());
  root = nodeNumLeaves(root);
  
  output.push_back(std::make_pair((Node*)NULL, 0));
  
  while(root->left != NULL) {
    std::pair<Node*, double> minComplexityNodes = getMinComplexityNodes(root, metaData);
    
    root = pruneNode(root, minComplexityNodes.first, metaData);
    
    root = resubstitutionErrorBranches(root, metaData, root->rows.size());
    root = nodeNumLeaves(root);
    
    output.push_back(std::make_pair(minComplexityNodes.first, minComplexityNodes.second));
  }
  
  return output;
}

The below function is used to generate the sequence of cost-complexity pruned subtrees on the cross-validation training data. The sequence of pruned subtrees is computed based on the complexity values found by the above function. For example, if two consecutive complexity values determined by the above function are \alpha_k and \alpha_{k+1}, then the corresponding tree from the cross-validation training data is the one whose complexity value lies in the range [\alpha_k, \alpha_{k+1}).

std::vector<std::pair<Node*, int>> complexityPruneTreeSeqWithComplexities(Node* root, InputMetaData &metaData, 
                                                                          const std::vector<double> &complexities) {
  
  std::vector<std::pair<Node*, int>> output;
  
  root = resubstitutionErrorBranches(root, metaData, root->rows.size());
  root = nodeNumLeaves(root);
  
  output.push_back(std::make_pair((Node*)NULL, 0));
  
  int i = 0;
  
  while(root->left != NULL) {
    std::pair<Node*, double> minComplexityNodes = getMinComplexityNodes(root, metaData);
    
    root = pruneNode(root, minComplexityNodes.first, metaData);
    
    root = resubstitutionErrorBranches(root, metaData, root->rows.size());
    root = nodeNumLeaves(root);
    
    if (minComplexityNodes.second < complexities[i]) output.push_back(std::make_pair(minComplexityNodes.first, -1));
    else {
      while (i <= complexities.size()-1 && minComplexityNodes.second >= complexities[i]) i++;
      output.push_back(std::make_pair(minComplexityNodes.first, i));
    }
  }
  
  return output;
}
Node* costComplexityPrune(Node* node, InputMetaData &metaData, const int &numFoldsCV, const int &maxDepth) {
  
  std::vector<double> updatedComplexities;
  std::vector<std::pair<Node*, double>> treeSeq = complexityPruneTreeSeq(copyNode(node), metaData);
  
  for (int i = 0; i < treeSeq.size()-1; i++) updatedComplexities.push_back(sqrt(treeSeq[i].second*treeSeq[i+1].second));
  
  updatedComplexities.push_back(std::numeric_limits<double>::max());
  
  std::set<int> allRows = node->rows;
  
  std::vector<int> rowsVec(allRows.begin(), allRows.end());
  std::random_shuffle(rowsVec.begin(), rowsVec.end());
  
  int foldSize = std::ceil((double)allRows.size()/(double)numFoldsCV);
  
  Int_Double complexityErrorMap;
  
  for (int i = 1; i <= numFoldsCV; i++) {
    int start = (i-1)*foldSize;
    int end = i*foldSize;
    
    end = (end > rowsVec.size()) ? rowsVec.size():end;
    
    std::set<int> validationRows(rowsVec.begin()+start, rowsVec.begin()+end);
    std::set<int> trainRows;
    
    std::set_difference(allRows.begin(), allRows.end(), validationRows.begin(), validationRows.end(),
                        std::inserter(trainRows, trainRows.begin()));
    
    Node* tree = constructTree(trainRows, metaData, giniImpurity, maxDepth);
    
    std::vector<std::pair<Node*, int>> subTreeSeq = complexityPruneTreeSeqWithComplexities(copyNode(tree), metaData, updatedComplexities);
    
    for (int j = 0; j < subTreeSeq.size(); j++) {
      Node* nodeToPrune = subTreeSeq[j].first;
      
      tree = pruneNode(tree, nodeToPrune, metaData);
      
      if (subTreeSeq[j].second != -1) {
        double error = 0;
        
        for (auto q = validationRows.begin(); q != validationRows.end(); ++q) {
          Int_Double colValMap = metaData.rowColValMap[*q];
          Int_Double predClasses = treePredict(tree, colValMap);
          
          std::pair<int, double> out = bestClass(predClasses);
          
          if (out.first != metaData.rowClassMap[*q]) error++;
        }
        
        error = (double)error/(double)validationRows.size();
        
        complexityErrorMap[subTreeSeq[j].second] += error;
      }
    }
  }
  
  double minError = std::numeric_limits<double>::max();
  int complexityIndex = 0;
  
  for (auto p = complexityErrorMap.begin(); p != complexityErrorMap.end(); ++p) {
    if (p->second < minError) {
      minError = p->second;
      complexityIndex = p->first;
    }
  }
  
  for (int i = 0; i <= complexityIndex; i++) node = pruneNode(node, treeSeq[i].first, metaData);
  
  return node;
}

Finally I come to the main interface function that is called from the R engine for constructing the original tree and pruning it. It accepts a data-frame of (i, j, v) triplets for the document term matrix of the training data, a vector of integer class labels, the maximum depth of the tree that we wish to grow to, and an optional number of cross-validation rounds for validating and selecting the best tree from the pruned subtrees.

I construct the metadata for the training data and store it in the "metaData" struct object. I am not yet passing the instance weights to the tree construction function, but generating them within the function with equal weights for all instances. Since I am also returning the class probabilities for each leaf node of the final tree model, I am using a separate data-frame "leafNodeClassProbs" for storing the <leaf node index, class label, class probability> triplets along with the model.

// [[Rcpp::export]]
List cpp__tree(DataFrame inputSparseMatrix, std::vector<int> classLabels, 
               int maxDepth=100, int cvRounds=5) {
  
  InputMetaData metaData;

  std::vector<DataFrame> trees, leafNodeClassProbs;
  std::vector<double> complexities;
  
  std::vector<int> rows = inputSparseMatrix["i"];
  std::vector<int> cols = inputSparseMatrix["j"];
  std::vector<double> vals = inputSparseMatrix["v"];
  
  std::set<int> uniqueRows(rows.begin(), rows.end());
  std::set<int> uniqueCols(cols.begin(), cols.end());
  std::set<int> uniqueLabels(classLabels.begin(), classLabels.end());
  
  metaData.classLabels = uniqueLabels;
  
  Int_SetInt colRows;
  Int_SetInt colSparseRows;
  
  Int_VecPairIntDouble colValRowPairsMap;
  
  for (size_t i = 0; i < rows.size(); i++) colValRowPairsMap[cols[i]].push_back(std::make_pair(rows[i], vals[i]));
  for (size_t i = 0; i < rows.size(); i++) metaData.rowCols[rows[i]].insert(cols[i]);
  for (size_t i = 0; i < rows.size(); i++) colRows[cols[i]].insert(rows[i]);
  
  for (size_t i = 0; i < rows.size(); i++) metaData.rowColValMap[rows[i]][cols[i]] = vals[i];
  
  for (auto p = uniqueRows.begin(); p != uniqueRows.end(); ++p) metaData.rowClassMap[*p] = classLabels[*p-1];
  
  for (auto p = uniqueRows.begin(); p != uniqueRows.end(); ++p) metaData.instanceWeights[*p]=1/(double)uniqueRows.size();
  
  for (auto p = uniqueRows.begin(); p != uniqueRows.end(); ++p) {
    for (auto q = uniqueLabels.begin(); q != uniqueLabels.end(); ++q) metaData.rowClassCounts[*p][*q] = 0;
    metaData.rowClassCounts[*p][metaData.rowClassMap[*p]] = 1;
  }
  
  for (auto p = colValRowPairsMap.begin(); p != colValRowPairsMap.end(); ++p) {
    std::vector<std::pair<int, double>> valRowsPairs = p->second;
    
    std::sort(valRowsPairs.begin(), valRowsPairs.end(), 
              [](const std::pair<int, double> &a, const std::pair<int, double> &b){return a.second < b.second;});
    
    for (size_t i = 0; i < valRowsPairs.size(); i++) metaData.colSortedRows[p->first].push_back(valRowsPairs[i].first);
  }
  
  colValRowPairsMap.clear();
  
  for (auto p = colRows.begin(); p != colRows.end(); ++p) {
    std::set<int> nonSparseRows = p->second;
    std::set<int> sparseRows;
    
    std::set_difference(uniqueRows.begin(), uniqueRows.end(), nonSparseRows.begin(), nonSparseRows.end(),
                        std::inserter(sparseRows, sparseRows.end()));
    
    metaData.colSparseRows[p->first] = sparseRows;
  }
  
  colRows.clear();
  
  Node* node = constructTree(uniqueRows, metaData, giniImpurity, maxDepth);

  Node* prunedNode = costComplexityPrune(node, metaData, cvRounds, maxDepth);

  DataFrame predClassProbs;

  trees.push_back(transformTreeIntoDF(prunedNode, predClassProbs));
  leafNodeClassProbs.push_back(predClassProbs);

  return List::create(_["Trees"]=trees, _["TreeClassProbs"]=leafNodeClassProbs);
}

The only part that now remains is the module for doing prediction on test instances, given that we have already generated the model and saved it into an R data-frame object. I need to pass this model and the leaf node class probabilities, along with the set of test instances, and get back the predicted class probabilities for each test instance. The code below does exactly that.

struct NodeDF {
  int index, leftIndex, rightIndex, featureIndex;
  double decisionVal;
};
Int_Double dfPredict(std::unordered_map<int, NodeDF> &nodeDFMap,
                     Int_Double &colValMap,
                     Int_Int_Double &predClassProbs) {

  int index = 0;

  while(true) {
    if (nodeDFMap[index].featureIndex == -1) return predClassProbs[index];
    else {
      int decisionFeature = nodeDFMap[index].featureIndex;
      double decisionVal = nodeDFMap[index].decisionVal;

      if (colValMap.find(decisionFeature) == colValMap.end() 
          || colValMap[decisionFeature] <= decisionVal) index = nodeDFMap[index].leftIndex;
      else index = nodeDFMap[index].rightIndex;
    }
  }
}
// [[Rcpp::export]]
Int_Int_Double cpp__test(DataFrame inputSparseMatrix, List modelsMetaData) {

  Int_Int_Double output;

  std::vector<int> rows = inputSparseMatrix["i"];
  std::vector<int> cols = inputSparseMatrix["j"];
  std::vector<double> vals = inputSparseMatrix["v"];

  Int_Int_Double rowColValMap;

  for (size_t i = 0; i < rows.size(); i++) rowColValMap[rows[i]][cols[i]] = vals[i];

  std::vector<DataFrame> models = modelsMetaData["Trees"];
  std::vector<DataFrame> leafNodeClassProbs = modelsMetaData["TreeClassProbs"];

  DataFrame model = models[0];
  
  std::vector<int> nodeIndex = model["NodeIndex"];
  std::vector<int> leftNodeIndex = model["LeftNodeIndex"];
  std::vector<int> rightNodeIndex = model["RightNodeIndex"];
  std::vector<int> featureIndex = model["FeatureIndex"];
  std::vector<double> featureDecisionVal = model["FeatureDecisionVal"];

  DataFrame leafNodeClassProb = leafNodeClassProbs[0];
  
  std::vector<int> leafIndex = leafNodeClassProb["LeafIndex"];
  std::vector<int> leafLabel = leafNodeClassProb["LeafLabel"];
  std::vector<double> leafLabelProb = leafNodeClassProb["LeafLabelProb"];
  
  Int_Int_Double predClassProbs;
  
  for (size_t j = 0; j < leafIndex.size(); j++) predClassProbs[leafIndex[j]][leafLabel[j]] = leafLabelProb[j];
  
  std::unordered_map<int, NodeDF> nodeDFMap;
  
  for (size_t j = 0; j < nodeIndex.size(); j++) {
    NodeDF df;
    
    df.index = nodeIndex[j];
    df.decisionVal = featureDecisionVal[j];
    df.featureIndex = featureIndex[j];
    df.leftIndex = leftNodeIndex[j];
    df.rightIndex = rightNodeIndex[j];
    
    nodeDFMap[nodeIndex[j]] = df;
  }
  
  for (auto p = rowColValMap.begin(); p != rowColValMap.end(); ++p) {
    Int_Double probs = dfPredict(nodeDFMap, p->second, predClassProbs);
    output[p->first]=probs;
  }

  return output;
}

Basically, I am not converting the output data-frame model back into a tree but using it directly in the C++ code to do prediction. Finally I return the set of class probabilities for each test instance.

There are some generic cases that I have omitted while constructing the classification tree. The most important of them is the handling of categorical features. Unlike numerical features, categorical features such as gender or hair color have no particular ordering.

For a categorical feature with L distinct values, the number of possible splits is 2^{L-1}-1. If the possible values of this feature are denoted as :

{b_1, b_2, ..., b_L}

Then each one of the b_i's can either be assigned to the left subtree (1) or not (0), giving 2^L different configurations. But 2 of them are not useful for splitting: the ones in which all b_i=0 or all b_i=1. We also need to account for symmetry of the configurations, i.e. {1, 1, 0, 1} represents the same split as {0, 0, 1, 0}, thus the maximum possible number of splits is :

\frac{1}{2}[2^L-2]=2^{L-1}-1

But when the possible values of this categorical feature are mutually exclusive, such as gender or nationality, we need to consider only L different splits (one value versus the rest), so it is much easier with mutually exclusive categories.
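
Categorical splits are not handled in this implementation, but as a minimal sketch, the 2^{L-1}-1 candidate left-subtree subsets of a categorical feature with L levels (represented here, hypothetically, by the level indices 0 to L-1) could be enumerated as follows :

// A sketch (not part of the implementation above) of enumerating the 2^(L-1)-1
// candidate left-subtree subsets for a categorical feature with L levels (small L).
// Fixing the last level to always stay on the right removes the mirror-image duplicates.
std::vector<std::set<int>> categoricalSplits(int L) {
  std::vector<std::set<int>> splits;
  
  for (int mask = 1; mask < (1 << (L-1)); mask++) {
    std::set<int> leftLevels;
    for (int b = 0; b < L-1; b++) if (mask & (1 << b)) leftLevels.insert(b);
    splits.push_back(leftLevels);
  }
  
  return splits;   // contains 2^(L-1)-1 subsets
}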

The way the model is trained from R function calls is shown below :

df.train <- data.frame("i"=train.dtm$i, "j"=train.dtm$j, "v"=train.dtm$v)
models <- cpp__tree(df.train, class.labels[train.idx], cvRounds = 3)

where "train.dtm" is a simple triplet matrix of the training document term matrix. Below is sample snapshot of how a simple triplet matrix looks like.

> str(train.dtm)
List of 6
 $ i       : int [1:8691] 101 108 90 118 82 60 10 28 56 33 ...
 $ j       : int [1:8691] 1 1 1 1 1 1 1 1 1 1 ...
 $ v       : num [1:8691] 0.0453 0.0762 0.0286 0.0519 0.0223 0.0039 0.0123 0.009 0.0124 0.0298 ...
 $ nrow    : int 145
 $ ncol    : int 60
 $ dimnames:List of 2
  ..$ : chr [1:145] "135" "53" "115" "81" ...
  ..$ : chr [1:60] "V1" "V2" "V3" "V4" ...
 - attr(*, "class")= chr "simple_triplet_matrix"

Prediction on test instances is done as follows :

df.test <- data.frame("i"=test.dtm$i, "j"=test.dtm$j, "v"=test.dtm$v)
preds <- cpp__test(df.test, models)
  
preds <- unlist(lapply(preds, function(x) as.integer(names(which(x == max(x))))))
mean(preds == class.labels[test.idx])

The "cpp__test" function returns the predicted class probabilities for each class and for each test instance. For computing the accuracy of the prediction, we extract the classes with the highest probabilities for each test instance and compare with the true labels of the test instances.

The entire code has been shared on my Github repository.

The classification tree can easily be extended to a multi-class AdaBoost implementation by modifying only the tree construction interface. In the multi-class AdaBoost implementation, we use the SAMME technique for extending binary AdaBoost to the multi-class setting. Following is the updated tree construction function :

// [[Rcpp::export]]
List cpp__adaBoostedTree(DataFrame inputSparseMatrix, std::vector<int> classLabels, 
                         int boostingRounds = 5, int maxDepth=100, int cvRounds=5) {
  
  std::vector<DataFrame> trees;
  std::vector<double> treeWeights;
  
  std::vector<DataFrame> leafNodeClassProbs;
  
  InputMetaData metaData;
  
  std::vector<int> rows = inputSparseMatrix["i"];
  std::vector<int> cols = inputSparseMatrix["j"];
  std::vector<double> vals = inputSparseMatrix["v"];
  
  std::set<int> uniqueRows(rows.begin(), rows.end());
  std::set<int> uniqueCols(cols.begin(), cols.end());
  std::set<int> uniqueLabels(classLabels.begin(), classLabels.end());
  
  metaData.classLabels = uniqueLabels;
  
  Int_SetInt colRows;
  Int_SetInt colSparseRows;
  
  Int_VecPairIntDouble colValRowPairsMap;
  
  for (size_t i = 0; i < rows.size(); i++) colValRowPairsMap[cols[i]].push_back(std::make_pair(rows[i], vals[i]));
  for (size_t i = 0; i < rows.size(); i++) metaData.rowCols[rows[i]].insert(cols[i]);
  for (size_t i = 0; i < rows.size(); i++) colRows[cols[i]].insert(rows[i]);
  
  for (size_t i = 0; i < rows.size(); i++) metaData.rowColValMap[rows[i]][cols[i]] = vals[i];
  
  for (auto p = uniqueRows.begin(); p != uniqueRows.end(); ++p) metaData.rowClassMap[*p] = classLabels[*p-1];
  
  for (auto p = uniqueRows.begin(); p != uniqueRows.end(); ++p) metaData.instanceWeights[*p]=1/(double)uniqueRows.size();
  
  for (auto p = uniqueRows.begin(); p != uniqueRows.end(); ++p) {
    for (auto q = uniqueLabels.begin(); q != uniqueLabels.end(); ++q) metaData.rowClassCounts[*p][*q] = 0;
    metaData.rowClassCounts[*p][metaData.rowClassMap[*p]] = 1;
  }
  
  for (auto p = colValRowPairsMap.begin(); p != colValRowPairsMap.end(); ++p) {
    std::vector<std::pair<int, double>> valRowsPairs = p->second;
    
    std::sort(valRowsPairs.begin(), valRowsPairs.end(), 
              [](const std::pair<int, double> &a, const std::pair<int, double> &b){return a.second < b.second;});
    
    for (size_t i = 0; i < valRowsPairs.size(); i++) metaData.colSortedRows[p->first].push_back(valRowsPairs[i].first);
  }
  
  colValRowPairsMap.clear();
  
  for (auto p = colRows.begin(); p != colRows.end(); ++p) {
    std::set<int> nonSparseRows = p->second;
    std::set<int> sparseRows;
    
    std::set_difference(uniqueRows.begin(), uniqueRows.end(), nonSparseRows.begin(), nonSparseRows.end(),
                        std::inserter(sparseRows, sparseRows.end()));
    
    metaData.colSparseRows[p->first] = sparseRows;
  }
  
  colRows.clear();
  
  int iterNum = 1;

  while(iterNum <= boostingRounds) {

    Node* tree = constructTree(uniqueRows, metaData, giniImpurity, maxDepth);

    tree = costComplexityPrune(tree, metaData, cvRounds, maxDepth);

    double err = 0;

    std::set<int> errorRows;

    for (auto p = uniqueRows.begin(); p != uniqueRows.end(); ++p) {
      std::unordered_map<int, double> predClassProbs = treePredict(tree, metaData.rowColValMap[*p]);
      std::pair<int, double> best = bestClass(predClassProbs);

      if (best.first != metaData.rowClassMap[*p]) {
        errorRows.insert(*p);
        err += metaData.instanceWeights[*p];
      }
    }
    
    double treeWt;
    
    if (err > 0.5 || err <= 0) {
      DataFrame predClassProbs;
      
      trees.push_back(transformTreeIntoDF(tree, predClassProbs));
      treeWeights.push_back(1.0);
      
      leafNodeClassProbs.push_back(predClassProbs);
      break;
    }
    else {
      treeWt = log((1-err)/err)+log(uniqueLabels.size()-1);
      
      DataFrame predClassProbs;
      
      trees.push_back(transformTreeIntoDF(tree, predClassProbs));
      treeWeights.push_back(treeWt);
      
      leafNodeClassProbs.push_back(predClassProbs);
      
      double sumWeights = 0;
      
      for (auto p = uniqueRows.begin(); p != uniqueRows.end(); ++p) {
        if (errorRows.find(*p) != errorRows.end()) metaData.instanceWeights[*p] *= exp(treeWt);
        sumWeights += metaData.instanceWeights[*p];
      }
      
      for (auto p = uniqueRows.begin(); p != uniqueRows.end(); ++p) metaData.instanceWeights[*p] /= sumWeights;
      
      iterNum++;
    }
  }

  double sumTreeWeights = 0;

  for (size_t i = 0; i < treeWeights.size(); i++) sumTreeWeights += treeWeights[i];
  for (size_t i = 0; i < treeWeights.size(); i++) treeWeights[i] /= sumTreeWeights;
  
  return List::create(_["Trees"]=trees, _["TreeWeights"]=treeWeights, _["TreeClassProbs"]=leafNodeClassProbs);
}

I am adding a "boostingRounds" parameter for the number of rounds of boosting.

To compute the complexity of constructing the tree, we first look at the complexity of constructing each node. In each node construction step, we iterate over all the features in the worst case, and for each feature we iterate over the current set of training instances at that node. This step takes O(F*N_n), where F is the total number of features and N_n is the number of instances at this particular node 'n'.

Now for each instance, we add the class counts of that instance to the running counts of the previous instances (sorted in increasing order of feature values). Each such step takes O(K), where K is the total number of classes. Thus the time complexity at each node 'n' is :

O(F*K*N_n)

At each level of the tree, if we add up the complexities, we get O(F*K*N), where N is the total number of training instances, since at each level the union of the instances across the nodes equals the full set of instances. Given that we grow the tree to a maximum depth of D, the total time complexity for the tree is :

O(F*K*D*N)

If the value of D is not specified, the maximum depth of a perfectly balanced tree is D=O(log_2(N)), whereas for a perfectly imbalanced tree it is D=O(N). The time taken to construct the tree could be improved by parallelizing across each level of the tree, since the nodes at a level are constructed independently of the other nodes at the same level.
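
For instance (with hypothetical numbers), with N = 10^5 training instances, F = 10^4 features, K = 20 classes and a maximum depth of D = 30, the bound works out to roughly 6*10^{11} elementary operations, which is why the per-level parallelization mentioned above matters in practice.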

The code is shared on my Github page.
