Gradient Boosted Trees are tree ensemble algorithms, similar to Random Forests. But unlike Random Forests, where the trees are constructed independently of each other, in gradient boosting the tree in each round is boosted based on the errors of the trees from the previous rounds. There are further differences in how GBT reduces the error (bias + variance) compared to RF. In this post, we will construct boosted trees using CART (Classification and Regression Trees) as the base weak learner, and apply them to a binary classification problem.

In a random forest, the final model comprises uncorrelated trees, each of which predicts independently on every test instance, and the final prediction is taken by majority vote. In gradient boosted trees, each tree is constructed iteratively by optimizing an objective function that depends on the trees from the previous rounds. The final prediction is the sum of the predictions from each tree (regression trees) or a majority vote (classification trees).

Given that a model is parameterized by $\theta$, the objective function is defined to be:

$Obj(\theta) = L(\theta) + \Omega(\theta)$

where $L(\theta)$ is the loss function for the model and $\Omega(\theta)$ is the regularization term. Our goal is to minimize the above objective by minimizing the loss (i.e. minimizing the error on the training data itself) while also keeping the model simple (i.e. generalizing well to different datasets). Minimizing the objective helps us achieve both a low bias (minimizing loss, no underfitting) and a low variance (simpler model, no overfitting).

For example, in linear regression, given that the true outputs for the N training examples are $y_1, y_2, \ldots, y_N$ and the outputs predicted by the regression model are $\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_N$, where

$\hat{y}_i = \sum_{j=1}^d \theta_j x_{ij}$

the squared loss is given as:

$L(\theta) = \sum_{i=1}^N (y_i - \hat{y}_i)^2$

and the regularization term is a function of the model parameters $\theta$, generally taken to be either the L1 regularizer

$\Omega(\theta) = \lambda \sum_{j=1}^d |\theta_j|$

or the L2 regularizer

$\Omega(\theta) = \lambda \sum_{j=1}^d \theta_j^2$

where d is the dimensionality of the input variable, or in other words the number of features in the model. Thus the objective to be minimized with L2 regularization for linear regression is given by:

$Obj(\theta) = \sum_{i=1}^N (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^d \theta_j^2$

Similarly, in logistic regression the objective would be given as:

$Obj(\theta) = \sum_{i=1}^N \left[ y_i \ln(1 + e^{-\hat{y}_i}) + (1 - y_i) \ln(1 + e^{\hat{y}_i}) \right] + \lambda \sum_{j=1}^d \theta_j^2$

where

$\hat{y}_i = \sum_{j=1}^d \theta_j x_{ij}$

One can use an optimization technique like stochastic gradient descent (SGD) to minimize the above objectives. In SGD, we iteratively update the values of the model parameters till convergence using the rule

$\theta_j \leftarrow \theta_j - \eta \frac{\partial\, Obj(\theta)}{\partial \theta_j}$

which gives us the optimum objective.

To obtain an output from an ensemble of trees, we will assume that the trees are regression trees, so that the above loss functions (squared loss and logistic loss) can be applied directly here too.

Let there be an ensemble of K trees, and let the output from tree t for input data x be $f_t(x)$. The final combined output of all trees is then given as:

$\hat{y}_i = \sum_{t=1}^K f_t(x_i)$

In boosting, instead of growing the K trees in parallel, we iteratively optimize the predictions $\hat{y}_i^{(t)}$ by adding a new regression tree function $f_t$ in each round:

$\hat{y}_i^{(0)} = 0, \quad \hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$

In each round, the new tree complements the trees built in the previous rounds. This is the essence of boosting: add a new classifier to an existing set of classifiers so that the new classifier better handles the examples that the existing classifiers did not handle correctly.

In the above representation, the final score for each of the 5 test instances is computed as the sum of the scores from each tree. For the classification of whether an instance "Likes Computer Games", the boy receives a total score of 2.9 from the two trees whereas the old man gets a total score of -1.9, indicating that the boy "Likes Computer Games" whereas the old man "Does not Like Computer Games".

The new regression tree function $f_t$ is added in a greedy manner, such that the combined prediction $\hat{y}^{(t)}$ minimizes the objective function defined earlier. Thus we need to analyze which function $f_t$ to choose in each round so that the overall objective is minimized.

In a regression tree, each leaf node j stores a score $w_j$. Later we will see how we obtain these scores. The function of a regression tree can be viewed as the vector of leaf scores corresponding to each instance. For example, for a tree built using 5 training instances labelled [1, 2, 3, 4, 5], with 3 leaf nodes having the scores [2, 0.1, -1] and the arrangement of the 5 instances in the 3 leaves being [{2}, {4}, {1,3,5}], the tree is given by the vector:

$f = [-1,\ 2,\ -1,\ 0.1,\ -1]$

because instance 1 is in the 3rd leaf node with score -1, instance 2 is in the 1st leaf node with score 2, and so on. Thus we can write:

$f(x_i) = w_{q(x_i)}$

where the function $q$ maps a training instance to the index of the leaf node where it belongs, i.e. $q(x_1) = 3$, $q(x_2) = 1$, and so on.

The complexity of a tree in the objective to minimize is defined to be:

$\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^T w_j^2$

where T is the number of leaves in the tree and $w_j$ are the leaf scores defined above. This is the regularization term for the regression tree, because a high L2 norm of the leaf scores and a higher number of leaf nodes both lead to possible overfitting (high variance), hence we include this term in our objective. The objective function for the regression tree in round t is thus:

$Obj^{(t)} = \sum_{i=1}^N l(y_i, \hat{y}_i^{(t)}) + \Omega(f_t)$

where $l(y_i, \hat{y}_i^{(t)})$ is the loss for the i-th training instance.

Substituting $\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$ in the above equation, we get:

$Obj^{(t)} = \sum_{i=1}^N l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \Omega(f_t)$

Using the second-order Taylor series expansion of the loss around $\hat{y}_i^{(t-1)}$, we get the following approximation:

$Obj^{(t)} \approx \sum_{i=1}^N \left[ l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$

where

$g_i = \frac{\partial\, l(y_i, \hat{y}_i^{(t-1)})}{\partial \hat{y}_i^{(t-1)}}$

and

$h_i = \frac{\partial^2 l(y_i, \hat{y}_i^{(t-1)})}{\partial \left(\hat{y}_i^{(t-1)}\right)^2}$

Since $l(y_i, \hat{y}_i^{(t-1)})$ is a constant in round t, removing it from the objective, we obtain the final form of the objective that needs to be minimized:

$Obj^{(t)} = \sum_{i=1}^N \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$

From the above definitions of $f_t(x_i) = w_{q(x_i)}$ and $\Omega(f_t)$, on substitution we get:

$Obj^{(t)} = \sum_{i=1}^N \left[ g_i w_{q(x_i)} + \frac{1}{2} h_i w_{q(x_i)}^2 \right] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^T w_j^2$

Grouping the terms by the leaf nodes, we have:

$Obj^{(t)} = \sum_{j=1}^T \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T$

where

$I_j = \{ i : q(x_i) = j \}$

is the set of training instances falling into leaf j. Defining

$G_j = \sum_{i \in I_j} g_i$

and

$H_j = \sum_{i \in I_j} h_i$

the objective becomes:

$Obj^{(t)} = \sum_{j=1}^T \left[ G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^2 \right] + \gamma T$

Differentiating w.r.t. $w_j$ and setting the derivatives to 0 to obtain the minimum value of the objective function, we get:

$G_j + (H_j + \lambda) w_j = 0$

or

$w_j^* = -\frac{G_j}{H_j + \lambda}$

These are the optimum values of $w_j$ for which the objective is minimized. The value of the minimal objective at $w_j = w_j^*$ is:

$Obj^* = -\frac{1}{2} \sum_{j=1}^T \frac{G_j^2}{H_j + \lambda} + \gamma T$

Note that the optimum leaf weights and the minimum objective derived above assume a fixed tree structure with T leaf nodes. But our goal is to construct the tree itself in a way that minimizes the objective function. As a brute-force approach, one could enumerate all possible trees, compute the optimal leaf weights and the minimum objective for each, and then choose the tree with the least objective. A faster but sub-optimal alternative is to construct the tree greedily.

It is a greedy method because each left and right sub-tree is constructed by minimizing the above objective function restricted to that sub-tree.

Step 1 : For each leaf node, choose the best split. To choose the best split, for each feature $x_k$, sort the training examples in the current node by their values of $x_k$.

Step 2 : For each unique value 'a' of the feature $x_k$, divide the examples into two parts: examples with $x_k \le a$ and examples with $x_k > a$.

Step 3 : The examples in the two parts constitute the two new leaf nodes. For example, if the left leaf node contains the examples [1, 4] and the right leaf node contains the examples [2, 3, 5], then we define:

$G_L = g_1 + g_4, \quad H_L = h_1 + h_4$

and

$G_R = g_2 + g_3 + g_5, \quad H_R = h_2 + h_3 + h_5$

where $g_i$ and $h_i$ are the first and second order gradients as defined above. The change in objective due to splitting one leaf node into two leaf nodes is the gain:

$Gain = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$

The term $\gamma$ is the complexity cost due to one additional leaf node.

Step 4 : For the feature $x_k$, choose the split value with the maximum gain.

Step 5 : Choose the feature and the split value for the current leaf node which gives the maximum gain among all features.

Step 6 : Recursively continue to split the leaf nodes at each level until either we reach the maximum specified depth of the tree or every split leads to a zero or negative gain.

Step 7 : Stopping splits at negative gains can prevent further splits with positive gains deeper down the tree. Hence, instead of stopping early due to a negative gain, grow the tree to the maximum depth and then prune the leaf nodes with negative gains.

Step 8 : The score at each leaf node j is given by the formula derived above:

$w_j = -\frac{G_j}{H_j + \lambda}$

where $G_j$ and $H_j$ are computed for leaf node j after the best split.

```cpp
struct Node {
    int featureIndex;
    double featureDecisionVal, leafScore;
    Node* left;
    Node* right;
};
```

We defined a C++ structure "Node" to represent the classification tree we are going to construct. Each node of the tree has a "featureIndex", which is used to split the current node into 2 child nodes. If the current node is a leaf node, then featureIndex is -1, to indicate that there is no further splitting. The value of the feature "featureIndex" at which we split the current node is given by the attribute "featureDecisionVal", i.e. if the value of the feature "featureIndex" for a given example is less than or equal to "featureDecisionVal", then we go to the left subtree, else we go to the right subtree. The attribute "leafScore" denotes the score stored in a leaf of the tree, i.e. $w_j$ in our earlier notation.

```cpp
double grad(const double &trueVal, const double &predVal) {
    // First derivative of the logistic loss w.r.t. the raw prediction.
    double prob = 1/(1+exp(-predVal));
    return prob-trueVal;
}

double hessian(const double &trueVal, const double &predVal) {
    // Second derivative of the logistic loss w.r.t. the raw prediction.
    double prob = 1/(1+exp(-predVal));
    return prob*(1-prob);
}
```

Next we define the gradients and the Hessians for the logistic loss objective. In the case of logistic loss, the loss is given as:

$l(y_i, \hat{y}_i) = y_i \ln(1 + e^{-\hat{y}_i}) + (1 - y_i) \ln(1 + e^{\hat{y}_i})$

The gradient of the above loss w.r.t. $\hat{y}_i$ is:

$g_i = p_i - y_i, \quad \text{where} \quad p_i = \frac{1}{1 + e^{-\hat{y}_i}}$

and similarly, the second derivative w.r.t. $\hat{y}_i$ is:

$h_i = p_i (1 - p_i)$

```cpp
Node* getLeafNode(const std::set<int> &rows, std::unordered_map<int, double> &g,
                  std::unordered_map<int, double> &h, const double &lambda) {
    Node *node = new Node();
    node->featureIndex = -1;        // -1 marks a leaf
    node->featureDecisionVal = -1;
    double G = 0, H = 0;
    for (auto p = rows.begin(); p != rows.end(); ++p) { G += g[*p]; H += h[*p]; }
    node->leafScore = -G/(H+lambda);   // w_j = -G_j / (H_j + lambda)
    node->left = NULL;
    node->right = NULL;
    return node;
}
```

The above function assigns the attributes for leaf nodes. The parameter "rows" denotes the examples from the training data that fall into this leaf. "g" and "h" are pre-computed for each row of the training data.

In our approach to constructing the binary classifier, we will assume that we are working with simple triplet matrices instead of full dense matrices. The advantage of simple triplet matrices is that they account for sparsity and thus make the training and testing matrices much smaller than full dense matrices. The disadvantage is that we need to handle missing values separately.

```cpp
typedef std::unordered_map<int, std::unordered_map<double, std::set<int>>> DataFormat;
```

```cpp
Node* constructTree(DataFormat &formattedData,
                    std::unordered_map<int, std::set<double>> &colSortedVals,
                    const std::set<int> &rows,
                    std::unordered_map<int, double> &g, std::unordered_map<int, double> &h,
                    const int &depth, const double &lambda, const double &gamma) {
    Node *node = new Node();
    if (depth == 0) node = getLeafNode(rows, g, h, lambda);
    else {
        // std::numeric_limits<double>::min() is the smallest positive double, so
        // only strictly positive gains can become the maximum (cf. Step 6).
        double maxFeatureGain = std::numeric_limits<double>::min();
        double featureDecisionVal = -1;
        int featureIdx = -1;
        double sumG = 0, sumH = 0;
        for (auto p = rows.begin(); p != rows.end(); ++p) { sumG += g[*p]; sumH += h[*p]; }
        std::set<int> leftRows, rightRows;

        // Restrict the feature -> value -> rows map to the rows of the current node.
        DataFormat newFormattedData;
        for (auto p = colSortedVals.begin(); p != colSortedVals.end(); ++p) {
            int feature = p->first;
            std::set<double> sortedVals = p->second;
            for (auto q = sortedVals.begin(); q != sortedVals.end(); ++q) {
                std::set<int> valRows = formattedData[feature][*q];
                std::set_intersection(rows.begin(), rows.end(), valRows.begin(), valRows.end(),
                                      std::inserter(newFormattedData[feature][*q],
                                                    newFormattedData[feature][*q].end()));
            }
        }

        std::unordered_map<int, std::set<int>> sparseMissingRows;
        for (auto p = colSortedVals.begin(); p != colSortedVals.end(); ++p) {
            int feature = p->first;
            std::set<double> sortedVals = p->second;

            // Rows of the current node that have no entry for this feature (missing values).
            std::set<int> thisRows;
            for (auto q = sortedVals.begin(); q != sortedVals.end(); ++q) {
                std::set<int> valRows = newFormattedData[feature][*q];
                thisRows.insert(valRows.begin(), valRows.end());
            }
            std::set_difference(rows.begin(), rows.end(), thisRows.begin(), thisRows.end(),
                                std::inserter(sparseMissingRows[feature],
                                              sparseMissingRows[feature].end()));

            // Prefix sums of gradients/hessians over the sorted feature values.
            std::unordered_map<double, double> sumGL, sumHL;
            double prevG = 0, prevH = 0;
            for (auto q = sortedVals.begin(); q != sortedVals.end(); ++q) {
                std::set<int> valRows = newFormattedData[feature][*q];
                double sum1 = 0, sum2 = 0;
                for (auto r = valRows.begin(); r != valRows.end(); ++r) { sum1 += g[*r]; sum2 += h[*r]; }
                if (q == sortedVals.begin()) {
                    // Missing rows are assumed to carry the minimum feature value.
                    for (auto r = sparseMissingRows[feature].begin();
                         r != sparseMissingRows[feature].end(); ++r) { sum1 += g[*r]; sum2 += h[*r]; }
                }
                sumGL[*q] = prevG + sum1;
                sumHL[*q] = prevH + sum2;
                prevG = sumGL[*q];
                prevH = sumHL[*q];
            }

            // Scan all candidate split values for this feature.
            double maxGain = std::numeric_limits<double>::min();
            double decisionVal = -1;
            for (auto q = sortedVals.begin(); q != sortedVals.end(); ++q) {
                double G_L = sumGL[*q];
                double H_L = sumHL[*q];
                double G_R = sumG - sumGL[*q];
                double H_R = sumH - sumHL[*q];
                double gain = 0.5*((G_L*G_L)/(H_L+lambda) + (G_R*G_R)/(H_R+lambda)
                                   - (sumG*sumG)/(sumH+lambda)) - gamma;
                if (gain > maxGain) { maxGain = gain; decisionVal = *q; }
            }
            if (maxGain > maxFeatureGain && decisionVal < *sortedVals.rbegin()) {
                maxFeatureGain = maxGain;
                featureDecisionVal = decisionVal;
                featureIdx = feature;
            }
        }

        if (featureIdx != -1) {
            // Split the rows at the chosen feature/value and recurse.
            std::set<double> sortedVals = colSortedVals[featureIdx];
            leftRows.insert(sparseMissingRows[featureIdx].begin(), sparseMissingRows[featureIdx].end());
            for (auto q = sortedVals.begin(); q != sortedVals.end(); ++q) {
                if (*q <= featureDecisionVal)
                    leftRows.insert(newFormattedData[featureIdx][*q].begin(),
                                    newFormattedData[featureIdx][*q].end());
                else
                    rightRows.insert(newFormattedData[featureIdx][*q].begin(),
                                     newFormattedData[featureIdx][*q].end());
            }
            node->featureIndex = featureIdx;
            node->featureDecisionVal = featureDecisionVal;
            node->leafScore = -1;
            node->left = constructTree(formattedData, colSortedVals, leftRows, g, h, depth-1, lambda, gamma);
            node->right = constructTree(formattedData, colSortedVals, rightRows, g, h, depth-1, lambda, gamma);
        } else node = getLeafNode(rows, g, h, lambda);
    }
    return node;
}
```

The above recursive function constructs the tree one split at a time. The recursion exits when we have reached the maximum specified depth of the tree, or when no split gives a positive gain; in either case we construct a leaf node with all the remaining training rows.

The input parameter "formattedData" is a two-level map containing the set of training rows for each unique feature and unique value combination, i.e. if the feature index 1, with value of 0.87 is contained in training rows 3, 6 and 7, then the map is

```cpp
formattedData[1][0.87] = {3, 6, 7};
```

The parameter "colSortedVals" is a map containing the set of sorted values for each unique feature. The parameter "rows" denotes the training rows that need to be considered for the current node. If the depth is non-zero, then we first create an alternative map "newFormattedData" that subsets only the training rows given in the parameter "rows". The variable "sparseMissingRows" is a map containing the set of all rows having missing values for a particular feature, i.e. if the feature with index 1 is not present in rows 5, 8 and 10, then:

sparseMissingRows[1]={5, 8, 10}

By default, we assume that these missing rows take the minimum feature value among all unique values of that feature. Next, for each feature, we use the sorted values map to find the value of that feature for which the gain on splitting the rows is maximum. We repeat this for all features and find the feature index and value with the overall maximum gain, split the node there, and recursively construct the left and right subtrees.

```cpp
double treePredict(Node* &node, std::unordered_map<int, double> &colValMap) {
    if (node->featureIndex == -1) return node->leafScore;   // leaf reached
    else {
        int decisionFeature = node->featureIndex;
        double decisionVal = node->featureDecisionVal;
        if (colValMap[decisionFeature] <= decisionVal) return treePredict(node->left, colValMap);
        else return treePredict(node->right, colValMap);
    }
}
```

The above "treePredict" function accepts the constructed tree and a single example as input. The example is in the sparse data format, i.e. a map from feature index to value. The function returns the score of the leaf where the decision path ends.

```cpp
struct NodeDF {
    int index, leftIndex, rightIndex, featureIndex;
    double decisionVal, leafScore;
};
```

```cpp
DataFrame transformTreeIntoDF(Node* &node) {
    std::vector<int> nodeIndex, leftNodeIndex, rightNodeIndex, featureIndex;
    std::vector<double> featureDecisionVal, leafScore;
    // Breadth-first traversal, assigning consecutive indices to the nodes.
    std::queue<Node*> nodeQ;
    nodeQ.push(node);
    int index = 0;
    int lastIndex = 0;
    while (!nodeQ.empty()) {
        Node* n = nodeQ.front();
        nodeIndex.push_back(index);
        featureIndex.push_back(n->featureIndex);
        featureDecisionVal.push_back(n->featureDecisionVal);
        leafScore.push_back(n->leafScore);
        if (n->left != NULL) { lastIndex++; leftNodeIndex.push_back(lastIndex); nodeQ.push(n->left); }
        else leftNodeIndex.push_back(-1);
        if (n->right != NULL) { lastIndex++; rightNodeIndex.push_back(lastIndex); nodeQ.push(n->right); }
        else rightNodeIndex.push_back(-1);
        nodeQ.pop();
        index++;
    }
    return DataFrame::create(_["NodeIndex"]=nodeIndex, _["LeftNodeIndex"]=leftNodeIndex,
                             _["RightNodeIndex"]=rightNodeIndex, _["FeatureIndex"]=featureIndex,
                             _["FeatureDecisionVal"]=featureDecisionVal, _["Score"]=leafScore);
}
```

```cpp
// [[Rcpp::export]]
std::vector<DataFrame> cpp__gbt(std::vector<int> rows, std::vector<int> cols,
                                std::vector<double> vals, std::vector<double> classLabels,
                                int depth = 5, int nrounds = 10, double lambda = 1,
                                double gamma = 0, double learningRate = 0.1) {
    std::unordered_map<int, std::set<double>> colSortedVals;
    DataFormat formattedData;
    std::set<int> uniqueRows(rows.begin(), rows.end());
    for (size_t i = 0; i < rows.size(); i++) colSortedVals[cols[i]].insert(vals[i]);
    for (size_t i = 0; i < rows.size(); i++) formattedData[cols[i]][vals[i]].insert(rows[i]);

    std::unordered_map<int, std::unordered_map<int, double>> rowColValMap;
    for (size_t i = 0; i < rows.size(); i++) rowColValMap[rows[i]][cols[i]] = vals[i];

    std::map<int, int> rowLabelMapTrue;
    std::map<int, double> rowLabelMapPred;
    for (auto p = uniqueRows.begin(); p != uniqueRows.end(); ++p) {
        rowLabelMapTrue[*p] = classLabels[*p-1];
        rowLabelMapPred[*p] = 0;        // initial prediction y_hat^(0) = 0
    }

    Node* node;
    std::unordered_map<int, double> g, h;
    std::vector<DataFrame> treeDF;
    for (int i = 1; i <= nrounds; i++) {
        // Gradients/hessians of the logistic loss at the current predictions,
        // scaled by the learning rate for shrinkage.
        for (auto p = uniqueRows.begin(); p != uniqueRows.end(); ++p) {
            g[*p] = learningRate*grad((double)rowLabelMapTrue[*p], rowLabelMapPred[*p]);
            h[*p] = learningRate*learningRate*hessian((double)rowLabelMapTrue[*p], rowLabelMapPred[*p]);
        }
        node = constructTree(formattedData, colSortedVals, uniqueRows, g, h, depth, lambda, gamma);
        treeDF.push_back(transformTreeIntoDF(node));
        // Accumulate the latest tree's outputs: y_hat^(t) = y_hat^(t-1) + f_t(x).
        for (auto p = uniqueRows.begin(); p != uniqueRows.end(); ++p)
            rowLabelMapPred[*p] += treePredict(node, rowColValMap[*p]);
    }
    return treeDF;
}
```

The above function is the main driver for creating the gradient boosted tree model. It accepts the input training data in simple triplet matrix format (represented by the three vectors rows, cols and vals), the true class labels, and the parameters of the learning algorithm. It constructs the "colSortedVals" and "formattedData" data structures and initializes the values of "g" and "h" (gradient and hessian) from an empty model, i.e. with all predictions set to 0. We scale the gradient and hessian by a learning-rate parameter so that we do not overfit the model.

Then, in each round of the algorithm, it calls the "constructTree" function to construct a tree and uses the new tree to update the predictions on the entire training data via the "treePredict" function. With the updated predictions, the values of "g" and "h" are recomputed in the next round, and so on. In each round, we also transform the tree into a "DataFrame" structure native to R, so that the model can be exported back to an R method call. Since each round produces a different tree model, the driver returns a vector of DataFrames.


```cpp
Node* transformDFIntoTree(std::unordered_map<int, NodeDF> &nodeDFMap, const int &currIndex) {
    Node* node = new Node();
    node->featureDecisionVal = nodeDFMap[currIndex].decisionVal;
    node->featureIndex = nodeDFMap[currIndex].featureIndex;
    node->leafScore = nodeDFMap[currIndex].leafScore;
    if (nodeDFMap[currIndex].featureIndex == -1) {   // leaf row
        node->left = NULL;
        node->right = NULL;
    } else {
        node->left = transformDFIntoTree(nodeDFMap, nodeDFMap[currIndex].leftIndex);
        node->right = transformDFIntoTree(nodeDFMap, nodeDFMap[currIndex].rightIndex);
    }
    return node;
}
```

```cpp
double dfPredict(std::unordered_map<int, NodeDF> &nodeDFMap, std::unordered_map<int, double> &colValMap) {
    int index = 0;                 // start at the root row
    double output = -1;
    while (true) {
        if (nodeDFMap[index].featureIndex == -1) { output = nodeDFMap[index].leafScore; break; }
        if (colValMap[nodeDFMap[index].featureIndex] <= nodeDFMap[index].decisionVal)
            index = nodeDFMap[index].leftIndex;
        else
            index = nodeDFMap[index].rightIndex;
    }
    return output;
}
```

```cpp
// [[Rcpp::export]]
std::map<int, double> cpp__test(std::vector<int> rows, std::vector<int> cols,
                                std::vector<double> vals, DataFrame model) {
    std::map<int, double> out;
    std::vector<int> nodeIndex = model["NodeIndex"];
    std::vector<int> leftNodeIndex = model["LeftNodeIndex"];
    std::vector<int> rightNodeIndex = model["RightNodeIndex"];
    std::vector<int> featureIndex = model["FeatureIndex"];
    std::vector<double> featureDecisionVal = model["FeatureDecisionVal"];
    std::vector<double> leafScore = model["Score"];

    std::unordered_map<int, std::unordered_map<int, double>> rowColValMap;
    for (size_t i = 0; i < rows.size(); i++) rowColValMap[rows[i]][cols[i]] = vals[i];

    std::unordered_map<int, NodeDF> nodeDFMap;
    for (size_t i = 0; i < nodeIndex.size(); i++) {
        NodeDF df;
        df.index = nodeIndex[i];
        df.decisionVal = featureDecisionVal[i];
        df.featureIndex = featureIndex[i];
        df.leafScore = leafScore[i];
        df.leftIndex = leftNodeIndex[i];
        df.rightIndex = rightNodeIndex[i];
        nodeDFMap[nodeIndex[i]] = df;
    }
    // Alternatively: Node* node = transformDFIntoTree(nodeDFMap, nodeIndex[0]);
    for (auto p = rowColValMap.begin(); p != rowColValMap.end(); ++p)
        out[p->first] = dfPredict(nodeDFMap, p->second);
    return out;
}
```

The above function is used to do prediction on the testing data. The testing data is again accepted as a triplet of vectors (rows, cols, vals), along with the trained model (one tree at a time). There are two ways of doing the prediction: one is to reconstruct the tree from the model DataFrame by calling "transformDFIntoTree" and then calling the "treePredict" method, or else we can directly use the DataFrame for prediction by calling the "dfPredict" function. We prefer the latter method as it is more straightforward.

The training and testing data used is from the "mlbench" package in R, from which we have used the "Sonar" dataset for predicting "Rock" or "Mines". We randomly sample and divide the data into training (70%) and testing (30%) and use the AUC metric as the performance measure.

```r
library("mlbench")
library("slam")

data(Sonar)
SonarData <- Sonar[, 1:60]
class.labels <- ifelse(Sonar$Class == "R", 0, 1)

dtm <- as.simple_triplet_matrix(SonarData)
train.idx <- sample(1:nrow(SonarData), 0.7*nrow(SonarData))
test.idx <- seq(1, nrow(SonarData))[-train.idx]
train.dtm <- dtm[train.idx,]
test.dtm <- dtm[test.idx,]

models <- cpp__gbt(train.dtm$i, train.dtm$j, train.dtm$v,
                   classLabels = class.labels[train.idx],
                   depth = 10, nrounds = 5, lambda = 1, gamma = 0, learningRate = 0.001)

preds <- do.call(rbind, lapply(models, function(model)
    cpp__test(test.dtm$i, test.dtm$j, test.dtm$v, model)))
preds <- colSums(preds)
preds <- 1/(1+exp(-preds))
```

We train the model by creating 5 boosted trees with $\lambda = 1$, $\gamma = 0$ and a learning rate of 0.001. The depth of each tree is kept at 10, which is good enough for training data with 60 features. For each tree we predict on the entire testing data and then merge the predictions into a dataframe. The final prediction is taken to be the sum of the predictions from each tree model, passed through the sigmoid function.

```
> models[[1]]
   NodeIndex LeftNodeIndex RightNodeIndex FeatureIndex FeatureDecisionVal         Score
1          0             1              2           11             0.1970 -1.0000000000
2          1             3              4            4             0.0489 -1.0000000000
3          2             5              6           17             0.9382 -1.0000000000
4          3             7              8            1             0.0293 -1.0000000000
5          4             9             10           34             0.1615 -1.0000000000
6          5            11             12           30             0.1354 -1.0000000000
7          6            13             14           60             0.0078 -1.0000000000
8          7            15             16           28             0.9785 -1.0000000000
9          8            17             18           56             0.0073 -1.0000000000
10         9            -1             -1           -1            -1.0000 -0.0009999995
11        10            -1             -1           -1            -1.0000  0.0049999875
12        11            -1             -1           -1            -1.0000 -0.0014999989
13        12            19             20           39             0.0734 -1.0000000000
14        13            -1             -1           -1            -1.0000 -0.0034999939
15        14            -1             -1           -1            -1.0000  0.0004999999
16        15            -1             -1           -1            -1.0000 -0.0189998195
17        16            -1             -1           -1            -1.0000  0.0004999999
18        17            -1             -1           -1            -1.0000 -0.0004999999
19        18            -1             -1           -1            -1.0000  0.0014999989
20        19            -1             -1           -1            -1.0000 -0.0014999989
21        20            21             22            6             0.2431 -1.0000000000
22        21            23             24           12             0.0946 -1.0000000000
23        22            -1             -1           -1            -1.0000 -0.0009999995
24        23            25             26           60             0.0090 -1.0000000000
25        24            27             28            5             0.0067 -1.0000000000
26        25            -1             -1           -1            -1.0000 -0.0009999995
27        26            -1             -1           -1            -1.0000  0.0004999999
28        27            -1             -1           -1            -1.0000 -0.0004999999
29        28            29             30            6             0.0201 -1.0000000000
30        29            -1             -1           -1            -1.0000 -0.0004999999
31        30            31             32           41             0.6935 -1.0000000000
32        31            -1             -1           -1            -1.0000  0.0339994220
33        32            -1             -1           -1            -1.0000 -0.0004999999
```

The above is a snapshot of one of the tree models transformed into a DataFrame. All the rows with "FeatureIndex" equals to -1 are leaf nodes.

```r
library(ROCR)
pr <- prediction(preds, class.labels[test.idx])
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
```

Using the AUC metric to measure the performance of the classifier, we obtained an AUC value of 0.760. With a bit of parameter tweaking we can obtain better performance; for example, with nrounds = 10 and a learning rate of 0.01, we obtained an AUC value of 0.780 on the same data.

For comparison, we used the C5.0 decision tree package with the number of trials (adaptive boosting rounds) set to 5.

```r
library(C50)
model <- C5.0(as.matrix(train.dtm), as.factor(class.labels[train.idx]), trials = 5)
preds <- predict(model, as.matrix(test.dtm), type = "prob")[, "1"]
```

With the above adaptive tree boosting model, we obtained an AUC value of 0.754. Our method and the adaptive boosting method are thus comparable in performance on this dataset.

The theory is inspired by the following resources:

- https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf
- https://arxiv.org/pdf/1603.02754v3.pdf
- https://statweb.stanford.edu/~jhf/ftp/trebst.pdf
- https://xgboost.readthedocs.io/en/latest/
