# Building Gradient Boosted Trees from scratch

Gradient Boosted Trees are tree ensemble algorithms similar to Random Forests, but unlike Random Forests where trees are constructed independently from each other, in the gradient boosting algorithm, the tree in each round is boosted based on the errors from the tree in the previous round. Although there are more differences in how GBT reduces the error (bias + variance) compared to RF. In this post, we would be constructing boosted trees using the CART (Classification and Regression Trees) as the base weak learner, which we will use for a binary classification problem. Regression Tree Prediction. (Source: http://xgboost.readthedocs.io/en/latest/model.html)

In random forest, the final model comprises of uncorrelated trees, which independently predicts on each testing document and the final prediction is taken based on a majority vote, whereas in gradient boosted trees, each tree is constructed iteratively by optimizing an objective function of the tree from the previous round. The final prediction is taken to be the sum of prediction from each tree (regression tree), or majority vote (classification tree).

Given that a model is parameterized by $\Theta$, the objective function is defined to be :

$Obj(\Theta)=L(\Theta) + \Omega(\Theta)$

where $L(\Theta)$ is the loss function for the model and $\Omega(\Theta)$ is the regularization parameter. Our goal is to minimize the above objective by minimizing the loss (i.e. minimizing the error on the training data itself) and also keeping the model simple (i.e. generalize well to different datasets). Minimizing the objective helps us to achieve a low bias (minimizing loss, NO underfitting) and a low variance (simpler model, NO overfitting).

For example in linear regression, given that the true outputs for N training examples are $y_i$ and the outputs predicted by the regression model to be $\widehat{y}_i$, where,

$\widehat{y}_i={\theta}_0+\sum_{j=1}^d{\theta}_jx_j$

the squared loss error is given as :

$L=\sum_{i=1}^N(y_i-\widehat{y}_i)^2$

and the regularization term is a function of the model parameters $\theta_j$, generally considered to be either L1 or L2 regularization.

$\Omega_{L1}=\sum_{j=1}^d\lambda||\theta_j||$ and $\Omega_{L2}=\sum_{j=1}^d\lambda||\theta_j||^2$

where 'd' is the dimensionality of the input variable or in other words number of features in the model. Thus the objective to be minimized with L2 regularization for linear regression is given to be :

$\text{Obj}(\Theta)=\sum_{i=1}^N(y_i-\widehat{y}_i)^2+ \sum_{j=1}^d\lambda||\theta_j||^2$

Similarly in logistic regression, the objective would be given as :

$\text{Obj}(\Theta)=\sum_{i=1}^N-[{y_i}*log(p_i) + (1-y_i)*log(1-p_i)]+\sum_{j=1}^d\lambda||\theta_j||^2$

where $p_i=\frac{1}{1+e^{-\widehat{y}_i}}$

One can use optimization technique like stochastic gradient descent, in order to minimize the above objective functions. In SGD, we iteratively compute the values of the model parameters $\theta_j$ till convergence, which gives us the optimum objective :

$\theta^{k+1}_j=\theta^k_j-\eta*[\frac{\partial}{{\partial}\theta_j}Obj(\Theta)]_{\theta^k_j}$

To obtain an output from an ensemble of trees, we will assume that the trees are regression trees, so that the above loss functions (squared loss and logistic loss) can be directly applied here too.

Let there be an ensemble of K trees, and the output from each tree 't' be given by $f^{(t)}(x)$, where 'x' is the input data. The final combined output of all trees is then given as :

$\widehat{y}_i=\sum_{t=1}^Kf^{(t)}(x_i)$

In boosting, instead of growing K trees in parallel, we iteratively optimize the values of $\widehat{y}_i$ by adding a new function of regression tree in each round.

$\widehat{y}^{(0)}_i=f^{(0)}(x_i)=0$

$\widehat{y}^{(1)}_i=f^{(0)}(x_i)+f^{(1)}(x_i)=\widehat{y}^{(0)}_i+f^{(1)}(x_i)$

$\widehat{y}^{(2)}_i=f^{(0)}(x_i)+f^{(1)}(x_i)+f^{(2)}(x_i)=\widehat{y}^{(2)}_i+f^{(2)}(x_i)$

$\widehat{y}^{(t)}_i=f^{(0)}(x_i)+f^{(1)}(x_i)+...+f^{(t)}(x_i)=\widehat{y}^{(t-1)}_i+f^{(t)}(x_i)$

In each round, each new tree complements the already existing trees built in the previous round. This is what the essence of boosting is, i.e. to add a new classifier to an existing set of classifiers, so that the new classifier can better handle examples that were not handled correctly by the existing classifiers. Prediction Score from ensemble of CART. (Source : http://xgboost.readthedocs.io/en/latest/model.html)

In the above representation, the final score for each of the 5 test instance is computed as the sum of scores from each tree. For the classification of whether an instance "Likes Computer Games", the boy receives a sum of scores 2.9 from both tree whereas the old man gets a sum of scores -1.9, indicating that the boy "Likes Computer Games" whereas the old man "Does not Like Computer Games".

The new function of a regression tree is added in a greedy manner such that the combined prediction of $\widehat{y}^{(t-1)}+f^{(t)}(x)$ minimizes the objective function defined earlier. Thus we need to analyze which function $f^{(t)}(x)$ to choose in each round such that the overall objective is minimized.

In a regression tree, each leaf node stores a score $w_j$. Later we will see how we obtain these weighted scores. The function $f^{(t)}(x)$ of a regression tree is the vector of leaf scores corresponding to each instance. For example, a tree built using 5 training instances labelled [1, 2, 3, 4, 5] and with 3 leaf nodes with the scores [2, 0.1, -1] and the arrangement of the 5 instances in 3 leaves as [{2}, {4}, {1,3,5}], then the tree is given by the vector :

$f^{(t)}(x=[1,2,3,4,5])=[-1, 2, -1, 0.1, -1]$

because the instance 1 is in the 3rd leaf node with score -1, instance 2 is in the 1st leaf node with score 2 and so on. Thus we can write it as :

$f^{(t)}(x_i)=w_{q(x_i)}$

where the function $q(x_i)$ maps a training instance to the index of the leaf node where it belongs, i.e. $q(x_1)=3$, $q(x_2)=1$, $q(x_3)=3$ and so on.

The complexity of a tree $\Omega(f^{(t)})$ in the objective to minimize is defined to be :

$\Omega(f^{(t)})={\gamma}T+\frac{1}{2}{\lambda}\sum_{j=1}^Tw^2_j$

where T is the number of leaves in the tree $f^{(t)}$ and $w_j$ are the leaf scores defined above. This is the regularization term for the regression tree because high L2 norm of leaf scores and higher number of leaf nodes leads to possible overfitting (high variance) and hence we included this terms in our objective. The objective function for the regression tree in round 't' is thus :

$\text{Obj}^{(t)}=\sum_{i=1}^NL(y_i, \widehat{y}^{(t)}_i)+{\gamma}T+\frac{1}{2}{\lambda}\sum_{j=1}^Tw^2_j$

where $L(y_i,\widehat{y}^{(t)}_i)$ is the loss for the i-th training instance.

Substituting $\widehat{y}^{(t)}_i=\widehat{y}^{(t-1)}_i+f^{(t)}(x_i)$ in the above equation, we get

$\text{Obj}^{(t)}=\sum_{i=1}^NL(y_i, \widehat{y}^{(t-1)}_i+f^{(t)}(x_i))+{\gamma}T+\frac{1}{2}{\lambda}\sum_{j=1}^Tw^2_j$

Using the Taylor series expansion and by some approximation, we get the following :

$\text{Obj}^{(t)}=\sum_{i=1}^N[L(y_i, \widehat{y}^{(t-1)}_i)+g_if^{(t)}(x_i)+\frac{1}{2}h_i(f^{(t)}(x_i))^2]+{\gamma}T+\frac{1}{2}{\lambda}\sum_{j=1}^Tw^2_j$

where

$g_i=\frac{{\partial}L(y_i,\widehat{y}^{(t-1)})}{{\partial}\widehat{y}^{(t-1)}}$ and $h_i=\frac{{\partial}^2L(y_i,\widehat{y}^{(t-1)})}{{\partial}\widehat{y}^{(t-1)}}$

Since $L(y_i, \hat{y}^{(t-1)}_i)$ is a constant number, removing it from the objective, we obtain the final form of objective that needs to be minimized :

$\text{Obj}^{(t)}=\sum_{i=1}^N[g_if^{(t)}(x_i)+\frac{1}{2}h_i(f^{(t)}(x_i))^2]+{\gamma}T+\frac{1}{2}{\lambda}\sum_{j=1}^Tw^2_j$

From the above definition of $f^{(t)}(x_i)=w_{q(x_i)}$, on substitution we get :

$\text{Obj}^{(t)}=\sum_{i=1}^N[g_iw_{q(x_i)}+\frac{1}{2}h_iw^2_{q(x_i)}]+{\gamma}T+\frac{1}{2}{\lambda}\sum_{j=1}^Tw^2_j$

Grouping the term $\sum_{i=1}^N[g_iw_{q(x_i)}+\frac{1}{2}h_iw^2_{q(x_i)}]$ by the leaf nodes, we have :

$\text{Obj}^{(t)}=\sum_{j=1}^T[(\sum_{i{\epsilon}I_j}g_i)*w_j+\frac{1}{2}(\sum_{i{\epsilon}I_j}h_i)w^2_j]+{\gamma}T+\frac{1}{2}{\lambda}\sum_{j=1}^Tw^2_j$

$\text{Obj}^{(t)}=\sum_{j=1}^T[(\sum_{i{\epsilon}I_j}g_i)*w_j+\frac{1}{2}(\sum_{i{\epsilon}I_j}h_i+\lambda)*w^2_j]+{\gamma}T$

where $I_j=\left\{i|q(x_i)=j\right\}$

Defining

$G_j=\sum_{i{\epsilon}I_j}g_i$ and $H_j=\sum_{i{\epsilon}I_j}h_i$

$\text{Obj}^{(t)}=\sum_{j=1}^T[G_j*w_j+\frac{1}{2}(H_j+\lambda)*w^2_j]+{\gamma}T$

Differentiating w.r.t. $w_j$ and setting them to 0 to obtain the minimum value of the objective function, we get :

$G_j+(H_j+\lambda)*w_j=0$ or $w^*_j=-\frac{G_j}{H_j+\lambda}$

$w^*_j$ are the optimum values of $w_j$ for which the objective is minimized. The value of minimal objective at $w^*_j$ is :

$\text{Obj}^{(t)}_{min}=-\frac{1}{2}\sum_{j=1}^T\frac{G_j^2}{H_j+\lambda}+{\gamma}T$

Note that the optimum values for the leaf weights and the minimum objective derived above is assuming a fixed tree with T number of leaf nodes. But our goal is to construct this tree in such a way that minimizes the objective function. Thus by a brute-force heuristic, one can enumerate all possible trees and for each possible tree, compute the optimal leaf weights and the minimum objective defined above. Then choose the tree with the least minimum objective. Another fast but sub-optimal method is to construct the tree greedily.

Its a greedy method because, each left and right sub-tree is constructed by minimizing the above mentioned objective function corresponding to that sub-tree.

Step 1 : For each leaf node, choose the best split. To choose the best split, for each feature $d_k$ from the training examples in the current node, sort the examples based on $d_k$.

Step 2 : For each unique value 'a' of the feature $d_k$, divide up the examples into two parts, examples with $d_k<=a$ and examples with $d_k>a$. Split based on the feature "Age" of the instances. (Source: http://xgboost.readthedocs.io/en/latest/model.html)

Step 3 : The examples in the two sub-parts constitutes the two new leaf nodes. For example, if the left leaf node contains the example [1,4] and the right leaf node contains the examples [2,3,5], then we define :

$G_L=g_1+g_4$ and $G_R=g_2+g_3+g_5$

$H_L=h_1+h_4$ and $H_R=h_2+h_3+h_5$

where $g_i$ and $h_i$ are the first order and second order gradients as defined above. The change in objective due to splitting of one leaf node into two leaf nodes is the gain :

$\text{Gain}=\frac{1}{2}[\frac{G^2_L}{H_L+\lambda}+\frac{G^2_R}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}]-\gamma$

The $-\gamma$ term is loss due due to one additional leaf node.

Step 4 : For the feature $d_k$, decide the split $a^*$ with the maximum gain.

Step 5 : Choose the feature and the split value for the current leaf node which gives the maximum gain among all features.

Step 6 : Recursively continue to split the leaf nodes at each level until we have reached either a maximum specified depth of the tree or any split leads to a zero or negative gain.

Step 7 : Stopping splits at negative gains, can prevent further splits with positive gains down the depths of the tree, hence instead of pre-stopping due to negative gain, prune the leaf nodes with negative gains after maximum depth.

Step 8 : The scores at each leaf node 'j' is given by the formula calculated above :

$w^*_j=-\frac{G_j}{H_j+\lambda}$

where $G_j$ and $H_j$ are decided for the leaf node j on the best split.

struct Node {
int featureIndex;
double featureDecisionVal, leafScore;

Node* left;
Node* right;
};

We defined a C++ structure "Node" to represent the classification tree we are going to construct. Each node of the tree has the "featureIndex" which is used to split the current node into 2 nodes. If the current node is a leaf node, then featureIndex is -1, to identify that there is no further splitting. The value of the feature, "featureIndex" at which we split the current node is given by the attribute "featureDecisionVal", i.e. if the value of the feature "featureIndex" for any given example is less than equals to "featureDecisionVal", then we go towards the left subtree else we go towards the right subtree. The attribute "leafScore" denotes the scores in the leaves of the tree, i.e. $w^*_j$ in our earlier representations.

double grad(const double &trueVal, const double &predVal) {
double prob = 1/(1+exp(-predVal));

return prob-trueVal;
}

double hessian(const double &trueVal, const double &predVal) {
double prob = 1/(1+exp(-predVal));

return prob*(1-prob);
}

Next we define the gradients $g_i$ and the Hessians $h_i$ for the logistic loss objective. In case of logistic loss, the loss is given as :

$L(y_i,\widehat{y}_i) = -[{y_i}*log(\frac{1}{1+e^{-\widehat{y}_i}}) + (1-y_i)*log(\frac{e^{-\widehat{y}_i}}{1+e^{-\widehat{y}_i}})]$

The gradient of the above loss w.r.t. $\widehat{y}$, i.e.

$\frac{{\partial}L(y_i,\widehat{y})}{{\partial}\widehat{y}}=p_i-y_i$, where $p_i=\frac{1}{1+e^{-\widehat{y}_i}}$

and similarly, the second derivative w.r.t. $\hat{y}$, i.e.

$\frac{{\partial}^2L(y_i,\hat{y})}{{\partial}\widehat{y}}=p_i*(1-p_i)$

Node* getLeafNode(const std::set<int> &rows,
std::unordered_map<int, double> &g, std::unordered_map<int, double> &h,
const double &lambda) {

Node *node = new Node();

node->featureIndex = -1;
node->featureDecisionVal = -1;

double G=0, H=0;

for (auto p = rows.begin(); p != rows.end(); ++p) {G+=g[*p]; H+=h[*p];}

node->leafScore = -G/(H+lambda);

node->left = NULL;
node->right = NULL;

return node;
}

The above function assigns the attributes for leaf nodes.The parameter "rows" denotes the examples from the training data which is falling into this leaf. "g" and "h" are pre-computed for each row of the training data.

In our approach to construct the binary classifier, we will assume that we are working with simple triplet matrices instead of full dense matrices. The advantage with simple triplet matrices is that it accounts for sparsity and thus makes the size of the training and testing matrices much smaller in size compared to full dense matrices. But the disadvantage is that we need to separately handle the missing values.

typedef std::unordered_map<int, std::unordered_map<double, std::set<int>>> DataFormat;
Node* constructTree(DataFormat &formattedData,
std::unordered_map<int, std::set<double>> &colSortedVals,
const std::set<int> &rows,
std::unordered_map<int, double> &g, std::unordered_map<int, double> &h,
const int &depth, const double &lambda, const double &gamma) {

Node *node = new Node();

if (depth == 0) node = getLeafNode(rows, g, h, lambda);

else {

double maxFeatureGain = std::numeric_limits<double>::min();
double featureDecisionVal = -1;
int featureIdx = -1;

double sumG=0, sumH=0;

for (auto p = rows.begin(); p != rows.end(); ++p) {sumG+=g[*p]; sumH+=h[*p];}

std::set<int> leftRows, rightRows;

DataFormat newFormattedData;

for (auto p = colSortedVals.begin(); p != colSortedVals.end(); ++p) {
int feature = p->first;
std::set<double> sortedVals = p->second;

for (auto q = sortedVals.begin(); q != sortedVals.end(); ++q) {
std::set<int> valRows = formattedData[feature][*q];

std::set_intersection(rows.begin(), rows.end(), valRows.begin(), valRows.end(),
std::inserter(newFormattedData[feature][*q], newFormattedData[feature][*q].end()));
}
}

std::unordered_map<int, std::set<int>> sparseMissingRows;

for (auto p = colSortedVals.begin(); p != colSortedVals.end(); ++p) {

int feature = p->first;
std::set<double> sortedVals = p->second;

std::set<int> thisRows;

for (auto q = sortedVals.begin(); q != sortedVals.end(); ++q) {
std::set<int> valRows = newFormattedData[feature][*q];
thisRows.insert(valRows.begin(), valRows.end());
}

std::set_difference(rows.begin(), rows.end(), thisRows.begin(), thisRows.end(),
std::inserter(sparseMissingRows[feature], sparseMissingRows[feature].end()));

std::unordered_map<double, double> sumGL, sumHL;

double prevG = 0, prevH = 0;

for (auto q = sortedVals.begin(); q != sortedVals.end(); ++q) {
std::set<int> valRows = newFormattedData[feature][*q];

double sum1 = 0, sum2 = 0;

for (auto r = valRows.begin(); r != valRows.end(); ++r) {
sum1 += g[*r];
sum2 += h[*r];
}

if (q == sortedVals.begin()) {
for (auto r = sparseMissingRows[feature].begin(); r != sparseMissingRows[feature].end(); ++r) {
sum1 += g[*r];
sum2 += h[*r];
}
}

sumGL[*q] = prevG + sum1;
sumHL[*q] = prevH + sum2;

prevG = sumGL[*q];
prevH = sumHL[*q];
}

double maxGain = std::numeric_limits<double>::min();
double decisionVal = -1;

for (auto q = sortedVals.begin(); q != sortedVals.end(); ++q) {

double G_L = sumGL[*q];
double H_L = sumHL[*q];
double G_R = sumG-sumGL[*q];
double H_R = sumH-sumHL[*q];

double gain = 0.5*((G_L*G_L)/(H_L+lambda) + (G_R*G_R)/(H_R+lambda) - (sumG*sumG)/(sumH+lambda)) - gamma;

if (gain > maxGain) {
maxGain = gain;
decisionVal = *q;
}
}

if (maxGain > maxFeatureGain && decisionVal < *sortedVals.rbegin()) {
maxFeatureGain = maxGain;
featureDecisionVal = decisionVal;
featureIdx = feature;
}
}

if (featureIdx != -1) {

std::set<double> sortedVals = colSortedVals[featureIdx];
leftRows.insert(sparseMissingRows[featureIdx].begin(), sparseMissingRows[featureIdx].end());

for (auto q = sortedVals.begin(); q != sortedVals.end(); ++q) {
if (*q <= featureDecisionVal) leftRows.insert(newFormattedData[featureIdx][*q].begin(), newFormattedData[featureIdx][*q].end());

else rightRows.insert(newFormattedData[featureIdx][*q].begin(), newFormattedData[featureIdx][*q].end());
}

node->featureIndex = featureIdx;
node->featureDecisionVal = featureDecisionVal;

node->leafScore = -1;

node->left = constructTree(formattedData, colSortedVals, leftRows, g, h, depth-1, lambda, gamma);
node->right = constructTree(formattedData, colSortedVals, rightRows, g, h, depth-1, lambda, gamma);
}

else node = getLeafNode(rows, g, h, lambda);
}

return node;
}

The above recursive function is used to construct the tree level by level. The exit condition for the recursion is that we have reached the maximum specified depth of the tree, at which we construct a leaf node with all the remaining training rows.

The input parameter "formattedData" is a two-level map containing the set of training rows for each unique feature and unique value combination, i.e. if the feature index 1, with value of 0.87 is contained in training rows 3, 6 and 7, then the map is

formattedData[0.87]={3, 6, 7}

The parameter "colSortedVals" is a map containing the set of sorted values for each unique feature. The parameter "rows" denotes the training rows that needs to be considered only for the current node. If depth is non-zero, then we first create an alternative "newFormattedData" to subset only the training rows defined in the parameter "rows". The variable "sparseMissingRows" is a map that contains the set of all rows having missing values for a particular feature, i.e. if the feature with index 1 is not present in the rows 5, 8, and 10, then :

sparseMissingRows={5, 8, 10}

By default, we assume that these missing rows have the minimum feature value from all unique values for this particular feature. Next, for each feature X, we use the sorted values map to compute the value of feature X, for which the gain on splitting the rows is maximum. Repeat the same with all features and find out the feature index and the value for which the gain is maximum. Split the node at this feature index and value and recursively construct the left and right subtrees.

double treePredict(Node* &node, std::unordered_map<int, double> &colValMap) {
if (node->featureIndex == -1) return node->leafScore;
else {
int decisionFeature = node->featureIndex;
double decisionVal = node->featureDecisionVal;

if (colValMap[decisionFeature] <= decisionVal) return treePredict(node->left, colValMap);
else return treePredict(node->right, colValMap);
}
}

The above "treePredict" function accepts the constructed tree and an example data as an input. The example is in the sparse data format and it is a map from feature to value. The function returns the score of the leaf where the decision path ends.

struct NodeDF {
int index, leftIndex, rightIndex, featureIndex;
double decisionVal, leafScore;
};
DataFrame transformTreeIntoDF(Node* &node) {

std::vector<int> nodeIndex, leftNodeIndex, rightNodeIndex, featureIndex;
std::vector<double> featureDecisionVal, leafScore;

std::queue<Node*> nodeQ;
nodeQ.push(node);

int index = 0;
int lastIndex = 0;

while (!nodeQ.empty()) {
Node* n = nodeQ.front();

nodeIndex.push_back(index);
featureIndex.push_back(n->featureIndex);
featureDecisionVal.push_back(n->featureDecisionVal);
leafScore.push_back(n->leafScore);

if (n->left != NULL) {
lastIndex++;
leftNodeIndex.push_back(lastIndex);
nodeQ.push(n->left);
}
else leftNodeIndex.push_back(-1);

if (n->right != NULL) {
lastIndex++;
rightNodeIndex.push_back(lastIndex);
nodeQ.push(n->right);
}
else rightNodeIndex.push_back(-1);

nodeQ.pop();
index++;
}

return DataFrame::create(_["NodeIndex"]=nodeIndex, _["LeftNodeIndex"]=leftNodeIndex,
_["RightNodeIndex"]=rightNodeIndex, _["FeatureIndex"]=featureIndex,
_["FeatureDecisionVal"]=featureDecisionVal, _["Score"]=leafScore);
}
// [[Rcpp::export]]
std::vector<DataFrame> cpp__gbt(std::vector<int> rows, std::vector<int> cols, std::vector<double> vals,
std::vector<double> classLabels,
int depth = 5, int nrounds = 10, double lambda = 1,
double gamma = 0, double learningRate = 0.1) {

std::unordered_map<int, std::set<double>> colSortedVals;
DataFormat formattedData;

std::set<int> uniqueRows(rows.begin(), rows.end());

for (size_t i = 0; i < rows.size(); i++) colSortedVals[cols[i]].insert(vals[i]);
for (size_t i = 0; i < rows.size(); i++) formattedData[cols[i]][vals[i]].insert(rows[i]);

std::unordered_map<int, std::unordered_map<int, double>> rowColValMap;

for (size_t i = 0; i < rows.size(); i++) rowColValMap[rows[i]][cols[i]] = vals[i];

std::map<int, int> rowLabelMapTrue;
std::map<int, double> rowLabelMapPred;

for (auto p = uniqueRows.begin(); p != uniqueRows.end(); ++p) {
rowLabelMapTrue[*p] = classLabels[*p-1];
rowLabelMapPred[*p] = 0;
}

Node* node;

std::unordered_map<int, double> g, h;
std::vector<double> costs;

std::vector<DataFrame> treeDF;

for (int i = 1; i <= nrounds; i++) {

node = new Node();

for (auto p = uniqueRows.begin(); p != uniqueRows.end(); ++p) {
g[*p]=learningRate*grad((double)rowLabelMapTrue[*p], rowLabelMapPred[*p]);
h[*p]=learningRate*learningRate*hessian((double)rowLabelMapTrue[*p], rowLabelMapPred[*p]);
}

node = constructTree(formattedData, colSortedVals, uniqueRows, g, h, depth, lambda, gamma);
treeDF.push_back(transformTreeIntoDF(node));

for (auto p = uniqueRows.begin(); p != uniqueRows.end(); ++p) rowLabelMapPred[*p] = treePredict(node, rowColValMap[*p]);
}

return treeDF;
}

The above function is the main driver function for creating the gradient boosted tree model. The function accepts the input training data in simple triplet matrix format (represented by 3 vectors, rows, cols and vals) and the true class labels and the parameters of the learning algorithm. It constructs the "colSortedVals" and "formattedData" data structures and initializes the values of "g" and "h" (gradient and hessian) using an empty tree, i.e. all predictions set to 0. We use a learning rate parameter to compute the gradient and hessian so that we do not overfit the models.

Then for each round of the algorithm, it calls the "constructTree" function to construct a tree and then uses the latest tree model to do prediction on the entire training data by calling the "treePredict" function. With the latest predictions, the values of "g" and "h" are updated in the next round and so on. In each round, we transform the tree to a "DataFrame" structure native to R so that the model can be exported back to a R method call. For each round we have a different tree model, hence we return a vector of DataFrames from the driver function.

struct NodeDF {
int index, leftIndex, rightIndex, featureIndex;
double decisionVal, leafScore;
};
Node* transformDFIntoTree(std::unordered_map<int, NodeDF> &nodeDFMap,
const int &currIndex) {

Node* node = new Node();

node->featureDecisionVal = nodeDFMap[currIndex].decisionVal;
node->featureIndex = nodeDFMap[currIndex].featureIndex;
node->leafScore = nodeDFMap[currIndex].leafScore;

if (nodeDFMap[currIndex].featureIndex == -1) {
node->left = NULL;
node->right = NULL;
}
else {
node->left = transformDFIntoTree(nodeDFMap, nodeDFMap[currIndex].leftIndex);
node->right = transformDFIntoTree(nodeDFMap, nodeDFMap[currIndex].rightIndex);
}

return node;
}
double dfPredict(std::unordered_map<int, NodeDF> &nodeDFMap,
std::unordered_map<int, double> &colValMap) {

int index = 0;
double output = -1;

while(true) {
if (nodeDFMap[index].featureIndex == -1) {
output = nodeDFMap[index].leafScore;
break;
}
else {
if (colValMap[nodeDFMap[index].featureIndex] <= nodeDFMap[index].decisionVal) index = nodeDFMap[index].leftIndex;
else index = nodeDFMap[index].rightIndex;
}
}

return output;
}
// [[Rcpp::export]]
std::map<int, double> cpp__test(std::vector<int> rows, std::vector<int> cols, std::vector<double> vals,
DataFrame model) {

std::map<int, double> out;

std::vector<int> nodeIndex = model["NodeIndex"];
std::vector<int> leftNodeIndex = model["LeftNodeIndex"];
std::vector<int> rightNodeIndex = model["RightNodeIndex"];
std::vector<int> featureIndex = model["FeatureIndex"];

std::vector<double> featureDecisionVal = model["FeatureDecisionVal"];
std::vector<double> leafScore = model["Score"];

std::unordered_map<int, std::unordered_map<int, double>> rowColValMap;

for (size_t i = 0; i < rows.size(); i++) rowColValMap[rows[i]][cols[i]] = vals[i];

std::unordered_map<int, NodeDF> nodeDFMap;

for (size_t i = 0; i < nodeIndex.size(); i++) {
NodeDF df;

df.index = nodeIndex[i];
df.decisionVal = featureDecisionVal[i];
df.featureIndex = featureIndex[i];
df.leafScore = leafScore[i];
df.leftIndex = leftNodeIndex[i];
df.rightIndex = rightNodeIndex[i];

nodeDFMap[nodeIndex[i]] = df;
}

//Node* node = transformDFIntoTree(nodeDFMap, nodeIndex);

for (auto p = rowColValMap.begin(); p != rowColValMap.end(); ++p) out[p->first] = dfPredict(nodeDFMap, p->second);

return out;
}

The above function is used to do prediction on testing data. The testing data is again accepted as triplet of vectors (rows, cols, vals) and we pass the trained model (one tree at a time). There are two ways of doing the prediction, one way is to construct the tree back from the model DataFrame by calling the function "transformDFIntoTree" and then calling the "treePredict" method or else we can directly use the DataFrame to predict by calling the "dfPredict" function. We prefer the later method of doing prediction as it is more straightforward.

The training and testing data used is from the "mlbench" package in R, from which we have used the "Sonar" dataset for predicting "Rock" or "Mines". We randomly sample and divide the data into training (70%) and testing (30%) and use the AUC metric as the performance measure.

library("mlbench")
library("slam")

data(Sonar)

SonarData <- Sonar[,1:60]

class.labels <- ifelse(Sonar$Class == "R", 0, 1) dtm <- as.simple_triplet_matrix(SonarData) train.idx <- sample(1:nrow(SonarData), 0.7*nrow(SonarData)) test.idx <- seq(1, nrow(SonarData))[-train.idx] train.dtm <- dtm[train.idx,] test.dtm <- dtm[test.idx,] models <- cpp__gbt(train.dtm$i, train.dtm$j, train.dtm$v, classLabels = class.labels[train.idx],
depth =10, nrounds = 5, lambda = 1, gamma = 0, learningRate = 0.001)

preds <- do.call(rbind, lapply(models, function(model) cpp__test(test.dtm$i, test.dtm$j, test.dtm\$v, model))

preds <- colSums(preds)

preds <- 1/(1+exp(-preds))


We train the model by creating 5 boosted trees with $\lambda=1$ and $\gamma=0$ and learning rate of 0.001. The depth of each tree is kept at 10, which is good enough for a training data with 60 features. For each tree we do a prediction on the entire testing data and then merge back the predictions into a dataframe. The final prediction is taken to be  the sum of predictions from each tree model.

> models[]
NodeIndex LeftNodeIndex RightNodeIndex FeatureIndex FeatureDecisionVal         Score
1          0             1              2           11             0.1970 -1.0000000000
2          1             3              4            4             0.0489 -1.0000000000
3          2             5              6           17             0.9382 -1.0000000000
4          3             7              8            1             0.0293 -1.0000000000
5          4             9             10           34             0.1615 -1.0000000000
6          5            11             12           30             0.1354 -1.0000000000
7          6            13             14           60             0.0078 -1.0000000000
8          7            15             16           28             0.9785 -1.0000000000
9          8            17             18           56             0.0073 -1.0000000000
10         9            -1             -1           -1            -1.0000 -0.0009999995
11        10            -1             -1           -1            -1.0000  0.0049999875
12        11            -1             -1           -1            -1.0000 -0.0014999989
13        12            19             20           39             0.0734 -1.0000000000
14        13            -1             -1           -1            -1.0000 -0.0034999939
15        14            -1             -1           -1            -1.0000  0.0004999999
16        15            -1             -1           -1            -1.0000 -0.0189998195
17        16            -1             -1           -1            -1.0000  0.0004999999
18        17            -1             -1           -1            -1.0000 -0.0004999999
19        18            -1             -1           -1            -1.0000  0.0014999989
20        19            -1             -1           -1            -1.0000 -0.0014999989
21        20            21             22            6             0.2431 -1.0000000000
22        21            23             24           12             0.0946 -1.0000000000
23        22            -1             -1           -1            -1.0000 -0.0009999995
24        23            25             26           60             0.0090 -1.0000000000
25        24            27             28            5             0.0067 -1.0000000000
26        25            -1             -1           -1            -1.0000 -0.0009999995
27        26            -1             -1           -1            -1.0000  0.0004999999
28        27            -1             -1           -1            -1.0000 -0.0004999999
29        28            29             30            6             0.0201 -1.0000000000
30        29            -1             -1           -1            -1.0000 -0.0004999999
31        30            31             32           41             0.6935 -1.0000000000
32        31            -1             -1           -1            -1.0000  0.0339994220
33        32            -1             -1           -1            -1.0000 -0.0004999999
>

The above is a snapshot of one of the tree models transformed into a DataFrame. All the rows with "FeatureIndex" equals to -1 are leaf nodes.

library(ROCR)
pr <- prediction(preds, class.labels[test.idx])
prf <- performance(pr, measure = "tpr", x.measure = "fpr")

auc <- performance(pr, measure = "auc")
auc <- auc@y.values[]
auc

Using the above "AUC" metric to measure the performance of the classifier, we obtained an AUC value of 0.760. With a bit of tweaking of the parameters we can obtain better performance, for example with nrounds=10 and learning rate of 0.01, we obtained an AUC value of 0.780 with the above data.

To compare, we have used the C5.0 decision tree package with number of trials (or adaptive boosting) set to 5.

library(C50)
model <- C5.0(as.matrix(train.dtm), as.factor(class.labels[train.idx]), trials = 5)

preds <- predict(model, as.matrix(test.dtm), type = "prob")[,"1"]


With the above adapative tree bosting model, we obtained an AUC value of 0.754. Both the above method and the adaptive boosting method is comparable in performance with the given dataset.

Theory is inspired from the following resources :

Categories: MACHINE LEARNING, PROBLEM SOLVING