Text Classification with Adaboost

Boosting is a general technique by which multiple "weak" classifiers are combined to produce a single "strong" classifier. The idea behind boosting is very simple: the final classifier is built incrementally from an ensemble of classifiers, in such a way that each newly added classifier performs better on the training instances that the current ensemble handles poorly.

A "weak" learner is a classifier that performs better than random guessing, i.e. its accuracy is greater than \frac{1}{C}, where C is the number of classes (for binary classification, this means an error rate of less than 0.5). Adaboost is one such boosting algorithm, which "greedily" builds up the final classifier from a pool of weak classifiers. It assigns a weight to each weak classifier, with more accurate classifiers receiving higher weights. More importantly, in each iteration the Adaboost algorithm also assigns a weight to each training example: examples that are hard to classify receive higher weights. To understand the concept intuitively, let's take an interesting example.

In an Indian village, several people claim that they can predict whether it will rain today or not. They base their predictions on the past few days' weather, but not all of them are equally good. The village panchayat has decided to select only 3 rain oracles such that their combined prediction power is the most accurate. Selecting the best rain oracle is an easy task, as he is the one with the most accurate predictions. Now, the days on which the best oracle failed to predict correctly must have had really "unpredictable" weather, and hence these days become much more important than any other day. Thus any oracle who correctly predicts the unpredictable days gains an advantage over the other oracles. Hence the next best rain oracle, the one who correctly predicts more of these "unpredictable" days, is chosen by the panchayat. The "unpredictable" days are then updated based on the combined predictive power of the two selected oracles; there might be some other oracle who did well on the remaining days. The next best oracle is chosen based on how accurate he was on these remaining "unpredictable" days, and so on.

From the above example it becomes clear that the objective is not to select the top K classifiers based on their individual error rates on the training examples, but to select the K classifiers which in combination give the best classification accuracy. The difference is that the top classifiers by individual error rate may be highly correlated, i.e. many of them predict correctly or incorrectly on the same examples; in that case combining them is not really helpful, as no additional advantage is gained from the combination.

Let's demonstrate the Ada-boost algorithm with a pool of M "weak" classifiers  T=\left\{C_1, C_2, ..., C_M\right\} . Each classifier is independently trained on the N training examples with a different set of hyper-parameters. The Ada-boost algorithm is as follows (binary classification):

Given the labelled training samples  \left\{(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)\right\} , where the labels  y_i \in \left\{-1, +1\right\} , the objective is to select the best 'm' classifiers so as to minimize the error rate on the N training examples.

Let the vector H={} store the list of final 'm' classifiers selected and W={} the weights of the corresponding selected classifiers in H.

Initially each training example is given equal weight  D^{(1)}_i=\frac{1}{N} . Each of the "weak" classifiers is trained on the N labelled examples and then used to classify those same N examples. Generate an NxM matrix X, where the entry  X_{ij}=0 implies that example 'i' was classified correctly by the 'j'-th classifier, else  X_{ij}=1 .
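As a concrete illustration, here is a minimal R sketch of this setup, assuming a hypothetical list pool of M already-fitted classifiers (each usable with predict and, like C5.0, returning factor predictions with type = "class"), a feature data frame train.x with the N training examples and a label vector train.y taking values in {-1, +1}:

# minimal sketch, assuming hypothetical objects `pool`, `train.x`, `train.y`
N <- nrow(train.x)
M <- length(pool)

D <- rep(1/N, N)   # initial example weights D^(1)

# X[i, j] = 0 if classifier j predicts example i correctly, 1 otherwise
X <- sapply(pool, function(clf) {
  pred <- as.numeric(as.character(predict(clf, train.x, type = "class")))
  as.numeric(pred != train.y)
})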

For t = 1 to m

a. Compute  \epsilon=X^TD^{(t)} , which gives the weighted error rate of each classifier.

b. Based on the weighted error rates of the classifiers in step a., select the one with the least error rate. Let this classifier be denoted  h_t . Add  h_t to H.

c. Denoting the weighted error rate of the classifier selected in step b. by  \epsilon_t , choose:

 \alpha_t = \frac{1}{2}*\ln(\frac{1-\epsilon_t}{\epsilon_t}) ,

which is the weight of the classifier  h_t . Add  \alpha_t to W. Since our weak-learner assumption is that each \epsilon_t < 0.5, it follows that \alpha_t > 0. The choice of the weights \alpha_t for the classifiers is not arbitrary; it comes from minimizing the exponential loss function (a short derivation is sketched after step d. below).

d. Now update the weights of each of the N training examples. If an example was correctly classified by  h_t from step b., then

D^{(t+1)}_i=\frac{D^{(t)}_i}{Z_t}*e^{-\alpha_t}

else if the example was mis-classified by  h_t then

D^{(t+1)}_i=\frac{D^{(t)}_i}{Z_t}*e^{\alpha_t},

where Z_t is the normalization constant, chosen so that the updated weights  D^{(t+1)}_i sum to 1.
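The choice of \alpha_t (and the value of Z_t, which reappears in the error bound below) can be sketched quickly. Each weight gets multiplied by  e^{-\alpha y_i h_t(x_i)} , so \alpha_t is chosen to minimize the resulting normalizer, i.e. the weighted exponential loss:

Z_t(\alpha) = \sum_{i=1}^N D^{(t)}_i*e^{-\alpha y_i h_t(x_i)} = (1-\epsilon_t)*e^{-\alpha} + \epsilon_t*e^{\alpha}

Setting the derivative with respect to \alpha to zero gives

\epsilon_t*e^{\alpha} = (1-\epsilon_t)*e^{-\alpha} \Rightarrow \alpha_t = \frac{1}{2}*\ln(\frac{1-\epsilon_t}{\epsilon_t}),

and plugging \alpha_t back in yields  Z_t = 2\sqrt{\epsilon_t*(1-\epsilon_t)} .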

As you can see, the example weights are updated such that "hard"-to-classify examples are given more weight (multiplied by  e^{\alpha_t} ) compared to easily classifiable instances (multiplied by  e^{-\alpha_t} ). This ensures that in the subsequent iterations, classifiers which do better on the "hard" examples from the previous step are preferred for inclusion in the final ensemble. But how does one classify an example using the final classifier? The final classifier is of the form:

 F=sgn(\sum_{i=1}^m\alpha_i*h_i)

i.e. for any example  x_j , applying the function F to x_j gives

F(x_j)=sgn(\sum_{i=1}^m\alpha_i*h_i(x_j)).

This is +1 whenever the sum inside is positive, else it is -1. Similarly, to get the classification confidence of the Ada-boost classifier, we just take the sum of the weighted confidences of the individual classifiers  h_t . The training error rate of the final classifier can be shown to be upper bounded by:

\prod_{t=1}^{m}2\sqrt{\epsilon_t*(1-\epsilon_t)}

where \epsilon_t is the weighted error rate of the classifier h_t computed in step a. of the Ada-boost algorithm, and the product runs over the m boosting rounds.
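Putting steps a. to d. together, here is a minimal R sketch of the pool-based selection described above. It continues from the hypothetical pool, matrix X, weights D and labels train.y introduced earlier; m is the number of classifiers to select, and for simplicity the sketch allows a classifier to be picked more than once:

m <- 3
H <- c()   # indices of the selected classifiers
W <- c()   # their weights alpha_t

for (t in 1:m) {
  eps <- as.numeric(t(X) %*% D)           # step a: weighted error rate of every classifier
  best <- which.min(eps)                  # step b: classifier with the least weighted error
  eps.t <- eps[best]
  alpha.t <- 0.5*log((1-eps.t)/eps.t)     # step c: classifier weight
  
  H <- c(H, best)
  W <- c(W, alpha.t)
  
  # step d: X[, best] is 1 for misclassified examples and 0 for correct ones
  D <- D*exp(alpha.t*(2*X[, best] - 1))
  D <- D/sum(D)                           # normalize by Z_t
}

The selected indices H together with the weights W then define the final classifier  F=sgn(\sum_{t=1}^m W_t*C_{H_t}) .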

In order to demonstrate the Ada-boost algorithm with code, I have used the Titanic dataset from the Kaggle website. The task is to predict which passengers survived and which did not. It is a binary classification problem that can be easily solved using a decision tree. We will use the C5.0 algorithm from the C50 package in R as the base classifier (weak learner). To begin with, we will just use the dataset without any boosting and compute the prediction accuracy for the survivors.

titanic.data <- read.csv("titanic.csv", header = T)
titanic.data <- titanic.data[!is.na(titanic.data$survived),]

titanic.data$age[is.na(titanic.data$age)] <- mean(titanic.data$age[!is.na(titanic.data$age)])
titanic.data$fare[is.na(titanic.data$fare)] <- mean(titanic.data$fare[!is.na(titanic.data$fare)])
titanic.data$sex <- ifelse(titanic.data$sex == "male", 1, 0)

titanic.data$survived <- ifelse(titanic.data$survived == 1, 1, -1)

We first read the dataset from the .csv file, drop the rows with a missing "survived" label, substitute the missing values in the age and fare columns with the mean of the remaining values, and then encode the "sex" column as 1 (for male) and 0 (for female). We use the same notation for the class labels as in the theory of ada-boost above, i.e. we denote the binary class labels as -1 or +1.

library(C50)
fit <- C5.0.default(x=titanic.data[,c("pclass", "sex", "age", "sibsp", "parch", "fare")], y=as.factor(titanic.data$survived))
predictions <- predict(fit, titanic.data, type = "class")
sum(predictions == titanic.data$survived)/nrow(titanic.data)

Note that we have not used all the columns of the dataset as predictors, but selected only the 6 columns that we deem important for predicting survival. With the "C5.0.default" function, the default number of trials is 1, i.e. no boosting is done. We obtain a prediction accuracy of 81.7% on the training dataset.
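For comparison, the C5.0 function can also perform boosting internally through its trials argument; a quick sketch with 5 trials (the resulting training accuracy may differ from the manual boosting loop implemented below):

fit.boosted <- C5.0.default(x=titanic.data[,c("pclass", "sex", "age", "sibsp", "parch", "fare")], 
                            y=as.factor(titanic.data$survived), trials = 5)
predictions.boosted <- predict(fit.boosted, titanic.data, type = "class")
sum(predictions.boosted == titanic.data$survived)/nrow(titanic.data)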

classifier.weights <- c()
instance.weights <- rep(1/nrow(titanic.data), nrow(titanic.data))

preds <- list()

for(i in 1:5) {
  # train this round's weak learner with the current instance weights
  fit <- C5.0.default(x=titanic.data[,c("pclass", "sex", "age", "sibsp", "parch", "fare")], 
                      y=as.factor(titanic.data$survived), 
                      trials = 1, weights = instance.weights)
  
  predictions <- predict(fit, titanic.data, type = "class")
  preds[[i]] <- as.numeric(as.character(predictions))   # store numeric -1/+1 labels, not factor codes
  
  # weighted error rate and weight of this round's classifier
  iscorrect <- as.numeric(predictions == titanic.data$survived)
  error.rate <- 1-sum(instance.weights*iscorrect)
  cl.weight <- 0.5*log((1-error.rate)/error.rate)
  
  classifier.weights <- c(classifier.weights, cl.weight)
  
  # up-weight misclassified examples, down-weight correct ones, then normalize
  instance.weights <- ifelse(iscorrect == 1, instance.weights*exp(-cl.weight), instance.weights*exp(cl.weight))
  instance.weights <- instance.weights/sum(instance.weights)
}

out <- do.call(rbind, preds)
classifier.weights <- classifier.weights/sum(classifier.weights)

predictions <- as.numeric(t(classifier.weights)%*%out)
predictions <- ifelse(predictions > 0, 1, -1)
sum(predictions == titanic.data$survived)/nrow(titanic.data)

The above code does the ada-boost part. We store the weight of each classifier in the final ensemble in the "classifier.weights" variable and the predictions of each classifier in the "preds" list (a list of numeric -1/+1 prediction vectors, one per round of boosting). Unlike the approach defined above, where we already have a pool of weak learners and in each boosting round select the one with the best performance on the weighted instances, here we incrementally build the weak learners, each one trained with the instance weights assigned in the previous round according to how difficult the instances were to predict.

In the above code, we do 5 rounds of boosting, and in each round update the instance weights based on whether the instances were classified correctly or incorrectly by the current classifier. The next classifier (next round of boosting) is trained with these updated instance weights. After all the rounds, we combine the classifier outputs using the corresponding classifier weights and compute the final class labels. The final prediction accuracy on the training data with 5 rounds of boosting is 86.6%, i.e. about a 5 percentage point improvement over the accuracy without any boosting.
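One limitation of the loop above is that it stores only the training-set predictions, not the fitted models, so it cannot score unseen passengers as written. Below is a minimal sketch of how it could be extended, assuming each round's fitted model is additionally saved in a list models (e.g. models[[i]] <- fit inside the loop) and that a hypothetical held-out data frame titanic.test has been preprocessed in the same way as the training data:

# weighted -1/+1 vote of the stored per-round models on new data
predict.adaboost <- function(models, classifier.weights, newdata) {
  votes <- sapply(models, function(m) as.numeric(as.character(predict(m, newdata, type = "class"))))
  ifelse(as.numeric(votes %*% classifier.weights) > 0, 1, -1)
}

# test.predictions <- predict.adaboost(models, classifier.weights, titanic.test)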
