
Classification with Imbalanced Data Sets

In credit card fraud analysis, most datasets are highly skewed, since the number of valid transactions far outweighs the number of fraudulent transactions (in many cases the ratio of valid to fraudulent transactions can be as skewed as 98% to 2%). Without fitting any classification model to the training data, if we simply predicted every unknown transaction to be valid, we would be correct 98% of the time.

Even if we fit a model to the training data, the model would overfit to the class representing valid transactions (in order to reduce the overall training error), and it would most likely classify an unknown transaction as valid.

But that is not what we want. Given a fraudulent instance, we want to predict it as fraudulent with high confidence, even if that means we predict completely valid transactions as fraudulent in certain cases.

We will be reviewing some popular techniques to deal with class imbalance in this post. We will be using the Credit Card fraud detection dataset from Kaggle to demonstrate each technique.

A few of the popular techniques for handling imbalanced datasets are :

  • Using a metric other than accuracy to judge the performance of the classifier.
    • Precision and Recall are preferred over accuracy in case of imbalanced data.
    • For classifiers estimating class probabilities we can plot the ROC curve of the predictions and use the AUC (Area Under Curve) value as a performance metric for the classifier.
  • Under-sampling and Oversampling.
    • Under-sampling the majority class (the class having a proportionately large number of samples, e.g. all valid transactions) and/or oversampling the minority class (the class having very few samples, e.g. all fraudulent transactions) so that the number of samples for both classes is balanced.
    • Create ensemble of models with different sampled data and use majority vote of the ensembles as the output.
  • Synthetic generation of samples from the minority class using the SMOTE algorithm.
  • Assigning class weights
    • Assign high misclassification cost to minority examples.

We will look at each of the approaches above in detail. First let's read the credit card fraud dataset in R and split it into training and testing sets (70-30 split). We will normalize each variable (feature) using the formula :

\frac{X-\min(X)}{\max(X)-\min(X)}

# Min-max normalization of a numeric vector
normalize <- function(v) (v-min(v))/(max(v)-min(v))
credit.card.data <- read.csv(file="/Users/funktor/Downloads/creditcard.csv")

# Columns 1 to 30 (Time, V1-V28, Amount) are the features; column 31 is the Class label
credit.card.data.normalized <- credit.card.data
credit.card.data.normalized[,1:30] <- apply(credit.card.data.normalized[,1:30], 2, normalize)

# 70-30 train-test split
train.idx <- sample(1:nrow(credit.card.data.normalized), 0.7*nrow(credit.card.data.normalized))
test.idx <- seq(1, nrow(credit.card.data.normalized))[-train.idx]

train.data <- credit.card.data.normalized[train.idx,]
test.data <- credit.card.data.normalized[test.idx,]

train.classes <- train.data$Class
test.classes <- test.data$Class

Out of the 199364 examples in the training data, only 345 (0.173%) belong to the positive class, i.e. fraudulent transactions. Similarly, out of the 85443 examples in the testing data, only 147 (0.172%) are fraudulent. Thus the training and testing sets have similar class distributions.
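
These counts can be verified with a quick look at the class distributions of the two splits (a small sketch using the variables defined above) :

# Class counts and proportions in the training and testing splits
table(train.classes)
prop.table(table(train.classes))

table(test.classes)
prop.table(table(test.classes))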

Let's start by fitting a logistic regression model to the training data. We will use the "glm" function with the binomial family :

model <- glm(Class ~., family=binomial(link='logit'), data=train.data)

After the model is built, let's look at the summary of the model : how well the model fits the training data, which variables are statistically significant, and so on.

summary(model)

We get the below output :

Call:
glm(formula = Class ~ ., family = binomial(link = "logit"), data = train.data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-5.0430  -0.0289  -0.0191  -0.0121   4.5960  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  61.6422    17.6983   3.483 0.000496 ***
Time         -0.5199     0.4692  -1.108 0.267851    
V1            5.2274     2.9200   1.790 0.073417 .  
V2            0.6359     6.2319   0.102 0.918727    
V3           -1.5868     3.6366  -0.436 0.662598    
V4           15.8418     1.9359   8.183 2.76e-16 ***
V5           11.8327    12.1337   0.975 0.329467    
V6          -12.6074     9.0344  -1.395 0.162867    
V7          -11.6312    13.0690  -0.890 0.373473    
V8          -17.6993     3.3686  -5.254 1.49e-07 ***
V9           -7.6414     3.7619  -2.031 0.042227 *  
V10         -33.8980     5.5055  -6.157 7.41e-10 ***
V11           1.3749     1.6375   0.840 0.401095    
V12           2.4003     2.7484   0.873 0.382473    
V13          -4.5250     1.2564  -3.602 0.000316 ***
V14         -17.7989     2.2220  -8.010 1.14e-15 ***
V15          -1.6549     1.3848  -1.195 0.232065    
V16          -7.1628     4.5380  -1.578 0.114471    
V17          -0.5475     2.8267  -0.194 0.846424    
V18          -0.5300     2.1760  -0.244 0.807586    
V19           1.1120     1.4756   0.754 0.451087    
V20         -36.0477     8.9585  -4.024 5.73e-05 ***
V21          22.7594     4.4387   5.128 2.94e-07 ***
V22          13.3745     3.3963   3.938 8.22e-05 ***
V23          -3.1440     5.0564  -0.622 0.534085    
V24           1.0899     1.3232   0.824 0.410116    
V25           0.1134     2.8459   0.040 0.968208    
V26           0.6039     1.3403   0.451 0.652264    
V27         -51.0509     7.5735  -6.741 1.58e-11 ***
V28         -22.2150     6.2780  -3.539 0.000402 ***
Amount       19.3277    11.4661   1.686 0.091865 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 5077.3  on 199363  degrees of freedom
Residual deviance: 1562.9  on 199333  degrees of freedom
AIC: 1624.9

Number of Fisher Scoring iterations: 12

The coefficients table gives the estimated weight for each feature (in the first column) and its statistical significance as a p-value (in the last column).

If a feature has a p-value less than 0.05, it is considered statistically significant. The more stars at the end of a row, the more statistically significant that feature is for classification.
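
If needed, the statistically significant coefficients can also be pulled out programmatically from the summary (a small illustration, not part of the original analysis) :

# Coefficients of the fitted model with p-value below 0.05
coefs <- summary(model)$coefficients
coefs[coefs[, "Pr(>|z|)"] < 0.05, ]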

Let's do a prediction on the testing data with the built model.

pred <- predict(model, newdata = subset(test.data, select=seq(1,30)), type = 'response')

The 'pred' variable contains, for each row of the testing data, the predicted probability of belonging to class 1, i.e. a fraudulent transaction.

To obtain the predicted class, we use a threshold (generally 0.5) such that all predictions with probabilities below the threshold are classified into the negative class (valid transactions) and all predictions with probabilities above the threshold are classified into the positive class (fraud transactions).

Using 0.5 as the threshold, we obtain :

out <- ifelse(pred > 0.5, 1, 0)

Now, to naively judge the performance of the classifier, we compute the ratio of correct predictions to total predictions, which gives us an accuracy of 99.92%. That looks impressive at first sight.

Let's look at the confusion matrix for the testing data.

   test.classes
out     0     1
  0 85275    45
  1    21   102
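
For reference, the accuracy and the confusion matrix above can be computed directly from the predictions (a small sketch using the variables defined earlier) :

# Overall accuracy : fraction of correct predictions
mean(out == test.classes)

# Confusion matrix of predicted vs actual classes
table(out, test.classes)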

Although our classification accuracy is 99.92%, a fraudulent transaction is correctly detected as fraud only 102/147 = 69.4% of the time (the recall).

Using a Performance Metric Other Than Accuracy

Precision gives the fraction of predicted positive instances that are actually positive; in our example it is 102/(102+21) = 83%. Recall gives the fraction of actual positive instances that are predicted correctly; in our example it is 69.4%.

\text{Precision}=\frac{TP}{TP+FP}

\text{Recall}=\frac{TP}{TP+FN}

These metrics give us better insight into the performance of the classifier w.r.t. the positive class.

Earlier we chose a threshold value of 0.5 for the positive class, i.e. any prediction with probability greater than 0.5 is classified as positive. If instead we take the threshold to be 0.1, the precision drops to around 78% while the recall greatly improves to 84.3%.

If better recall is our objective, then a threshold of 0.1 is better than a threshold of 0.5.
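
The precision and recall figures at the two thresholds can be reproduced with a small helper function (a sketch; "prec.recall" is not from the original post) :

# Precision and recall of the positive class at a given probability threshold
prec.recall <- function(pred, actual, threshold = 0.5) {
  out <- ifelse(pred > threshold, 1, 0)
  tp <- sum(out == 1 & actual == 1)
  fp <- sum(out == 1 & actual == 0)
  fn <- sum(out == 0 & actual == 1)
  c(precision = tp/(tp + fp), recall = tp/(tp + fn))
}

prec.recall(pred, test.classes, 0.5)
prec.recall(pred, test.classes, 0.1)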

Precision-Recall Curve

A good threshold value should improve the true positive rate more than the false positive rate. True positive rate is the fraction of actual positive instances predicted to be positive, whereas false positive rate is the fraction of actual negative instances predicted to be positive.

When we decrease the threshold value, the number of positive predictions increases, which increases both the number of correct positive predictions (true positives) and the number of incorrect positive predictions (false positives). In an ideal scenario, all positive instances receive predicted probabilities above some threshold and all negative instances receive probabilities below that same threshold; in that case the true positive rate is 1.0 and the false positive rate is 0.

The plot of the true positive rate vs false positive rate with varying values of probability thresholds is called the ROC curve.

The R code to generate the ROC Curve :

library(ROCR)

pr <- prediction(pred, test.data$Class)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)

ROC Curve

The performance of the classifier is given by the area under the curve (AUC).

auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc

The AUC value is 0.98. The higher the AUC, the better the performance of the classifier on an imbalanced dataset. From the curve, we can estimate that for some particular threshold value, the FPR is nearly 0 while the TPR is around 82%.

From the output we find that with a threshold of 0.01, the TPR is 88.4% and the FPR is 0.3%. The precision is 32% and the recall is 88.4% at this threshold.
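
The TPR and FPR at a specific threshold can also be computed directly from the predictions (a small sketch) :

# True positive rate and false positive rate at threshold 0.01
out <- ifelse(pred > 0.01, 1, 0)
tpr <- sum(out == 1 & test.classes == 1)/sum(test.classes == 1)
fpr <- sum(out == 1 & test.classes == 0)/sum(test.classes == 0)
c(tpr = tpr, fpr = fpr)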

Undersampling and Oversampling

Since the contribution of the positive class in our example dataset is only about 0.17%, one obvious way to increase the share of the positive examples is by replicating the same examples.

Sample with replacement from the positive class and add them back again to the training data set, thus increasing the percentage of samples for the positive class. This method is known as oversampling.

The disadvantage of this approach is that the variance of the features is reduced. Also, if the same example is replicated 5 times in the training data, then a false negative error on that example is effectively counted 6 times, so the classifier is penalized much more heavily for misclassifying an oversampled example.

Let's sample with replacement from the positive class, 5 times the actual number of positive examples :

indices <- which(train.classes == 1)
indices <- sample(indices, 5*length(indices), replace = T)

Then we update the training data :

train.data <- rbind(train.data, train.data[indices,])
train.classes <- c(train.classes, train.classes[indices])

Now we again build the GLM model and compute the precision, recall and the AUC values.
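
A minimal sketch of re-fitting and re-evaluating on the oversampled training data (the same steps as before, assuming the ROCR package is loaded) :

# Re-fit logistic regression on the oversampled training data
model <- glm(Class ~., family=binomial(link='logit'), data=train.data)
pred <- predict(model, newdata = subset(test.data, select=seq(1,30)), type = 'response')

# Precision/recall at threshold 0.5 and the AUC
out <- ifelse(pred > 0.5, 1, 0)
table(out, test.classes)

pr <- prediction(pred, test.classes)
performance(pr, measure = "auc")@y.values[[1]]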

With a probability threshold of 0.5, we obtain a precision of 78% and a recall of 86.4%. On plotting the ROC curve and computing the AUC, we obtain an AUC value of 98.3% (a 0.3% improvement).

In undersampling, we remove some of the examples from the negative (majority) class to balance the negative and positive class.

The disadvantage of this approach is that information about discriminative features important for separating the negative and positive classes might get lost. If the negative class contains multiple "concepts" (topics/clusters) and samples from a smaller "concept" are removed, then that "concept" gets weakened and the classifier cannot properly identify it.

Concepts in classification, taken from "Learning from Imbalanced Data" by H. He

In the above diagram the circles are examples from the negative class (majority) and the stars are examples from the positive class (minority). The negative class has 2 sub-concepts A and D.

With undersampling, we will use an ensemble of logit models, since each under-sampled training set leaves us with very few samples from both classes to work with.

We will use each model to predict on the testing data, which generates a matrix of probabilities in which each row represents the predictions from one model. We then take the mean probability across the 100 models for each test example.

Let's look at how our logistic regression classifier performs with undersampling from the majority class.

under.sampled <- function(idx0, idx1, train.data) {
  # Randomly sub-sample half of each class for this ensemble member
  idx0 <- sample(idx0, 0.5*length(idx0))
  idx1 <- sample(idx1, 0.5*length(idx1))
  
  # Under-sample the negative class down to the size of the positive class
  idx0 <- sample(idx0, length(idx1))
  
  train.data <- rbind(train.data[idx0,], train.data[idx1,])
  
  glm(Class ~., family=binomial(link='logit'), data=train.data)
}

where

idx0 <- which(train.classes == 0)
idx1 <- which(train.classes == 1)

Creating the 100 models :

models <- lapply(1:100, function(i) under.sampled(idx0, idx1, train.data))

and generating the prediction probability matrix for 100 models :

preds <- do.call(rbind, lapply(models, function(model) predict(model, newdata = subset(test.data, select=seq(1,30)), type = 'response')))

We take the mean probability across the 100 models for each test example :

pred <- apply(preds, 2, mean)

On plotting the ROC curve and computing the AUC, we obtain an area under the curve of 95.6%.

At a threshold value of 0.5, we obtain a precision of 4.8% and a recall of 91.1% for the positive class. Thus we see that the undersampling method significantly improves the detection of fraud instances (the positive class) but predicts many more valid transactions as fraud. This is not necessarily undesirable in high-risk applications such as credit card fraud detection or cancer detection, where the cost of misclassifying a fraud is much higher than the cost of flagging a valid transaction as fraud.

There has been research on modified under-sampling approaches :

  • Removing Tomek Links. Tomek links are all those pairs of samples, where in each pair, the two samples are each other's nearest neighbor but one belongs to the positive class and the other to the negative class.
    • From each Tomek link, the sample belonging to the negative class is removed.
    • Negative class samples which are more similar to the positive class samples (as in Tomek links) create more confusion for the classifier, and hence they are removed before building the model (a small sketch for finding Tomek links is given after this list).
Tomek Links removal, downloaded from "https://svds.com/learning-imbalanced-classes/"

  • Using an ensemble of decision trees instead of logistic regression for bagging and boosting, since decision trees are non-linear classifiers unlike logistic regression.
  • Remove samples from majority class which are not very informative.
    • For example, if a majority class sample's nearest neighbors are all or mostly from the majority class itself, then this sample does not contain enough "discriminative features" and can be removed.
  • Remove samples from majority class which are surrounded by minority class samples.
    • For example, samples from majority class whose nearest neighbors are all or mostly from the minority class (similar to Tomek Links) will create confusion for the classifier. Tomek Links are a special case of this approach which uses 1-nearest neighbor.
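
Below is a hypothetical sketch of finding Tomek links via mutual 1-nearest neighbors ("find.tomek.links" is my own helper, not from the original post); in practice it is only feasible on a subsample, since it needs the full pairwise distance matrix :

# Hypothetical sketch : find Tomek links, i.e. pairs of samples that are each
# other's nearest neighbor but carry different class labels.
find.tomek.links <- function(X, y) {
  d <- as.matrix(dist(X))
  diag(d) <- Inf                                        # exclude self-distances
  nn <- apply(d, 1, which.min)                          # 1-nearest neighbor of each sample
  links <- which(nn[nn] == seq_along(nn) & y != y[nn])  # mutual NNs with different labels
  unique(t(apply(cbind(links, nn[links]), 1, sort)))    # report each pair once
}
# From each link, the negative class member would then be dropped from training.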

Using the C5.0 R function to create an ensemble of boosted decision trees, we modify the "under.sampled" function defined above :

library(C50)

under.sampled <- function(idx0, idx1, train.data) {
  # Randomly sub-sample half of each class for this ensemble member
  idx0 <- sample(idx0, 0.5*length(idx0))
  idx1 <- sample(idx1, 0.5*length(idx1))
  
  # Under-sample the negative class down to the size of the positive class
  idx0 <- sample(idx0, length(idx1))
  
  train.data <- rbind(train.data[idx0,], train.data[idx1,])
  
  # Boosted decision trees with 20 boosting iterations
  C5.0(train.data[,1:30], as.factor(train.data[,31]), trials=20)
}

"C5.0" function creates boosted decision trees. The "trials" parameters specifies the number of boosting iterations to be done. We keep at 20 iterations.

preds <- do.call(rbind, lapply(models, function(model) predict(model, newdata = subset(test.data, select=seq(1,30)), type = 'prob')[, "1"]))

Instead of returning class labels from the C5.0 prediction, we are returning the class probabilities. The above only extracts the class probabilities for the positive class. On taking the mean of the probabilities across models for each test example as before, and then using 0.5 as the threshold, we obtain the following confusion matrix.

   test.classes
out     0     1
  0 83525    15
  1  1771   132

The AUC is 98.3%, the same as above. We obtained a precision of 6.9% but a recall of 89.8% on the positive class. Without any sampling, if we build a C5.0 model with the full training data and classify the test data, we obtain a precision of 85.1% with a recall of 85.7% on the positive class.

Note : The precision, recall and AUC numbers will vary across runs, since the training and testing data as well as the ensemble sub-samples are drawn randomly.

Since the number of samples in the training data is quite large (199364), computing the pairwise distance matrix would take a long time and would require more memory than is available on my personal laptop.

So instead of finding the Tomek links, or finding which negative class samples are mostly surrounded by positive class samples, we use an alternative technique that is faster and less memory intensive :

  • Compute the centroid of all positive class samples. Call this C.
  • Compute the mean distance of positive class samples from the centroid C.
  • Compute the distance of each negative class sample from the positive class centroid C.
  • Remove from the training data any negative class sample that is closer to the centroid C than the mean distance computed in step 2 (i.e. closer to C than roughly 50% of the positive class samples are).

In essence, we remove all negative class samples that lie closer to the positive class centroid than roughly half of the positive class samples do.

distance.measure <- function(vec, mat) {
  # Squared Euclidean distance between a vector and each row of a matrix
  mat2 <- t(vec-t(mat))
  rowSums(mat2*mat2)
}

"distance.measure" function computes the euclidean distance between a vector and a matrix.

idx0 <- which(train.classes == 0)
idx1 <- which(train.classes == 1)

# Centroid of the positive class and the mean (squared) distance of positives from it
centroid <- colMeans(train.data[idx1,1:30])
mean.dist <- mean(as.numeric(distance.measure(centroid, train.data[idx1,1:30])))

# Keep only the negative samples that are farther from the centroid than this mean distance
dists <- distance.measure(centroid, train.data[idx0,1:30])
idx2 <- idx0[which(dists > mean.dist)]

train.data <- rbind(train.data[idx2,], train.data[idx1,])

We obtained around a 23% reduction in the training data after removing all such negative class samples. We then trained the logistic regression model on the reduced training data : precision was 81% and recall was 80% (around an 11% improvement in recall over the model trained with the whole negative class). The AUC value increased from 98% to 98.4%.

On combining the above under-sampling approach with 200% oversampling of the minority class, we obtain a precision of 79%, a recall of 87% and an AUC of 98.6%, which is probably the best result obtained with logistic regression from all the above techniques.
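
A sketch of this combination, reading "200% oversampling" as adding two bootstrap copies of the minority class on top of the reduced training data above (my interpretation, not code from the original post) :

# 200% oversampling of the minority class on the centroid-reduced training data
idx1 <- which(train.data$Class == 1)
extra <- sample(idx1, 2*length(idx1), replace = T)
train.data <- rbind(train.data, train.data[extra,])

# Re-fit logistic regression and evaluate as before
model <- glm(Class ~., family=binomial(link='logit'), data=train.data)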

SMOTE

SMOTE (Synthetic Minority Oversampling Technique) is one of the oversampling techniques for the minority class.

Earlier we were sampling with replacement from the existing examples and adding them back; no new examples were generated, but the cost of misclassifying a positive example was multiplied by the oversampling factor. In SMOTE, we instead generate synthetic samples for the positive class.

The algorithm is as follows :

  • For each positive class example P, find its K nearest neighbors from the positive class only.
  • Depending on the oversampling factor M (<K), sample M random nearest neighbors from the above step.
  • For each of the M nearest neighbors above, generate a new example as follows.
  • Let P' denote one of these nearest neighbors. Generate a random number H between 0 and 1, multiply H by the difference between P' and P, and add the result to P. The new example P'' thus generated lies on the line joining P and P'.

P'' = P + H \cdot (P' - P), where H is a random number between 0 and 1

  • Thus for each P, we generate M different P'' new examples.

SMOTE, pic downloaded from StackOverflow "http://stackoverflow.com/questions/19089913/data-imbalance-in-svm-using-libsvm"

One of the drawbacks of SMOTE is that it can only generate positive class samples within the body of the already existing samples; no new variants outside the sphere of the existing positive class examples are generated with this approach.

# Pairwise distance matrix between the positive class examples
dist.mat <- as.matrix(dist(train.data[idx1,1:30], method = "euclidean", diag = T, upper = T))
orig.mat <- train.data[idx1,1:30]

docs <- rownames(dist.mat)

new.data <- do.call(rbind, lapply(1:nrow(dist.mat), function(i) {
  # 5 nearest neighbors of example i (position 1 of the sorted row is the example itself)
  p <- as.integer(names(sort(dist.mat[i,])[2:6]))
  # Sample 2 of the neighbors and a random interpolation factor in (0, 1]
  p1 <- sample(p, 2)
  m <- sample(1:1000, 1)/1000
  do.call(rbind, lapply(p1, function(j) orig.mat[docs[i],] + m*(orig.mat[as.character(j),]-orig.mat[docs[i],])))
}))

new.data$Class <- rep(1, nrow(new.data))

train.data <- rbind(train.data, new.data)

In the above code, we first compute the pairwise distance matrix for the positive class examples; then, for each positive example, we compute its 5 nearest neighbors, sample 2 of them, and generate a synthetic example along the line connecting the positive example and each sampled nearest neighbor.

With the above implementation of SMOTE, we obtained a precision of 81% and recall of 83% (with threshold 0.5) using logistic regression. The AUC value obtained is 98.3%.

There are other variations of this method that look to improve on SMOTE.

Categories: MACHINE LEARNING, PROBLEM SOLVING
