Stokastik

Machine Learning, AI and Programming

Dimensionality Reduction using Restricted Boltzmann Machines

Restricted Boltzmann Machines (RBMs) are unsupervised machine learning models that are useful for dimensionality reduction, classification (as building blocks of deep neural networks), regression, collaborative filtering (for recommendation engines), topic modeling and so on. The functionality of RBMs is somewhat similar to PCA (SVD), PLSA, LDA etc., which transform the input features into a lower dimensional space while capturing the dependencies between different features. RBMs have also been used successfully in problems involving missing/unobserved data. For example, in movie recommendation engines, RBMs have been used to infer missing/unobserved ratings for movies. But the most groundbreaking application of RBMs has been in Deep Learning, where they are used to initialize the weights of deep neural networks in an unsupervised manner.

RBMs fall under the more general class of probabilistic graphical models, and an RBM can be interpreted as a Markov Random Field. But for people like me who are not very well versed in probabilistic graphical models, an RBM is just a stochastic neural network with a visible (input) layer and a hidden layer.

[Figure: Restricted Boltzmann Machine with 4 visible units and 3 hidden units]

The figure above shows an RBM with 4 visible (input) units and 3 hidden units. An RBM is a generative model: given the observed input data (pixels of input images, bag of words from text documents etc.), it models the probability distribution of the observed examples in terms of the visible units, the hidden units and the unknown weights connecting the visible and the hidden units, so as to maximize the likelihood of the observations. For example, consider an input text document X with the feature vector :

X=[v_1=1, v_2=0, v_3=1, v_4=1]

The visible units are conditionally independent of one another given the hidden units, so conditioned on a hidden configuration h the probability of the input example X factorizes as :

P(X|h)=P(v_1=1|h)*P(v_2=0|h)*P(v_3=1|h)*P(v_4=1|h)

How do we compute these probabilities? They follow from the joint distribution that the RBM defines over the visible and hidden units.

The joint probability of the visible (v) and the hidden units (h) is given by the Gibbs distribution :

P(v, h) = \frac{1}{Z}e^{-E(v, h)}

where Z is the normalization constant Z=\sum_{v, h}e^{-E(v, h)}, the sum being over all possible configurations of the visible and the hidden units, and

E(v, h)=-\sum_{i=1}^4v_ib_{v_i}-\sum_{j=1}^3h_jb_{h_j}-\sum_{i=1}^4\sum_{j=1}^3w_{ij}v_ih_j

is the energy of a particular configuration, b_{v_i} are the biases for the visible units and b_{h_j} are the biases for the hidden units.
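
For example, for the configuration v=[1, 0, 1, 1] and h=[1, 0, 1], only the units that are "on" contribute to the energy, so

E(v, h)=-(b_{v_1}+b_{v_3}+b_{v_4})-(b_{h_1}+b_{h_3})-(w_{11}+w_{13}+w_{31}+w_{33}+w_{41}+w_{43})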

Thus the probability of a particular configuration of the visible units is :

P(v)=\sum_hP(v, h), the sum is over all possible configurations of the hidden units.

The log likelihood of the observation v, denoted as log\;P(v) is :

L_v=log\;P(v)=log\;\sum_hP(v, h)=log\;\sum_h\frac{1}{Z}e^{-E(v, h)}=log\;\sum_he^{-E(v, h)}-log\;\sum_{v, h}e^{-E(v, h)}

For all the N observations (N documents, N images etc.), the overall log likelihood is the summation of the individual log likelihoods L=\sum_vL_v.

The above quantity involves the log of a summation, so there is no closed-form analytical solution for the maximum; instead we use stochastic gradient ascent. To compute the unknown weights w_{ij} that maximize the likelihood, we iteratively update them with gradient ascent :

w^{k+1}_{ij}=w^k_{ij} + \eta*\frac{{\partial}L}{{\partial}w_{ij}} - \lambda*w^k_{ij} + \nu*(w^k_{ij}-w^{k-1}_{ij})

where \eta is the learning rate, the term \lambda*w^k_{ij} is a regularization (weight decay) term so that the magnitudes of the weights do not become too large, and \nu*(w^k_{ij}-w^{k-1}_{ij}) is the momentum term that helps prevent oscillations.

Similarly we update the values of the biases b_{v_i} and b_{h_j}.

It can be shown that :

\frac{{\partial}L}{{\partial}w_{ij}}=-\sum_hP(h|v)*\frac{{\partial}E(v, h)}{{\partial}w_{ij}}+\sum_{v, h}P(v, h)*\frac{{\partial}E(v, h)}{{\partial}w_{ij}} and

\frac{{\partial}E(v, h)}{{\partial}w_{ij}}=-v_ih_j
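
A brief sketch of where the first identity comes from: differentiating L_v term by term,

\frac{{\partial}L_v}{{\partial}w_{ij}}=\frac{\sum_he^{-E(v, h)}\left(-\frac{{\partial}E(v, h)}{{\partial}w_{ij}}\right)}{\sum_he^{-E(v, h)}}-\frac{\sum_{v, h}e^{-E(v, h)}\left(-\frac{{\partial}E(v, h)}{{\partial}w_{ij}}\right)}{\sum_{v, h}e^{-E(v, h)}}

which equals -\sum_hP(h|v)\frac{{\partial}E(v, h)}{{\partial}w_{ij}}+\sum_{v, h}P(v, h)\frac{{\partial}E(v, h)}{{\partial}w_{ij}}, because \frac{e^{-E(v, h)}}{\sum_he^{-E(v, h)}}=\frac{P(v, h)}{P(v)}=P(h|v) and \frac{e^{-E(v, h)}}{\sum_{v, h}e^{-E(v, h)}}=P(v, h).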

Since the hidden units are conditionally independent of one another given the visible units, we have :

P(h|v)=P(h_1|v)*P(h_2|v)*P(h_3|v) for the RBM in the diagram above.

Given the visible unit states, one can compute the hidden unit activations as follows :

P(h_j=1|v)=\sigma(\sum_{i=1}^4w_{ij}v_i+b_{h_j})

where \sigma(x)=\frac{1}{1+e^{-x}} is the sigmoid activation function.
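
By the symmetry of the energy function, the conditional distribution of a visible unit given the hidden states has the analogous form (this is what is used later in the reconstruction step) :

P(v_i=1|h)=\sigma(\sum_{j=1}^3w_{ij}h_j+b_{v_i})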

Thus we have :

\frac{{\partial}L}{{\partial}w_{ij}}=\sum_hP(h|v)v_ih_j-\sum_{v, h}P(v, h)v_ih_j

In the above expression, the first term is the expectation of v_ih_j w.r.t. the conditional probability of the hidden states given the visible states, and the second term is the expectation of v_ih_j w.r.t. the joint probability of the visible and hidden states.

Thus we can write it as :

\frac{{\partial}L}{{\partial}w_{ij}}=E_{P(h|v)}[v_ih_j]-E_{P(h, v)}[v_ih_j]

Computing the above expectations exactly requires summing over all possible configurations (v, h) of the RBM, which is prohibitively expensive when the number of visible and hidden units is large. Instead, an algorithm called Contrastive Divergence uses Gibbs sampling to draw visible and hidden unit states that approximately represent the true distribution, and we compute an approximate gradient of the likelihood from these samples.

In the above expression, the first expectation can be thought of as follows: given the observed data v^0, we construct the hidden states h^0 by moving in the forward direction. The second expectation corresponds to the reconstruction v^1 of the observed data from the hidden states h^0 in the backward direction. The difference represents how well the hidden states can reconstruct the observed (visible) states. If the reconstruction v^1 is very close to the original data v^0, then the above difference, and hence the gradient, will be very small.

The basic Contrastive Divergence (CD-1) algorithm is as follows :

  • Randomly initialize the weights w_{ij}, and the biases b_{v_i} and b_{h_j}.
  • Given an input example with the feature vector v^0, compute the hidden unit probabilities h^0_j, where

h^0_j=\sigma(\sum_iw_{ij}v^0_i+b_{h_j})

  • Given the hidden units' probabilities h^0_j, sample the binary hidden states. (Generally the visible and the hidden states are binary, i.e. 0 or 1.)
  • With the sampled hidden states h^0, reconstruct the visible states v^1 by computing :

v^1_i=\sigma(\sum_jw_{ij}h^0_j+b_{v_i})

  • Again sample the binary visible unit states v^1_i from the above probabilities. The states are either 0 or 1.
  • Compute the hidden state probabilities h^1_j=\sigma(\sum_iw_{ij}v^1_i+b_{h_j}) from the reconstructed visible states v^1.
  • Compute the matrices :

F=v^0(h^0)^T and B=v^1(h^1)^T

  • Update the values of the weights and the biases :

w^{t+1}_{ij}=w^t_{ij}+\eta*(F_{ij}-B_{ij})-\lambda*w^t_{ij} + \nu*(w^t_{ij}-w^{t-1}_{ij})

b^{t+1}_{v_i}=b^t_{v_i}+\eta*(v^0_i-v^1_i)

b^{t+1}_{h_j}=b^t_{h_j}+\eta*(h^0_j-h^1_j)

  • Repeat the above steps for all examples.

In the simplest version, we reconstruct the visible states only once and take the outer product with the corresponding hidden states, i.e. v^1(h^1)^T, and subtract it from v^0(h^0)^T. But one can continue the construct/re-construct cycle for K iterations; in the CD-K algorithm we compute

F=v^0(h^0)^T and B=v^K(h^K)^T

and take their difference. Higher values of K increase the computation per update, but the gradient estimate is more accurate and convergence is faster.
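
To make the updates concrete, here is a minimal R sketch of a single CD-1 step for one binary example. This is only an illustration and not the deepnet implementation; the function name cd1.update, the weight matrix W (num. visible x num. hidden), the bias vectors b.v and b.h, and the learning rate eta are assumed names for this sketch, and the hidden probabilities (rather than sampled states) are used for h^1, which is a common variant.

sigmoid <- function(x) 1 / (1 + exp(-x))

cd1.update <- function(v0, W, b.v, b.h, eta = 0.1) {
  # forward pass: hidden probabilities and sampled binary hidden states h^0
  h0.prob <- sigmoid(as.vector(t(W) %*% v0 + b.h))
  h0 <- as.numeric(runif(length(h0.prob)) < h0.prob)

  # backward pass: reconstruct the visible probabilities and sample v^1
  v1.prob <- sigmoid(as.vector(W %*% h0 + b.v))
  v1 <- as.numeric(runif(length(v1.prob)) < v1.prob)

  # hidden probabilities h^1 computed from the reconstruction v^1
  h1.prob <- sigmoid(as.vector(t(W) %*% v1 + b.h))

  # positive and negative phase statistics: F = v^0 (h^0)^T, B = v^1 (h^1)^T
  F <- outer(v0, h0)
  B <- outer(v1, h1.prob)

  # gradient ascent updates (weight decay and momentum terms omitted for brevity)
  list(W = W + eta * (F - B),
       b.v = b.v + eta * (v0 - v1),
       b.h = b.h + eta * (h0 - h1.prob))
}

Looping this update over all training examples (and over several epochs) gives the CD-1 training described above; for CD-K the sampling steps are simply repeated K times before the statistics are computed.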

The purpose of this post is to demonstrate the feature reduction functionality of RBMs. For that purpose, we will use the spam vs. ham SMS dataset from Kaggle. Each example contains the SMS content and the label spam or ham. We will perform the following pre-processing steps on the SMS text contents.

  • Remove non-alphanumeric characters.
  • Remove stop-words (from a pre-defined list of common words)
  • Do not convert the contents to lowercase, since spam messages tend to contain fully capitalized words predominantly. Use the number of fully capitalized words in an SMS as a feature.
  • Generate unigrams and bigrams of words as features from the SMS contents.
  • Create the document term matrix out of the extracted features with TF (Term Frequency) weighting.
  • Divide the dataset into training and testing (70:30 split).

In the first experiment, we will use an XGBOOST classifier from R to model the training examples without any feature reduction, then predict on the testing data and compute the precision, recall and the AUC value from the ROC curve. Since the dataset is imbalanced (the number of ham instances is much higher than the number of spam instances), we will use a class weight of 2 for the spam instances and a weight of 1 for the ham instances.

In the next experiment, we will use the "rbm.train" method from the "deepnet" package in R to do the feature reduction. The method takes the document term matrix and returns the learned weights and biases of the RBM. Each document (in the visible state space) is then mapped to the hidden state space using the learned weights and biases via the "rbm.up" method. We use 1000 hidden units with the CD-5 algorithm. After this step the document term matrix becomes a dense matrix of 1000 (transformed) features, which is again passed to the XGBOOST algorithm to build the model; as above, we compute the precision, recall and AUC values and compare them with the first approach.

library(Rcpp)
library(deepnet)
library(xgboost)
library(ROCR)
library(tm)      # for stripWhitespace()

f <- read.csv("spam.csv")

class.labels <- ifelse(f$v1 == "spam", 1, 0)
text.data <- as.character(f$v2)

text.data <- gsub("[^a-zA-Z0-9/:-_]|\r|\n|\t", " ", text.data)
text.data <- gsub("\b[a-zA-Z0-9/:-]{1,2}\b", "", text.data)


contents <- lapply(text.data, function(x) stripWhitespace(x))

allCapsCounts <- sapply(contents, function(x) length(stringr::str_match_all(x, "\\b[A-Z]+\\b")[[1]][,1]))

hasAllCaps <- which(allCapsCounts > 0)

In the above code, we read the "spam.csv" input data file and assign the class label 1 to spam and 0 to ham. We remove all characters except alphanumerics and a few special characters, remove words of length at most 2 characters, and then strip any extra whitespace.

The "allCapsCount" variable holds the number of fully capitalized words per SMS content.

fileFeatures <- cpp__fileFeatures(contents, mystopwords, 1, 2)

dtmCpp <- cpp__createDTM(fileFeatures)

dtmCpp$i <- c(dtmCpp$i, hasAllCaps)
dtmCpp$j <- c(dtmCpp$j, rep(max(dtmCpp$j)+1, length(hasAllCaps)))
dtmCpp$v <- c(dtmCpp$v, allCapsCounts[hasAllCaps])
dtmCpp$Terms <- c(dtmCpp$Terms, "__ALL__CAPS__COUNTS__")


dtm <- slam::simple_triplet_matrix(i=dtmCpp$i, j=dtmCpp$j, v=dtmCpp$v, nrow=max(dtmCpp$i), ncol=max(dtmCpp$j), 
                                   dimnames=list("Docs"=as.character(1:max(dtmCpp$i)), "Terms"=dtmCpp$Terms))

The "cpp__fileFeatures" function is written in C++, that removes stopwords (stored in the "mystopwords" variable) and returns unigrams and bigrams features per SMS content. The "cpp__createDTM" function is also written in C++, that takes as input the extracted features per document and returns a document term triplet matrix in (i, j, v) format.

Next we are adding the counts of fully capitalized words computed earlier to the "dtmCpp" document term triplet matrix.

train.idx <- sample(1:length(text.data), 0.7*length(text.data))

train.dtm <- dtm[train.idx,]
train.classes <- class.labels[train.idx]

test.dtm <- dtm[!(seq(1, length(text.data)) %in% train.idx),]
test.classes <- class.labels[-train.idx]

Next we divide the entire data into training and testing with 70:30 split ratio.

weights <- ifelse(train.classes == 1, 2, 1)   # class weight 2 for spam, 1 for ham

train.dtm.Dmatrix <- xgboost::xgb.DMatrix(data = Matrix::sparseMatrix(i=train.dtm$i, j=train.dtm$j, x=train.dtm$v, dims=c(train.dtm$nrow, train.dtm$ncol)),
                                    label = train.classes, weight = weights)

model <- xgboost(data = train.dtm.Dmatrix, max.depth = 187, nround = 100, eta=0.1, objective = "binary:logistic", nthread = 3)

Next we build the XGBOOST model with the document term triplet matrix of the SMS contents. To use the triplet matrix for XGBOOST training, we first convert it to an "xgb.DMatrix" object. We also assign a weight to each example: examples belonging to class 1 (spam) get a weight of 2 whereas all others get a weight of 1.

Since this is a binary classification problem, we use the "binary:logistic" objective for XGBOOST. The max depth is chosen as sqrt(num. features); with the training data we have around 35K features overall, so the max depth is sqrt(35000) ≈ 187. We have chosen 100 boosting iterations and a learning rate (eta) of 0.1.

test.dtm.Dmatrix <- xgboost::xgb.DMatrix(data = Matrix::sparseMatrix(i=test.dtm$i, j=test.dtm$j, x=test.dtm$v, dims=c(test.dtm$nrow, test.dtm$ncol)))

pred <- predict(model, test.dtm.Dmatrix)

predictions <- as.numeric(pred > 0.5)

p <- table(predictions, test.classes)

precision <- p["1", "1"]/(p["1", "1"]+p["1", "0"])
recall <- p["1", "1"]/(p["1", "1"]+p["0", "1"])

print(paste(precision, recall))

predob = prediction(predictions, test.classes)
auc.perf = performance(predob, measure = "auc")
auc.perf@y.values[[1]]

Next we predict on the testing data using the model and compute the precision, recall and AUC values. With 3900 training examples and 1672 testing examples, we obtained a precision of 94.5% and a recall of 82.6% on the SPAM class. The AUC value obtained is 0.91.

Let us check the importance of each feature used by the XGBOOST model to create the trees, using the "xgb.importance" method :

> importance <- xgb.importance(feature_names = train.dtm$dimnames$Terms, model = model)
> head(importance)
                 Feature       Gain      Cover   Frequence
1: __ALL__CAPS__COUNTS__ 0.43155146 0.07424032 0.123764189
2:                  call 0.08532949 0.03312345 0.031307213
3:                  Call 0.03172337 0.03178389 0.028927133
4:                  text 0.02417417 0.02018733 0.020688393
5:                 claim 0.02279364 0.01928371 0.010252655
6:                  http 0.01596522 0.02067237 0.008604907

The "Gain" column gives the importance of each feature. It denotes how much improvement in accuracy the particular feature brings in if we split a node in a tree on this feature. As you can see that the "__ALL__CAPS__COUNTS__" is the most important feature used in classification. This confirms our hypothesis that most SMS spams contains high degree of fully capitalized words.

But remember that RBMs work well with binary vectors, hence we will convert our term-frequency simple triplet matrix of documents and terms into a binary valued matrix: an entry of 1 indicates that the document contains the term, else 0. This makes sense for the N-gram textual features extracted from the SMSes, but for the all-caps word count feature we need to break it down into M different features, where M is the number of unique all-caps word counts across the documents. For example, instead of the single feature "__ALL__CAPS__COUNTS__", we will have features like "CAPS__COUNT__0", "CAPS__COUNT__1", "CAPS__COUNT__21" and so on. The entry for "CAPS__COUNT__21" is 1 where the SMS has exactly 21 fully capitalized words, else 0.

The following code does that :

sapply(unique(allCapsCounts), function(x) {
  
  idx <- which(allCapsCounts == x)
  
  if (length(idx) > 0) {
    dtmCpp$i <<- c(dtmCpp$i, idx)
    dtmCpp$j <<- c(dtmCpp$j, rep(max(dtmCpp$j)+1, length(idx)))
    dtmCpp$v <<- c(dtmCpp$v, rep(x, length(idx)))
    dtmCpp$Terms <<- c(dtmCpp$Terms, paste(c("CAPS__COUNT", x), collapse = "__"))
  }
  
})

dtmCpp$v <- rep(1, length(dtmCpp$v))

As above, we again train the XGBOOST model, but with the modified (binary) document term matrix, and compute the performance metrics on the testing data.
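
A sketch of this retraining step, assuming the same train.idx split, class weights and XGBOOST settings are reused (it simply repeats the earlier code on the updated dtmCpp triplets) :

dtm <- slam::simple_triplet_matrix(i=dtmCpp$i, j=dtmCpp$j, v=dtmCpp$v, nrow=max(dtmCpp$i), ncol=max(dtmCpp$j),
                                   dimnames=list("Docs"=as.character(1:max(dtmCpp$i)), "Terms"=dtmCpp$Terms))

train.dtm <- dtm[train.idx,]
test.dtm <- dtm[!(seq(1, length(text.data)) %in% train.idx),]

train.dtm.Dmatrix <- xgboost::xgb.DMatrix(data = Matrix::sparseMatrix(i=train.dtm$i, j=train.dtm$j, x=train.dtm$v, dims=c(train.dtm$nrow, train.dtm$ncol)),
                                          label = train.classes, weight = weights)

model <- xgboost(data = train.dtm.Dmatrix, max.depth = 187, nround = 100, eta=0.1, objective = "binary:logistic", nthread = 3)

# prediction and evaluation on the test split then proceed exactly as before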

This time we obtained a precision of 93.8% and a recall of 85.7% on the SPAM class, and an AUC of 0.924. The recall and the AUC have improved over the earlier case.

Looking again at the feature importance for the trained XGBOOST model, we see that instead of "__ALL__CAPS__COUNTS__", the feature "CAPS__COUNT__0" is now at the top.

> importance <- xgb.importance(feature_names = train.dtm$dimnames$Terms, model = model)
> head(importance)
          Feature       Gain      Cover  Frequence
1: CAPS__COUNT__0 0.40945615 0.03748597 0.02000000
2:           call 0.08835322 0.03418186 0.02859813
3:           Call 0.03617390 0.03229482 0.02990654
4:          claim 0.02323864 0.01897817 0.01065421
5:           text 0.02211772 0.01882530 0.02018692
6:            Txt 0.01948285 0.02797493 0.02205607

This indicates that most ham SMSes are correctly classified when they do not contain any fully capitalized word.

Note : In problems with imbalanced datasets such as this one, instead of accuracy as a measure of performance, use of precision, recall, and the Area Under Curve (AUC) values is recommended.

Next, we use the "rbm.train" function to train the weights and biases of an RBM with the training features.

rbm <- rbm.train(x = as.matrix(train.dtm), hidden = 1000, cd = 5, numepochs = 10)

train.dtm.reduced <- rbm.up(rbm, as.matrix(train.dtm))
train.dtm <- slam::as.simple_triplet_matrix(train.dtm.reduced)

test.dtm.reduced <- rbm.up(rbm, as.matrix(test.dtm))
test.dtm <- slam::as.simple_triplet_matrix(test.dtm.reduced)

We are using the CD-5 algorithm (refer to the derivation of Contrastive Divergence above) with 1000 hidden units. After training the weights and biases on the training features, we transform each training and testing document into the hidden state space using the "rbm.up" method. All the other steps for building the model and doing prediction on the testing data remain the same as before.
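
Since "rbm.up" returns dense matrices of hidden unit probabilities, one direct way to feed them to XGBOOST (bypassing the triplet conversion above) is to pass them straight to "xgb.DMatrix"; here weights is the same per-example weight vector used earlier :

train.dtm.Dmatrix <- xgboost::xgb.DMatrix(data = train.dtm.reduced, label = train.classes, weight = weights)
test.dtm.Dmatrix <- xgboost::xgb.DMatrix(data = test.dtm.reduced)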

After transforming the 35K input features into 1000 hidden state features, we now train the XGBOOST model with a max depth of 32 (≈ sqrt(1000)). We see that the model converges on the training data in far fewer boosting rounds; instead of 100 rounds we do only 30 rounds of boosting.

model <- xgboost(data = train.dtm.Dmatrix, max.depth = 32, nround = 30, eta=0.1, objective = "binary:logistic", nthread = 3)

With the above model, we again did the prediction on the testing data as before. We obtained a precision of 88.2% and a recall of 81% on the SPAM class, and an AUC value of 0.896.

Although the precision, recall and AUC values are lower than when training the model with the full set (35K) of input features, the performance numbers are quite close, and by experimenting with different numbers of hidden units we can achieve comparable performance. The only drawback I faced is the computation time for training the RBM: it took several hours (>3 hrs) to train the RBM on a shared AWS EC2 instance.

Looking at the important features (from the original 35K features) used by XGBOOST, we see that mostly unigrams have been deemed important. Hence we can ignore the bigrams. This reduces the original input features from 35K to 10K. The performance numbers for recall (85.7%) and AUC (0.924) remain the same, while the precision (94.2%) on SPAM increases a bit.

Next I trained an RBM with 10K visible units corresponding to the reduced input features and 500 hidden units, using the CD-5 algorithm.

rbm <- rbm.train(x = as.matrix(train.dtm), hidden = 500, cd = 5, numepochs = 10)

The computation time reduced to a great extent with the reduced number of visible and hidden units. Not only that, the precision, recall and AUC values also increase: precision is 93.2% and recall increases to 83% on the SPAM class, and the AUC value increases to 0.91.

 

Categories: MACHINE LEARNING, PROBLEM SOLVING
