Matrix operations to the rescue in R

One of the main drawbacks of R is the inefficiency of its looping operations. Since R is inherently a functional programming language, many looping operations can be converted into map operations by choosing the appropriate functional forms. Although such mapping operations speed up the program, we sometimes need still better speedups (compared to similar programs written in C or C++). In such cases, we will see that by cleverly transforming an iterative problem into a sequence of matrix (or vector) operations, we can achieve performance comparable to a C program.

Consider the problem where, given a vector of size N, our goal is to compute the sum of squares of its elements. In the normal procedural approach, one would write the following R code to do the operation:

procedural.sum.squares <- function(vec) {
  sum <- 0
  for (x in vec) sum <- sum + x^2
  sum
}

Let's generate a random vector of 1 million real numbers between 0 and 1.

x <- sample(1:1000, 1000000, replace = T)/1000

On calling the above defined function with the input vector 'x' and timing the entire operation:

system.time(procedural.sum.squares(x))
   user  system elapsed 
  0.371   0.001   0.372

In the above output, we take the "elapsed" time as the running time of the program, i.e. 0.372 seconds. Now let's replace the above function with the following one:

functional.sum.squares <- function(vec) sum(vec^2)

As above, we estimate the time taken to compute the sum of squares with the input 'x':

system.time(functional.sum.squares(x))
   user  system elapsed 
  0.005   0.001   0.006

The elapsed time in this case is only 0.006 seconds. This implementation takes advantage of the functional, vectorized nature of R. The operation '^2' applied to a vector 'vec' can be thought of as applying a recursive function:

vec.squared <- function(x) if (length(x) == 1) x[1]*x[1] else c(x[1]*x[1], vec.squared(x[2:length(x)]))

In yet another approach to the sum of squares problem, we can compute the sum of squares through matrix multiplication. Consider a matrix X with dimensions Nx1; then the value obtained from the following matrix operation:

X^TX

gives the same value as the sum of squares of the elements in the vector X.

matrix.sum.squares <- function(vec) t(vec)%*%vec
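As a quick sanity check, all three implementations should return the same value on the input 'x' (note that t(vec)%*%vec yields a 1x1 matrix, so we coerce it to a plain number before comparing):

# The three implementations should agree up to floating point tolerance
all.equal(procedural.sum.squares(x), functional.sum.squares(x))
all.equal(functional.sum.squares(x), as.numeric(matrix.sum.squares(x)))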

Although we have succeeded in transforming an iterative approach for finding the sum of squares into a matrix operation, 'matrix.sum.squares' does not provide any significant improvement over the 'functional.sum.squares' approach for this particular example. Next we will look into 3 different problems where transforming the problem into matrix operations offers significant improvements in run time.

Problem 1: In linear regression problems, the model for the input data (X_i, y_i) is computed as:

h_{\theta}(X_i) = \frac{1}{2}\sum_{j=1}^D{\theta_j}x^{(i)}_j

where \theta_j are the unknown parameters to be estimated, x^{(i)}_j are the components of the i-th example and D is the dimension of each example. Given that there are N examples X_i, each of dimension D, we have an NxD dimensional matrix X. \theta depends only on the feature component j, i.e. \theta_j is the same for all examples. Given an estimate of \theta, the predicted output h_{\theta}(X_i) is computed for each example X_i.

In the iterative approach, we take the dot product of the vector \theta with the input vector X_i:

model <- function(theta, input) sapply(1:nrow(input), function(i) 0.5*sum(theta*input[i,]))

Note that the formulation of h_{\theta}(X_i) easily allows us to convert the above per-example iterative approach into a compact matrix representation:

h_{\theta}(X)=\frac{1}{2}X{\theta}

Note that the above formulation computes the values in a batch fashion. The corresponding code in R:

model <- function(theta, input) 0.5*input%*%theta

To analyze how the matrix batch operation performs compared to the corresponding iterative approach, we benchmark both methods, keeping the number of examples N constant and varying only the number of dimensions D. We randomly sample the input matrix X and the vector \theta. With a constant N=100 examples and the number of features D varying from 1000 to 100000 in steps of 100, we observe that the running time of the iterative approach increases linearly with D, whereas the running time of the matrix based approach remains almost constant throughout.
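A rough sketch of how such a benchmark could be set up is shown below (the function names and the grid of D values are illustrative, and a coarser grid is used to keep the sketch fast; this is not the exact script used to produce the figure below):

# Benchmark sketch: iterative vs. matrix batch model, N = 100 examples, varying D
model.iterative <- function(theta, input) sapply(1:nrow(input), function(i) 0.5*sum(theta*input[i,]))
model.batch <- function(theta, input) 0.5*input%*%theta

benchmark.models <- function(N = 100, dims = seq(1000, 100000, by = 10000)) {
  sapply(dims, function(D) {
    # Random input matrix X (N x D) and parameter vector theta (length D)
    input <- matrix(runif(N*D), nrow = N, ncol = D)
    theta <- runif(D)
    c(iterative = system.time(model.iterative(theta, input))[["elapsed"]],
      batch = system.time(model.batch(theta, input))[["elapsed"]])
  })
}

benchmark.models()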

[Figure: Matrix Batch Processing]

The red line shows the evolution of the run time with the number of features for the iterative approach, whereas the green line is the corresponding plot for the matrix batch operation.

Problem 2: In a typical supervised text classification problem, we usually prepare the input data as a document term matrix X, where the rows represent the documents and the columns represent the set of features (N-grams) extracted from these documents. Assuming a binary matrix, the entry X_{ij}=1 if document 'i' contains feature 'j', else X_{ij}=0. Each document is assigned to one of C classes. We want to create a class term matrix Y, where the entry Y_{ij}=1 if class 'i' contains feature 'j' and 0 otherwise, using the already created document term matrix X.

In the iterative approach, we iterate over each column of X; if X_{ij}=1, then Y_{C(i), j}=1, where the function C(i) maps document 'i' to its corresponding class. In R, we can use a vector 'classes' to map each document to its class label.

class.term.matrix <- function(dtm, classes) {
  ctm <- matrix(0, nrow=length(unique(classes)), ncol=ncol(dtm))
  
  sapply(1:ncol(dtm), function(i) {
    docs <- which(dtm[,i] == 1)
    cl <- classes[docs]
    ctm[cl,i] <<- 1 
  })
  
  ctm
}

In the above R code, for each column 'j' of the document term matrix 'dtm', we extract the rows that have an entry of 1 in that column. These rows represent the documents containing the feature represented by column 'j'. Using the vector 'classes', we set column 'j' of the class term matrix 'ctm' to 1 for all the corresponding classes, and so on. Assuming that we have N=1000 documents and D=20,000 features, the matrix 'dtm' is of size 1000 x 20,000. We have 10 classes and each class has 100 documents. On running the above program with this input, we get a run time of 0.973 seconds.
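A minimal sketch of how such a test input could be generated (the sparsity level is an assumption; the actual data used for the timing above may differ):

# 1000 documents, 20,000 features, 10 classes with 100 documents each
N <- 1000; D <- 20000
dtm <- matrix(sample(c(0, 1), N*D, replace = TRUE, prob = c(0.95, 0.05)), nrow = N, ncol = D)
classes <- rep(1:10, each = 100)

system.time(ctm <- class.term.matrix(dtm, classes))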

Can we do better?

Matrix operations come to our rescue. Note that given a matrix X of dimensions NxD and another matrix Z of dimensions NxC, Z^TX is a valid matrix operation and yields a resultant matrix of dimensions CxD. We use this fact to transform the above iterative approach into a sequence of matrix operations.

We already have the document term matrix X of dimensions NxD (rows correspond to documents and columns to features); we need a document class matrix Z of dimensions NxC (rows correspond to documents and columns to classes). The question is: how can we transform the vector 'classes' into the document class matrix?

Let the vector 'classes' be equal to {1,1,1,1,2,2,3,3,3}, i.e. the first 4 documents belong to class '1', the next two to class '2' and the last 3 documents to class '3'. We have 3 classes {1,2,3}.

Let's denote

V_1=\left\{1,1,1,1,2,2,3,3,3\right\} and

V_2=\left\{1,2,3\right\}.

Can we somehow define an operation 'op' such that taking an outer product of V_1 and V_2 gives back a 9x3 matrix Z with Z_{11}=1, Z_{21}=1, Z_{52}=1, Z_{53}=0, ... and so on, i.e.

Z=V_1\;\text{op}\;V_2 such that Z_{ij}=1 if V_1[i]=V_2[j] else Z_{ij}=0

In R, there is a very nice function called outer, which computes a custom outer product of arrays. What we need is the following transformation:

outer(v1, v2, function(x,y) ifelse(x==y, 1, 0))
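For the toy example above, with v1 = V_1 and v2 = V_2, this produces exactly the 9x3 matrix we were after:

v1 <- c(1, 1, 1, 1, 2, 2, 3, 3, 3)
v2 <- c(1, 2, 3)
outer(v1, v2, function(x, y) ifelse(x == y, 1, 0))

      [,1] [,2] [,3]
 [1,]    1    0    0
 [2,]    1    0    0
 [3,]    1    0    0
 [4,]    1    0    0
 [5,]    0    1    0
 [6,]    0    1    0
 [7,]    0    0    1
 [8,]    0    0    1
 [9,]    0    0    1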

The above code returns the document class matrix Z, of dimensions NxC. Once we obtain the document class matrix Z, our goal is to obtain the desired class term matrix Y using the matrices X and Z. If we compute Y=Z^TX, what we get is a class term matrix in which the value Y_{ij} counts the number of documents belonging to class 'i' in which feature 'j' occurs. We can either keep this representation or, if a binary representation is required, replace all Y_{ij}{\ge}1 with 1.

class.term.matrix2 <- function(dtm, classes) {
  w <- outer(classes, unique(classes), function(x, y) ifelse(x==y, 1, 0))
  ctm <- t(w)%*%dtm
  ctm[ctm >= 1] <- 1
  
  ctm
}

On running the matrix variant of the class term matrix code, we get a runtime of only 0.096 seconds, almost 10 times faster than the iterative approach. We did a benchmark analysis with the two methods and found trends similar to the outcome of Problem 1 above.

[Figure: Matrix Batch Operation Class Term]

The red line shows the evolution of the running time for the iterative approach, whereas the green line shows the running time for the sequence-of-matrix-operations approach. As can clearly be seen, the matrix operations perform really well compared to the iterative approach.

Problem 3: In the Hidden Markov Model, given the set of possible observations \left\{Q\right\}, the set of possible hidden states \left\{S\right\}, the transition probability matrix T(S_i, S_j) and the emission probability matrix E(S_i, Q_j), our goal is to compute the probability of a sequence of observations.

Given the N observations \left\{q_1, q_2, ..., q_N\right\}, define the quantity F(i, S_j), which represents the probability that the i-th observation in the sequence is generated by the hidden state S_j.

Using the Forward Algorithm, we have:

F(1, S_j)=T(S_0, S_j)*E(S_j, q_1)

F(i, S_j)=\sum_{k=1}^{|S|}F(i-1, S_k)*T(S_k, S_j)*E(S_j, q_i) for i = 2 to N

The probability of the sequence of observations is

\sum_{k=1}^{|S|}F(N, S_k)*T(S_k, S_E)

where S_0 is the starting hidden state and S_E is the ending hidden state. To compute the probabilities of each state for each observation, we maintain two vectors in R: 'curr.state.probs' stores the state probabilities for the current observation, while 'last.state.probs' stores the state probabilities for the previous observation.

state.probs <- function(transition.probs, emission.probs, observations, states) {
  curr.state.probs <- sapply(states, function(state) emission.probs[state, observations[1]])
  curr.state.probs <- curr.state.probs/sum(curr.state.probs)
  
  last.state.probs <- curr.state.probs

  for (i in 2:length(observations)) {
    curr.state.probs <- sapply(states, function(state1) 
      sum(sapply(states, function(state2) last.state.probs[state2]*transition.probs[state2, state1]))*emission.probs[state1, observations[i]])
    
    curr.state.probs <- curr.state.probs/sum(curr.state.probs)
    
    last.state.probs <- curr.state.probs
  }
  
  curr.state.probs
}

We can obtain the same results using matrix operations instead of the iterative approach to compute 'curr.state.probs'. Mathematically, the above code is the same as the following expression:

F_i=(F_{i-1}T).Z_i

where '.' represents the element-wise (component-wise) product, F_i represents the 'curr.state.probs' vector and F_{i-1} represents the 'last.state.probs' vector. T is the transition matrix and Z_i is a vector such that Z^j_i=E_{j, q_i}, i.e. the emission probability of state S_j generating the observation q_i; in other words, Z_i is the column of the emission matrix corresponding to the observation q_i.

state.probs2 <- function(transition.probs, emission.probs, observations, states) {
  curr.state.probs <- sapply(states, function(state) emission.probs[state, observations[1]])
  curr.state.probs <- curr.state.probs/sum(curr.state.probs)
  
  last.state.probs <- curr.state.probs
  
  for (i in 2:length(observations)) {
    curr.state.probs <- (last.state.probs%*%transition.probs)*emission.probs[,observations[i]]
    curr.state.probs <- curr.state.probs/sum(curr.state.probs)
    
    last.state.probs <- curr.state.probs
  }
  
  curr.state.probs
}

For the benchmark analysis, we randomly sampled data to create the transition probability and emission probability matrices, as well as the sequence of observations. We chose 100 hidden states and 200 possible observations. For an observed sequence of length 50, the iterative approach 'state.probs' completed in about 2.342 seconds, whereas the matrix operations approach 'state.probs2' completed in only 0.003 seconds. A sketch of the setup is shown below, followed by a comparison of the running times for both methods as the observation length varies from 10 to 100 in steps of 10.
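A minimal sketch of how the random inputs could be constructed (variable names and the exact sampling scheme are assumptions; any row-normalized random probability matrices would do):

# 100 hidden states, 200 possible observations
n.states <- 100; n.symbols <- 200
states <- 1:n.states

# Row-normalized random transition and emission probability matrices
transition.probs <- matrix(runif(n.states*n.states), nrow = n.states)
transition.probs <- transition.probs/rowSums(transition.probs)

emission.probs <- matrix(runif(n.states*n.symbols), nrow = n.states)
emission.probs <- emission.probs/rowSums(emission.probs)

# A random observation sequence of length 50
observations <- sample(1:n.symbols, 50, replace = TRUE)

system.time(state.probs(transition.probs, emission.probs, observations, states))
system.time(state.probs2(transition.probs, emission.probs, observations, states))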

   obs.length    t1    t2
1          10 0.413 0.001
2          20 0.909 0.000
3          30 1.307 0.001
4          40 1.825 0.001
5          50 2.215 0.002
6          60 2.660 0.002
7          70 3.149 0.002
8          80 3.553 0.002
9          90 4.101 0.003
10        100 4.571 0.002

The t1 column gives the running times (in seconds) for the iterative approach, whereas the t2 column gives the running times for the matrix based solution. As can clearly be seen, the matrix based solution is far faster than the iterative approach.

We have demonstrated only 3 examples where clever use of linear algebra leads to much faster solutions compared to the normal procedural approach. Many slow iterative or procedural computations in R can be converted into faster solutions using similar linear algebra tricks.

Categories: MACHINE LEARNING, PROBLEM SOLVING
