
Heard in the Cafeteria - ML Interview Questions - Part 2

Linear Algebra, PCA and SVD

  1. What are Symmetric Positive Definite Matrices ? Explain with an example and application.
    • Write code in python to generate a random SPD matrix
  2. How would you find the square root of a covariance matrix ?
  3. What is the purpose of PCA in machine learning ? How does it help in dimensionality reduction ?
  4. Given a document-term matrix, give a step-by-step process of how you would find the PCA of the matrix ?
    • How would you choose the final dimensionality of the transformed matrix ?
    • What if the number of examples is much less than the number of features ?
  5. Given a document-term matrix, how would you obtain the principal components using SVD technique ?
    • What is the space and time complexity of SVD and PCA ?
  6. Explain how SVD (or Latent Semantic Analysis) works in the case of semantic search ?
  7. You have 100 unknown variables and 200 linear equations to be solved. How can you solve for the unknowns ?
    • Why is Moore-Penrose PseudoInverse such an important technique ?
  8. How can you compute the PageRanks of 100 billion web-pages efficiently ?
  9. How to perform PCA on a very large sparse matrix ?
    • How is TruncatedSVD implemented in sklearn ?

 

Answers and Hints

Q.1

Let A be a Symmetric Positive Definite (SPD) matrix; then all the eigenvalues of A are positive, and for any non-zero vector x, x^TAx > 0. A covariance matrix is symmetric positive semi-definite, and positive definite when it has full rank.

Python code to generate a random SPD matrix:

import numpy as np

a = np.random.uniform(-1, 1, (7,4))
b = np.dot(a.T, a)   # A^T A is SPD when A has full column rank
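
A quick check of the two defining properties (a sketch; the shapes above are arbitrary):

eig_vals = np.linalg.eigvalsh(b)
assert np.all(eig_vals > 0)           # all eigenvalues are positive

x = np.random.uniform(-1, 1, (4, 1))  # a random (almost surely non-zero) vector
assert x.T.dot(b).dot(x).item() > 0   # x^T A x > 0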

Q.2

Let A be an NxM matrix (N variables as rows, M observations as columns), then the covariance matrix of A is given by:

C=\frac{1}{M-1}(A-\tilde{A})(A-\tilde{A})^T, where \tilde{A} is the matrix in which each row of A is replaced by its mean.

C can be decomposed into the form C=VDV^T where V is the matrix whose columns are the eigenvectors of C and D is a diagonal matrix whose diagonal entries are the eigenvalues of C.

A square root of C, i.e. a matrix Z such that ZZ^T=C, is Z=VD^{\frac{1}{2}}, since ZZ^T=VD^{\frac{1}{2}}D^{\frac{1}{2}}V^T=VDV^T=C. (The symmetric square root is VD^{\frac{1}{2}}V^T.)

a = np.random.uniform(-1, 1, (4,7))
b = np.cov(a)                           # 4x4 covariance matrix
x, y = np.linalg.eig(b)                 # eigenvalues x, eigenvectors y
z = np.dot(y, np.diag(np.sqrt(x)))      # Z = V D^(1/2)
assert np.allclose(b, np.dot(z, z.T))   # Z Z^T recovers the covariance
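
Alternatively, SciPy's principal (symmetric) square root can serve as a cross-check; it computes VD^{\frac{1}{2}}V^T rather than VD^{\frac{1}{2}}, and both satisfy ZZ^T=C (a sketch, assuming SciPy is available):

from scipy.linalg import sqrtm

s = sqrtm(b)
assert np.allclose(np.dot(s, s.T), b)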

Q.3

PCA helps in dimensionality reduction in the following ways:

  • It projects the data onto a new set of orthogonal axes (the principal components), ordered by the amount of variance they capture.
  • The first few components typically capture most of the variance, so the original D features can be replaced by k << D component scores with little loss of information.
  • The transformed features are uncorrelated, and dropping the low-variance components removes noise and redundancy, which reduces storage and computation and can help against overfitting.
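
A small illustration with scikit-learn (assumed to be installed); the explained-variance ratios show how much of the total variance each retained component keeps:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.uniform(0, 1, (500, 50))
pca = PCA(n_components=10)
X_k = pca.fit_transform(X)               # 500 x 10 reduced matrix
print(pca.explained_variance_ratio_)     # fraction of variance per component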

Q.4

Steps for doing PCA on a matrix X of dimensions NxD, with N documents and D features:

  • Zero-center each column X_j by doing X_j = X_j - mean(X_j).
  • Compute the feature-feature covariance matrix Y = \frac{1}{N-1}X^TX
  • Compute the eigenvalues \lambda and eigenvectors V of the matrix Y.
  • To select the k components with the highest variance, sort the eigenvectors (columns of V) in decreasing order of the eigenvalues and take the first k columns of V. Let us denote this by V_k.
  • The dimensionality-reduced matrix is X_k=XV_k
  • Python code for selecting the k components with the highest variances:
def compute_pca_comp(mat, num_components):
    # zero-center each column
    mat_c = mat-np.mean(mat, axis=0)
    # D x D feature-feature covariance matrix
    cov = np.cov(mat_c.T)
    eig_vals, eig_vecs = np.linalg.eig(cov)
    eig_vals = eig_vals.real
    eig_vecs = eig_vecs.real
    # sort by decreasing eigenvalue and keep the top k components
    h = np.argsort(eig_vals)[::-1]
    h1 = h[:min(num_components, len(h))]
    var = np.sum(eig_vals[h1])/float(np.sum(eig_vals))
    eig_vecs_k = eig_vecs[:,h1]
    return np.dot(mat_c, eig_vecs_k), var

The function also returns the fraction of the total variance captured by the k components.

With k = min(N, D), 100% of the variance is captured. Thus we should always choose k <= min(N, D).

Python code for selecting k components such that they capture at least 95% of the total variance:

def bsearch(g, k):
    # leftmost index i such that g[i] >= k
    left, right = 0, len(g)-1
    while left < right:
        mid = int((left+right)/2)
        if g[mid] < k:
            left = mid+1
        else:
            right = mid
    return left

def compute_pca_var(mat, var_thres=0.95):
    mat_c = mat-np.mean(mat, axis=0)
    cov = np.cov(mat_c.T)
    eig_vals, eig_vecs = np.linalg.eig(cov)
    eig_vals = eig_vals.real
    eig_vecs = eig_vecs.real
    
    h = np.argsort(eig_vals)[::-1]
    # cumulative fraction of variance captured by the top components
    cumsum = np.cumsum(eig_vals[h])/float(np.sum(eig_vals))
    # smallest k such that the first k components capture at least var_thres
    k = bsearch(cumsum, var_thres) + 1
    eig_vecs_k = eig_vecs[:, h[:k]]
    return np.dot(mat_c, eig_vecs_k), k

n, d, num_components = 500, 1000, 128
mat = np.random.uniform(0, 1, (n,d))
out1, var = compute_pca_comp(mat, num_components)
print(var)
out2, comp = compute_pca_var(mat, 0.95)
print(comp)

When the number of examples is much less than the number of features, it is faster to compute the eigenvalues and eigenvectors of:

Z=\frac{1}{N-1}XX^T

because when N << D, XX^T is only NxN whereas X^TX is DxD.

Let \Lambda be the diagonal matrix of the sorted (decreasing order) eigenvalues of Z and let U be the matrix of the corresponding eigenvectors.

Let \Lambda_k be the kxk sub-matrix (top-left) of \Lambda containing the first k eigenvalues and U_k be the first k columns of U corresponding to \Lambda_k. Then the PCA-reduced matrix is given as:

X_k=(N-1)^\frac{1}{2}U_k.\Lambda^{\frac{1}{2}}_k

The above equation can be derived from the SVD of the centered matrix, X=USV^T. Earlier we were computing XV, which equals US because XV=USV^TV=US. Since Z=\frac{1}{N-1}XX^T=\frac{1}{N-1}US^2U^T, the eigenvectors of Z are the columns of U and its eigenvalues are \Lambda=\frac{1}{N-1}S^2, i.e. S=((N-1)\Lambda)^{\frac{1}{2}}. Substituting S into US gives the formula above.
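
A quick numeric check that XV equals US for a column-centered matrix (a sketch with arbitrary shapes):

xc = np.random.uniform(0, 1, (5, 8))
xc = xc - np.mean(xc, axis=0)
u, s, vh = np.linalg.svd(xc, full_matrices=False)
assert np.allclose(np.dot(xc, vh.T), np.dot(u, np.diag(s)))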

For a matrix of size 50x3000, computing PCA with the earlier method takes 17 seconds whereas with the below code it takes only 0.004 seconds.

def compute_pca_comp_2(mat, num_components):
    mat_c = mat-np.mean(mat, axis=0)
    # N x N Gram matrix, much smaller than the D x D covariance when N << D
    x = np.dot(mat_c, mat_c.T)/float(mat_c.shape[0]-1)
    eig_vals, eig_vecs = np.linalg.eig(x)
    eig_vals = np.abs(eig_vals.real)
    eig_vecs = eig_vecs.real
    # singular values of the centered matrix
    sing_vals = np.sqrt(eig_vals*(mat_c.shape[0]-1))
    
    h = np.argsort(eig_vals)[::-1]
    h1 = h[:min(num_components, len(h))]
    # fraction of variance captured (from the eigenvalues, not the singular values)
    var = np.sum(eig_vals[h1])/float(np.sum(eig_vals))
    eig_vecs_k = eig_vecs[:,h1]
    
    # X_k = U_k S_k
    return np.dot(eig_vecs_k, np.diag(sing_vals[h1])), var
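
A rough way to reproduce the comparison (a sketch; exact timings depend on the machine and the BLAS build):

import time

mat = np.random.uniform(0, 1, (50, 3000))

start = time.time()
_ = compute_pca_comp(mat, 10)
print('covariance method:', time.time() - start)

start = time.time()
_ = compute_pca_comp_2(mat, 10)
print('Gram-matrix method:', time.time() - start)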

Q.5

SVD of matrix A is given by A=USV^T.

where the columns of U are the eigenvectors of AA^T, the rows of V^T are the eigenvectors of A^TA, and S is a diagonal matrix whose diagonal entries are the square roots of the (non-zero) eigenvalues of A^TA.

PCA with k components can be computed as A_k=AV_k, where A is the column-centered matrix and V_k is the first k columns of V.

Equivalently, PCA with k components can be computed as A_k=U_kS_k, where U_k is the first k columns of U and S_k is the top-left kxk sub-matrix of S.

a = np.random.uniform(0, 1, (4,7))
a = a - np.mean(a, axis=0)
u, s, vh = np.linalg.svd(a)
x1 = np.dot(u[:,:2], np.diag(s[:2]))
x2 = np.dot(a, vh.T[:,:2])
assert np.allclose(x1, x2)

The space complexity of SVD is O(ND) and the time complexity is O(min(N^2D, ND^2)).

https://arxiv.org/pdf/1906.12085.pdf

Q.6

The following steps can be used to build a semantic search engine (a minimal sketch in code follows the list):

  • Create a TF-IDF document-term matrix X from all the documents to be searched.
  • Compute the SVD decomposition of X, X=USV^T.
  • Choose the rank K to retain and compute X_K=U_KS_K, where U_K is the first K columns of U and S_K is the top-left KxK sub-matrix of S.
  • Given a query Q, create its TF-IDF vector Y.
  • Compute Y_K=YV_K, where V_K is the first K columns of V.
  • X_K is of dimensions NxK and Y_K is of dimensions 1xK.
  • The dot product Z=Y_KX^T_K gives a vector of size N where Z_i is the relevance of the i-th document to the query Q.
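
A minimal sketch of the above pipeline with scikit-learn (assumed installed); the documents and query are hypothetical placeholders, and TruncatedSVD is used to obtain U_KS_K and the projection by V_K:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat on the mat",
        "dogs and cats are pets",
        "stock prices fell sharply today"]
query = "pet cats"

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)                   # N x D TF-IDF matrix

svd = TruncatedSVD(n_components=2)                   # K = 2 for this toy corpus
X_k = svd.fit_transform(X)                           # N x K, i.e. U_K S_K
Y_k = svd.transform(vectorizer.transform([query]))   # 1 x K, i.e. Y V_K

scores = Y_k.dot(X_k.T).ravel()                      # relevance of each document
print(np.argsort(scores)[::-1])                      # documents ranked by relevance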

Q.7

This is an example of an overdetermined system: there are more equations (examples) than unknowns (features). Such a scenario occurs in linear regression problems where we have lots of data but only a few features.

Let the feature matrix be A, the vector of unknowns be X and the regression labels be Y. A is of shape NxD, X is of shape Dx1 and Y is of shape Nx1.

Then we need to solve for X from AX = Y

X = A^{-1}Y

But the matrix A is not square and hence not invertible.

In such cases we find the pseudo-inverse A^{+} such that A^{+}A \approx I_D, where I_D is the DxD identity matrix.

We find A+ by using SVD decomposition of A, where A=USV^T

A^{+}=VS^{+}U^T

where S^{+} is the DxN matrix obtained by transposing S and replacing each non-zero singular value by its reciprocal.

Assuming A has full column rank (all D singular values non-zero), S^{+}S=I_D, so A^{+}A=VS^{+}U^TUSV^T=VS^{+}SV^T=VV^T=I

Python code for finding pseudo-inverse:

a = np.random.uniform(0, 1, (7,5))
b = np.random.uniform(0, 1, (7,1))
u, s, vh = np.linalg.svd(a)
# S+ : the 5x7 matrix with the reciprocals of the singular values on its diagonal
s_plus = np.zeros((a.shape[0], a.shape[1])).T
s_plus[:s.shape[0], :s.shape[0]] = np.linalg.inv(np.diag(s))

a_plus = vh.T.dot(s_plus).dot(u.T)   # A+ = V S+ U^T
x = np.dot(a_plus, b)                # least-squares solution of AX = Y
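
The result can be checked against NumPy's built-in pseudo-inverse:

assert np.allclose(a_plus, np.linalg.pinv(a))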

The pseudo-inverse can also be found without SVD using the equation A^{+}=(A^TA)^{-1}A^T, which is valid when A^TA is invertible, i.e. when A has full column rank.

np.linalg.inv(a.T.dot(a)).dot(a.T)

This is the normal-equations solution commonly used in linear regression.
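
In practice the same least-squares solution can be obtained with np.linalg.lstsq, which avoids explicitly forming and inverting A^TA (a sketch reusing a and b from above):

x_ls = np.linalg.lstsq(a, b, rcond=None)[0]
assert np.allclose(x, x_ls)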

Q.8

Generate a random graph using the Erdos-Renyi model, create the adjacency matrix for the graph and then compute the PageRank of each node using the following code:

Create Random Graph

from networkx.generators.random_graphs import erdos_renyi_graph
import networkx as nx
import matplotlib.pyplot as plt 
n = 10
p = 0.2
g = erdos_renyi_graph(n, p, directed=True)
nx.draw(g, with_labels = True)
plt.show()

Random Directed Graph generated using Erdos-Renyi Model (10, 0.2)

Create adjacency matrix:

adjacency_matrix = np.zeros((n, n))
for x, y in g.edges:
    # the graph is directed, so only mark the edge x -> y
    adjacency_matrix[x][y] = 1

Computing PageRank vector:

num_iter = 500
d = 0.85   # damping factor

out_degrees = np.sum(adjacency_matrix, axis=1)
# row-normalize by out-degree (pinv handles nodes with zero out-degree),
# then transpose to get the column-stochastic transition matrix
m = np.linalg.pinv(np.diag(out_degrees)).dot(adjacency_matrix).T
v = np.random.rand(n, 1)
v = v / np.linalg.norm(v, 1)

for i in range(num_iter):
    v = d*m.dot(v)+(1.0-d)/float(n)

print(v)

Sample output (the exact values depend on the random graph and the random initialization):
array([[0.08486854],
       [0.03670438],
       [0.19418566],
       [0.09610343],
       [0.03059936],
       [0.02539957],
       [0.06926918],
       [0.05626445],
       [0.03670438],
       [0.05106913]])

The PageRank for node 2 is the maximum because there are many incoming edges to node 2, so it is deemed important, whereas the PageRank for node 5 is the minimum because it has only one incoming edge and also only one outgoing edge.

Node 7 also has only one incoming edge, but that edge comes from node 2, which is very important, and there are no outgoing edges from node 7; thus node 7 is ranked higher than node 5.

When the graph is very large, say 100 billion nodes, instead of dense matrix operations we can use memory-efficient adjacency maps, because the graph would be very sparse and holding the entire matrix, mostly zeros, in memory would be wasteful.

adjacency_matrix = {}
for x, y in g.edges:
    if x not in adjacency_matrix:
        adjacency_matrix[x] = []
    adjacency_matrix[x].append(y)

num_iter = 500
d = 0.85

out_degrees = {}
for x, adj_lst in adjacency_matrix.items():
    out_degrees[x] = len(adj_lst)

v = np.random.rand(n)
v = v / np.linalg.norm(v, 1)

for i in range(num_iter):
    u = np.zeros(n)
    # each node x distributes its current rank equally among its out-links
    for x, adj_list in adjacency_matrix.items():
        for y in adj_list:
            u[y] += d*v[x]/float(out_degrees[x])
    
    # damping term
    for j in range(n):
        u[j] += (1.0-d)/float(n)
    
    v = np.copy(u)

The nested for-loops can be parallelized as a map-reduce operation. This gives the same result as the matrix version above, and although it is slower to compute, it needs far less memory when the graph is very sparse, as the hyperlink graph of the web is.
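
For graphs that still fit on a single machine, SciPy's sparse matrices offer a middle ground (a sketch, assuming SciPy is available; at 100-billion-node scale the same iteration would have to be sharded across machines):

from scipy.sparse import csr_matrix

rows, cols = zip(*g.edges)
adj = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))

out_degrees = np.asarray(adj.sum(axis=1)).ravel()
inv_deg = np.divide(1.0, out_degrees, out=np.zeros(n), where=out_degrees > 0)

v = np.ones(n) / n
for _ in range(num_iter):
    # each node sends rank/out_degree along its out-links, plus the damping term
    v = d * adj.T.dot(v * inv_deg) + (1.0 - d) / n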

Q.9

Use TruncatedSVD for PCA on a very large sparse matrix, i.e. a randomized, truncated SVD instead of the full SVD. Unlike PCA, TruncatedSVD does not center the data, which keeps the matrix sparse. In sklearn it is implemented with either the randomized SVD algorithm of Halko et al. (algorithm='randomized', the default) or ARPACK (algorithm='arpack', which calls scipy.sparse.linalg.svds). A sketch is shown below.
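
A minimal sketch (assumes scikit-learn and SciPy are installed; the matrix size and density are illustrative only):

import numpy as np
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD

X = sp.random(100000, 50000, density=1e-4, format='csr', random_state=0)

svd = TruncatedSVD(n_components=100, algorithm='randomized', random_state=0)
X_k = svd.fit_transform(X)                   # 100000 x 100 reduced matrix
print(svd.explained_variance_ratio_.sum())   # variance retained by the 100 components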
