Linear Algebra, PCA and SVD

- What are Symmetric Positive Definite (SPD) matrices? Explain with an example and an application.
- Write code in Python to generate a random SPD matrix.

- How would you find the square root of a covariance matrix?
- What is the purpose of PCA in machine learning? How does it help in dimensionality reduction?
- Given a document-term matrix, what are the steps to find the PCA of the matrix?
- How would you choose the final dimensionality of the transformed matrix?
- What if the number of examples is much less than the number of features?

- Given a document-term matrix, how would you obtain the principal components using SVD?
- What are the space and time complexity of SVD and PCA?

- Explain how SVD (or Latent Semantic Analysis) works in the case of semantic search.
- You have 100 unknown variables and 200 linear equations. How can you solve for the unknowns?
- Why is the Moore-Penrose pseudo-inverse such an important technique?

- How can you compute the PageRanks of 100 billion web-pages efficiently?
- How would you perform PCA on a very large sparse matrix?
- How is TruncatedSVD implemented in sklearn?

Answers and Hints

Q.1

Let A be a Symmetric Positive Definite (SPD) matrix. Then all the eigenvalues of A are positive, and for any non-zero vector x, x^{T}Ax > 0. A covariance matrix is a common example: it is always positive semi-definite, and positive definite when no feature is a linear combination of the others.

Python code to generate a random SPD matrix:

```python
import numpy as np

# For any real matrix a with linearly independent columns, a.T @ a is SPD
a = np.random.uniform(-1, 1, (7, 4))
b = np.dot(a.T, a)
```
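As a quick check (a minimal sketch, not part of the original recipe), positive definiteness can be verified by confirming that all eigenvalues are positive, or by attempting a Cholesky factorization, which exists only for positive definite matrices:

```python
import numpy as np

a = np.random.uniform(-1, 1, (7, 4))
b = np.dot(a.T, a)

# All eigenvalues of an SPD matrix are strictly positive
eig_vals = np.linalg.eigvalsh(b)
assert np.all(eig_vals > 0)

# Cholesky factorization b = L L^T exists only for positive definite matrices
L = np.linalg.cholesky(b)
assert np.allclose(np.dot(L, L.T), b)
```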

Q.2

Let A be an NxM matrix where each of the N rows is a variable and each of the M columns is an observation. The covariance matrix of A is given by:

C = (A - mu)(A - mu)^{T}/(M - 1), where mu is the vector of row means of A.

C can be decomposed into the form C = VDV^{T}, where V is the matrix whose columns are the eigenvectors of C and D is a diagonal matrix whose diagonal entries are the eigenvalues of C.

The square root of C is then C^{1/2} = VD^{1/2}V^{T}, since C^{1/2}C^{1/2} = VD^{1/2}V^{T}VD^{1/2}V^{T} = VDV^{T} = C. The code below checks the related factorization Z = VD^{1/2}, for which ZZ^{T} = C:

```python
import numpy as np

a = np.random.uniform(-1, 1, (4, 7))
b = np.cov(a)                       # 4x4 covariance matrix (rows are variables)
x, y = np.linalg.eig(b)             # eigenvalues x, eigenvectors y (as columns)
z = np.dot(y, np.diag(np.sqrt(x)))  # Z = V D^{1/2}
assert np.allclose(b, np.dot(z, z.T))
```

Q.3

PCA helps in dimensionality reduction in the following ways:

- Helps remove correlated and linearly dependent features.
- The final features are linear combinations of the existing features.
- Only the K features with the highest variances are kept; the eigenvalues of the covariance matrix are used to select them.
- http://pages.cs.wisc.edu/~jerryzhu/cs540/handouts/PCA.pdf
- http://infolab.stanford.edu/~ullman/mmds/ch11.pdf
- https://medium.com/@jonathan_hui/machine-learning-singular-value-decomposition-svd-principal-component-analysis-pca-1d45e885e491

Q.4

Steps for doing PCA on a matrix X of dimensions NxD, where N is the number of documents and D the number of features:

- Zero-center each column X_{j} by doing X_{j} = X_{j} - mean(X_{j}).
- Compute the feature-feature covariance matrix Y = X^{T}X/(N-1), of dimensions DxD.
- Compute the eigenvalues and eigenvectors V of the matrix Y.
- To select the k components with the highest variance, sort the eigenvectors (columns of V) in decreasing order of their eigenvalues and take the first k columns of V. Let us denote this Dxk matrix by V_{k}.
- The dimensionality-reduced matrix is XV_{k}, of dimensions Nxk.
- Python code for selecting the k components with the highest variances:

```python
import numpy as np

def compute_pca_comp(mat, num_components):
    mat_c = mat - np.mean(mat, axis=0)  # zero-center each column
    cov = np.cov(mat_c.T)               # DxD feature-feature covariance
    eig_vals, eig_vecs = np.linalg.eig(cov)
    eig_vals = eig_vals.real
    eig_vecs = eig_vecs.real
    h = np.argsort(eig_vals)[::-1]      # indices sorted by decreasing eigenvalue
    k = min(num_components, len(h))
    var = np.sum(eig_vals[h[:k]]) / float(np.sum(eig_vals))
    eig_vecs_k = eig_vecs[:, h[:k]]
    return np.dot(mat_c, eig_vecs_k), var
```

It also returns the fraction of the variance captured by the k components.

With k = min(N, D), 100% of the variance is captured. Thus we should always choose k <= min(N, D).

Python code for selecting k components that capture at least 95% of the total variance:

```python
def bsearch(g, k):
    # Smallest index i such that g[i] >= k (g is sorted in increasing order)
    left, right = 0, len(g) - 1
    while left < right:
        mid = (left + right) // 2
        if g[mid] < k:
            left = mid + 1
        else:
            right = mid
    return left

def compute_pca_var(mat, var_thres=0.95):
    mat_c = mat - np.mean(mat, axis=0)
    cov = np.cov(mat_c.T)
    eig_vals, eig_vecs = np.linalg.eig(cov)
    eig_vals = eig_vals.real
    eig_vecs = eig_vecs.real
    h = np.argsort(eig_vals)[::-1]
    cumsum = np.cumsum(eig_vals[h]) / float(np.sum(eig_vals))
    # bsearch returns an index, so k components means index + 1
    k = bsearch(cumsum, var_thres) + 1
    eig_vecs_k = eig_vecs[:, h[:k]]
    return np.dot(mat_c, eig_vecs_k), k

n, d, num_components = 500, 1000, 128
mat = np.random.uniform(0, 1, (n, d))
out1, var = compute_pca_comp(mat, num_components)
print(var)
out2, comp = compute_pca_var(mat, 0.95)
print(comp)
```

When the number of examples is much less than the number of features (N << D), it is faster to compute the eigenvalues and eigenvectors of:

Z = XX^{T}/(N-1)

because Z is NxN whereas the covariance matrix X^{T}X/(N-1) is DxD, and N << D.

Let S be the diagonal matrix of the sorted (decreasing order) eigenvalues of Z, and let U be the matrix of the corresponding eigenvectors.

Let S_{k} be the kxk top-left sub-matrix of S containing the first k eigenvalues, and U_{k} be the first k columns of U corresponding to those eigenvalues. Then the PCA-reduced matrix is given as:

X_{k} = U_{k}((N-1)S_{k})^{1/2}

The above equation can be derived from the SVD equation X = U&Sigma;V^{T}, where &Sigma; is the diagonal matrix of singular values sigma_{i} = ((N-1)lambda_{i})^{1/2}, with lambda_{i} the eigenvalues of Z.

Earlier we were computing XV_{k}, which is equal to U_{k}&Sigma;_{k}, because XV = U&Sigma; and V^{T}V = I.

For a matrix of size 50x3000, computing PCA with the earlier method takes 17 seconds whereas with the below code it takes only 0.004 seconds.

```python
def compute_pca_comp_2(mat, num_components):
    mat_c = mat - np.mean(mat, axis=0)
    # NxN matrix Z = X X^T / (N-1); much smaller than the DxD covariance when N << D
    z = np.dot(mat_c, mat_c.T) / float(mat_c.shape[0] - 1)
    eig_vals, eig_vecs = np.linalg.eig(z)
    eig_vals = np.abs(eig_vals.real)
    eig_vecs = eig_vecs.real
    h = np.argsort(eig_vals)[::-1]
    h1 = h[:min(num_components, len(h))]
    # Variance fraction uses the eigenvalues of Z (the variances), as before
    var = np.sum(eig_vals[h1]) / float(np.sum(eig_vals))
    # Singular values of X are sqrt((N-1) * eigenvalues of Z)
    sing_vals = np.sqrt(eig_vals * (mat_c.shape[0] - 1))
    eig_vecs_k = eig_vecs[:, h1]
    return np.dot(eig_vecs_k, np.diag(sing_vals[h1])), var
```

- https://stats.stackexchange.com/questions/147880/is-pca-still-done-via-the-eigendecomposition-of-the-covariance-matrix-when-dimen
- http://pages.cs.wisc.edu/~jerryzhu/cs540/handouts/PCA.pdf
- http://infolab.stanford.edu/~ullman/mmds/ch11.pdf
- http://theory.stanford.edu/~tim/s15/l/l9.pdf

Q.5

The SVD of a matrix A is given by A = USV^{T},

where the columns of U are the eigenvectors of AA^{T}, the rows of V^{T} are the eigenvectors of A^{T}A, and S is a diagonal matrix whose diagonal entries are the square roots of the eigenvalues of A^{T}A (the singular values).

PCA with k components can be computed as AV_{k}, where V_{k} is the first k columns of the matrix V.

Equivalently, PCA with k components can be computed as U_{k}S_{k}, where U_{k} is the first k columns of the matrix U and S_{k} is the top-left kxk sub-matrix of S.

```python
a = np.random.uniform(0, 1, (4, 7))
a = a - np.mean(a, axis=0)
u, s, vh = np.linalg.svd(a)
x1 = np.dot(u[:, :2], np.diag(s[:2]))  # U_k S_k
x2 = np.dot(a, vh.T[:, :2])            # A V_k
assert np.allclose(x1, x2)
```

The space complexity of SVD is O(ND) and the time complexity is O(min(N^{2}D, ND^{2})).

https://arxiv.org/pdf/1906.12085.pdf

Q.6

The following steps can be followed to develop a semantic search engine:

- Create a TF-IDF document-term matrix X from all the documents to be searched.
- Compute the SVD decomposition X = USV^{T}.
- Choose the rank K of the final matrix to be retained, then compute X_{K} = U_{K}S_{K}, where U_{K} is the first K columns of U and S_{K} is the top-left KxK sub-matrix of S.
- Given a query Q, create its TF-IDF vector Y.
- Compute Y_{K} = YV_{K}, where V_{K} is the first K columns of the matrix V.
- X_{K} is of dimensions NxK and Y_{K} is of dimensions 1xK.
- The dot product X_{K}Y_{K}^{T} gives a vector of size N whose i-th entry is the relevance of the i-th document to the query Q.
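The steps above can be sketched with scikit-learn, which exposes this pipeline directly (a minimal illustration; the toy corpus, query, and K = 2 are made-up placeholders):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Toy corpus and query -- placeholder data for illustration only
docs = [
    "the cat sat on the mat",
    "a cat and a dog play in the yard",
    "stocks fell as the market dropped",
    "the market rallied and stocks rose",
]
query = ["cat dog"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)                 # N x D document-term matrix

svd = TruncatedSVD(n_components=2)                 # K = 2
X_k = svd.fit_transform(X)                         # N x K  (= U_K S_K)
Y_k = svd.transform(vectorizer.transform(query))   # 1 x K  (= Y V_K)

# Relevance of each document to the query
scores = X_k.dot(Y_k.T).ravel()
print(np.argsort(scores)[::-1])                    # documents ranked by relevance
```

Note that `TruncatedSVD.fit_transform` returns U_{K}S_{K} and `transform` computes the projection onto V_{K}, matching the steps above.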

Q.7

This is an example of an overdetermined system, where the number of equations (examples) is greater than the number of unknowns (features). Such a scenario occurs in linear regression problems where we have lots of data but only a few features.

Let the feature matrix be A, the unknowns X, and the regression labels Y, where A is of shape NxD, X is of shape Dx1, and Y is of shape Nx1.

Then we need to solve for X in AX = Y.

But the matrix A is not square and hence not invertible.

In such cases we find the pseudo-inverse A^{+} such that A^{+}A = I(D), where I(D) is the identity matrix of size DxD (this holds exactly when A has full column rank).

We find A^{+} using the SVD decomposition A = USV^{T}, where

A^{+} = VS^{+}U^{T}

and S^{+} is the (transposed) diagonal matrix containing the inverses of the non-zero singular values in S.

Python code for finding pseudo-inverse:

```python
a = np.random.uniform(0, 1, (7, 5))
b = np.random.uniform(0, 1, (7, 1))
u, s, vh = np.linalg.svd(a)
s_plus = np.zeros((a.shape[1], a.shape[0]))          # 5x7
s_plus[:s.shape[0], :s.shape[0]] = np.diag(1.0 / s)  # invert non-zero singular values
a_plus = vh.T.dot(s_plus).dot(u.T)                   # A+ = V S+ U^T
x = np.dot(a_plus, b)                                # least-squares solution
```

The pseudo-inverse can also be found without SVD using the equation A^{+} = (A^{T}A)^{-1}A^{T}, valid when A^{T}A is invertible (i.e. A has full column rank):

```python
a_plus = np.linalg.inv(a.T.dot(a)).dot(a.T)
```

This is what is commonly used to find the solution in linear regression (the normal equations).
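As a sanity check (a minimal sketch; the matrix sizes are arbitrary), both pseudo-inverse routes should agree with NumPy's built-in least-squares solver:

```python
import numpy as np

np.random.seed(0)
a = np.random.uniform(0, 1, (7, 5))   # overdetermined: 7 equations, 5 unknowns
b = np.random.uniform(0, 1, (7, 1))

# Pseudo-inverse via the normal equations
a_plus = np.linalg.inv(a.T.dot(a)).dot(a.T)
x1 = a_plus.dot(b)

# NumPy's built-ins
x2 = np.linalg.pinv(a).dot(b)               # SVD-based pseudo-inverse
x3, *_ = np.linalg.lstsq(a, b, rcond=None)  # least-squares solver

assert np.allclose(x1, x2) and np.allclose(x1, x3)
```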

- https://hadrienj.github.io/posts/Deep-Learning-Book-Series-2.9-The-Moore-Penrose-Pseudoinverse/
- https://www.math.ucla.edu/~laub/33a.2.12s/mppseudoinverse.pdf

Q.8

Generate a random graph using the Erdos-Renyi model, create the adjacency matrix for the graph, and then compute the PageRank of each node using the following code:

Create Random Graph

```python
import numpy as np
import matplotlib.pyplot as plt
import networkx as nx
from networkx.generators.random_graphs import erdos_renyi_graph

n = 10
p = 0.2
g = erdos_renyi_graph(n, p, directed=True)
nx.draw(g, with_labels=True)
plt.show()
```

Create the adjacency matrix (the graph is directed, so only the edge x -> y is recorded):

```python
adjacency_matrix = np.zeros((n, n))
for x, y in g.edges:
    adjacency_matrix[x][y] = 1
```

Computing the PageRank vector by power iteration:

```python
num_iter = 500
d = 0.85
out_degrees = np.sum(adjacency_matrix, axis=1)
# pinv handles nodes with zero out-degree (their rows invert to zero)
m = np.linalg.pinv(np.diag(out_degrees)).dot(adjacency_matrix).T
v = np.random.rand(n, 1)
v = v / np.linalg.norm(v, 1)
for i in range(num_iter):
    v = d * m.dot(v) + (1.0 - d) / float(n)
```

array([[0.08486854], [0.03670438], [0.19418566], [0.09610343], [0.03059936], [0.02539957], [0.06926918], [0.05626445], [0.03670438], [0.05106913]])

The PageRank of node 2 is the maximum because there are many incoming edges to node 2, so it is deemed important, whereas the PageRank of node 5 is the minimum because it has only one incoming edge and one outgoing edge.

Although node 7 also has only one incoming edge, that edge comes from node 2, which is very important, and there are no outgoing edges from node 7; thus node 7 is more important than node 5.

When the graph is very large, say 100 billion nodes, we can use memory-efficient maps instead of matrix operations: such a graph is sparse, and holding the entire matrix with mostly 0s in memory would be wasteful.

```python
adjacency_matrix = {}
for x, y in g.edges:
    if x not in adjacency_matrix:
        adjacency_matrix[x] = []
    adjacency_matrix[x].append(y)

num_iter = 500
d = 0.85

out_degrees = {}
for x, adj_lst in adjacency_matrix.items():
    out_degrees[x] = len(adj_lst)

v = np.random.rand(n)
v = v / np.linalg.norm(v, 1)
for i in range(num_iter):
    u = np.zeros(n)
    for x, adj_lst in adjacency_matrix.items():
        for y in adj_lst:
            u[y] += d * v[x] / float(out_degrees[x])
    u += (1.0 - d) / float(n)
    v = np.copy(u)
```

The nested for-loops can be parallelized as a map-reduce operation. This gives the same result as the matrix-based code above, and while it is slower to compute, it has a much smaller memory requirement when the graph is very sparse (as hyperlinks on the internet are).
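A minimal sketch of how one power-iteration step maps onto map-reduce (the chunking and function names are illustrative, not a distributed implementation):

```python
from collections import defaultdict

def map_phase(edge_chunk, v, out_degrees, d):
    # Each mapper emits partial rank contributions for its chunk of edges
    contrib = defaultdict(float)
    for x, y in edge_chunk:
        contrib[y] += d * v[x] / float(out_degrees[x])
    return contrib

def reduce_phase(partials, n, d):
    # The reducer sums contributions per node and adds the teleport term
    v_new = [(1.0 - d) / n] * n
    for contrib in partials:
        for y, c in contrib.items():
            v_new[y] += c
    return v_new

# Toy usage: a 3-node cycle, with edges split across two "mappers"
edges = [(0, 1), (1, 2), (2, 0)]
out_degrees = {0: 1, 1: 1, 2: 1}
v = [1.0 / 3] * 3
for _ in range(50):
    partials = [map_phase(edges[:2], v, out_degrees, 0.85),
                map_phase(edges[2:], v, out_degrees, 0.85)]
    v = reduce_phase(partials, 3, 0.85)
```

For this symmetric 3-cycle the ranks converge to the uniform vector, as expected.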

Q.9

For PCA on a very large sparse matrix, use TruncatedSVD, which performs a randomized SVD calculation instead of a full SVD. Unlike PCA, sklearn's TruncatedSVD does not center the data, which is what allows it to work on sparse input without densifying it.

- https://arxiv.org/pdf/0909.4061.pdf
- https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
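A minimal sketch of TruncatedSVD on a sparse matrix (the matrix size, density, and component count are arbitrary choices for illustration):

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# A sparse matrix: 1000 x 500 with ~1% non-zero entries
X = sparse_random(1000, 500, density=0.01, format="csr", random_state=42)

svd = TruncatedSVD(n_components=50, algorithm="randomized", random_state=42)
X_reduced = svd.fit_transform(X)  # works directly on the sparse matrix

print(X_reduced.shape)                      # (1000, 50)
print(svd.explained_variance_ratio_.sum())  # fraction of variance captured
```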

Categories: MACHINE LEARNING, PROBLEM SOLVING

Tags: Linear Algebra, Machine Learning Interviews, Moore Penrose PseudoInverse, PageRank, PCA, SVD++
