Stokastik

Machine Learning, AI and Programming

Designing Movie Recommendation Engines - Part II

In the last post, we started designing a movie recommendation engine using the 20 million ratings dataset available from MovieLens. We began with a Content Based Recommendation approach, where we built a classification/regression model for each user based on the tags and genres assigned to the movies he has rated.

The assumption behind this approach is that the rating a user gives to a movie depends on the characteristics of the movie, such as the genre, actors, year of release etc., and his affinity towards those characteristics.

But this is not always the case: we found that users tend to watch and rate movies similarly to other users with whom they have watched and rated many movies in common, probably due to some social network effect.

Approaches that provide recommendations based on similarity between users or movies, such as Collaborative Filtering, perform better than the Content Based approach.

In this post, we will look into the Matrix Factorization (MF) approach for recommendations. In the MF approach, we discover the common latent factors that movies have/don't have and users prefer/don't prefer. These factors are called latent because we have no explicit information about them; they are only inferred from the ratings. For example, if a movie is tagged as a "Horror" movie and a user has specified in his preferences that he likes "Horror" movies, then the "Horror" factor is not latent, as it is explicitly stated by both the movie and the user. Latent factors may or may not be easily interpretable.

Latent Factors in 2 Dimensions (Male/Female and Serious/Lighthearted).

For example, suppose there is a set of 3 movies that each of a set of 5 users has watched. Then there must be some common factors among these movies as well as among these users. The common factors could be Johnny Depp, sci-fi, sequels, age group etc. The users might like Johnny Depp, like sci-fi, have watched the prequels, belong to the same age group and so on.

As you can see, we have not explicitly introduced these factors, but they can be learnt by the model. Then again, the interpretation of the factors could be completely wrong; the reason that the users have watched all 3 of these movies could simply be that they are friends or roommates.

Thus our model should not depend on whether we can interpret these factors correctly, but only on whether the factors explain the ratings reasonably well.

Given the ratings matrix R of dimension NxM (where N is the number of users and M is the number of movies), the MF approach tries to find two matrices P, of dimension NxK and Q, of dimension MxK such that:

R = PQ^T, where K is the number of latent factors.

P is the matrix of user vs. latent factors, i.e. how strongly a user is aligned towards these factors. Q is the matrix of movie vs. latent factors, i.e. how strongly a movie captures these factors.

Decomposition of Ratings Matrix into Latent Factor Components.

Note that we do not want the learnt matrices P and Q to exactly reproduce R, because then the unrated entries would remain unrated or become 0's. We want the product PQ^T to be some matrix R' that approximates R:

R'=PQ^T{\approx}R

Once we have computed the matrices P and Q, the unknown entries \hat{r}_{ij} can be computed as follows:

\hat{r}_{ij}={\sum}_{k=1}^Kp_{ik}*q_{kj},

where p_{ik} is the (i, k)-th entry in matrix P and q_{kj} is the (k, j)-th entry in matrix Q^T.
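To make this concrete, the predicted rating is simply the dot product of the i-th row of P with the j-th row of Q. A minimal numpy sketch (with made-up sizes, purely for illustration):

import numpy as np

# Illustrative sizes only: 4 users, 6 movies, K = 3 latent factors.
N, M, K = 4, 6, 3
P = np.random.randn(N, K)   # user latent factors
Q = np.random.randn(M, K)   # movie latent factors

i, j = 1, 4                      # predict the rating of user i for movie j
r_hat = np.dot(P[i], Q[j])       # sum_k p_ik * q_kj
R_hat = np.dot(P, Q.T)           # full N x M matrix of predicted ratings
print(r_hat - R_hat[i, j])       # ~0: both compute the same value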

The MF approach is somewhat similar to the Content Based Linear-Regression approach.

Recall that in the linear regression approach, we created a movie-tag matrix X, of dimensions MxD, where D is the number of tags, and then we fitted a linear regression model for each user. For the i-th user, the ratings were given as :

r_{ij}=a_{i0}+a_{i1}*x_{1j}+a_{i2}*x_{2j}+...+a_{iD}*x_{Dj} = a_{i0}+{\sum}_{k=1}^Da_{ik}*x_{kj}

where r_{ij} is the rating given by user i to movie j, x_{kj} is the entry for the (k, j)-th cell in the movie tag matrix X and a_{ik} are the weights corresponding to the tag k for the user i.

Note that this expression is very similar to the expression for \hat{r}_{ij} with latent matrices P and Q, where the matrix X corresponds to Q and the weights a_{ik} correspond to P. The only difference is that in the linear regression approach we already knew X and only had to learn the weights a_{ik}, whereas with MF we know neither P nor Q and need to learn both simultaneously.

Thus we now have (N + M)*K unknown variables to determine. The number of linear equations to solve equals the number of known ratings in the matrix R. This is an underdetermined system, because the number of equations is less than the number of unknown variables.
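To get a sense of the numbers (the user and movie counts below are approximate figures for the MovieLens 20M dataset, assumed here only for illustration):

# Approximate MovieLens 20M figures (assumed for illustration).
N, M, K = 138000, 27000, 500      # users, movies, latent factors
num_ratings = 20 * 10**6          # observed ratings (equations)

num_unknowns = (N + M) * K        # entries of P plus entries of Q
print(num_unknowns)               # 82,500,000 unknowns
print(num_unknowns > num_ratings) # True: more unknowns than equations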

If there were no unknown ratings in the matrix R, we could have solved the linear system of equations exactly. But a ratings matrix without any unknown ratings is of no use to us, unless all we want is the latent factors. Assuming that the ratings matrix is full, or that we fill the unknown ratings with the average of the known ratings for that movie (imputation), we can obtain the latent factor matrices P and Q using the Singular Value Decomposition (SVD) technique.

Given a matrix A of dimensions NxM, SVD decomposes A as follows:

A = USV^T

where U is an NxN matrix, S is an NxM diagonal matrix and V is an MxM matrix. Both U and V are orthogonal matrices. The diagonal matrix S contains the singular values of A along its diagonal. If the rank of matrix A is r <= min(N, M), then only the first r values along the diagonal of S are non-zero; the rest are all zero. The singular values along the diagonal are ordered in descending order, i.e.

S_{00} >= S_{11} >= ... >= S_{rr}

If we compute AA^T = USV^TVS^TU^T, then since V is orthogonal, V^TV = I and thus AA^T = USS^TU^T. SS^T is nothing but the diagonal matrix of the squares of the singular values [S^2_{00}, S^2_{11}, ..., S^2_{rr}].

Recall that any square, symmetric, real-valued matrix X of dimensions MxM can be written as X = QLQ^T, where the columns of Q are the eigenvectors of X and the diagonal matrix L contains the eigenvalues of X along its diagonal. Now since AA^T is also a square symmetric matrix and AA^T = UYU^T, where Y = SS^T, it follows that the columns of U contain the eigenvectors of AA^T and that the singular values of A are the square roots of the eigenvalues of AA^T.

Similarly, one can compute A^TA = VS^TU^TUSV^T = VYV^T and show that the columns of V contain the eigenvectors of A^TA.
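A quick numerical check of this relationship, as a small sketch on an arbitrary random matrix: the squared singular values returned by numpy's SVD should match the eigenvalues of AA^T.

import numpy as np

A = np.random.randn(4, 5)
_, s, _ = np.linalg.svd(A, full_matrices=False)

# Eigenvalues of the symmetric matrix A A^T, sorted in descending order.
eigvals = np.sort(np.linalg.eigvalsh(np.dot(A, A.T)))[::-1]

print(np.allclose(s**2, eigvals))   # True (up to numerical precision)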

Note that since the rank of matrix A is r, the maximum number of latent factors possible (linearly independent eigenvectors) is r. Thus if we take S_r to be the rxr diagonal matrix of non-zero singular values, keep only the first r columns of U, so that U_r is of dimension Nxr, and keep only the first r columns of V, so that V_r^T is of dimension rxM, then A can be written as:

A = U_rS_rV_r^T

Now if we set P = U_rS_r^{1/2} and Q = V_rS_r^{1/2}, then P is of dimension Nxr and Q is of dimension Mxr, and A can be written as A = PQ^T.

Thus our ratings matrix R can be factorized as R = PQ^T, where the user-latent factor matrix is P = U_rS_r^{1/2} and the movie-latent factor matrix is Q = V_rS_r^{1/2}, with U_r, S_r and V_r being the components of the SVD of R, keeping only the r non-zero singular values (latent factors).

Generating a random normal matrix N(0,1) of dimensions 4x5 in python:

import numpy as np
a = np.random.randn(4, 5)
print a
[[ 1.08910142 -1.22736514  0.11627635  0.60788857 -0.20142395]
[ 0.99217829 -1.55027301 -0.22738402  0.44763835 -0.05271375]
[ 0.34639318  0.71561485 -0.0739815  -0.32267602  0.19273828]
[ 0.03592374 -0.48508989 -0.26491703 -0.87844515  0.91048128]]

Then compute the components U_r, S_r and V_r^T using the SVD of the matrix a:

u, s, vt = np.linalg.svd(a, full_matrices=False)
print u
print s
print vt
u
[[-0.66244007 -0.14620624 -0.28178952  0.67852159]
[-0.7241735   0.11299258  0.00575221 -0.68027372]
[ 0.18772633  0.1214737  -0.95642687 -0.18775145]
[-0.03879077  0.97524384  0.07621824  0.20392524]]
s
[2.6195574  1.41049797 0.74391795 0.20018552]
vt
[[-0.52540938  0.79741686  0.03207695 -0.28758935  0.06583898]
[ 0.02126017 -0.27073661 -0.21980786 -0.66231373  0.66277803]
[-0.84653401 -0.51681081  0.02217042  0.09804947 -0.07862272]
[ 0.03155047 -0.05726469  0.96633489 -0.05297752  0.24313698]]

Note that the rank of the matrix is 4, since min(4, 5) = 4 and a random Gaussian matrix has full rank with probability 1.

Then compute the matrices P and Q as follows (note that the variable q below actually stores Q^T = S_r^{1/2}V_r^T, so that np.dot(p, q) directly reconstructs a):

s1 = np.diag(s)
p = np.dot(u, s1**0.5)
q = np.dot(s1**0.5, vt)
print p
print q
P
[[-1.07216236 -0.17364095 -0.24304538  0.30358478]
[-1.17207819  0.13419494  0.00496132 -0.30436872]
[ 0.30383595  0.1442675  -0.82492466 -0.08400394]
[-0.06278304  1.15824237  0.06573875  0.09124043]]
Q^T
[[-0.85037754  1.29062292  0.0519167  -0.4654647   0.10656069]
[ 0.02524951 -0.32153868 -0.26105346 -0.78659284  0.78714427]
[-0.73014133 -0.44575283  0.01912213  0.08456833 -0.06781263]
[ 0.01411634 -0.02562142  0.43235849 -0.02370325  0.10878458]]

We can verify that PQ^T, i.e. np.dot(p, q) (since q already stores Q^T), gives back our original matrix a:

np.dot(p, q)
array([[ 1.08910142, -1.22736514,  0.11627635,  0.60788857, -0.20142395],
[ 0.99217829, -1.55027301, -0.22738402,  0.44763835, -0.05271375],
[ 0.34639318,  0.71561485, -0.0739815 , -0.32267602,  0.19273828],
[ 0.03592374, -0.48508989, -0.26491703, -0.87844515,  0.91048128]])

But there are several challenges with this approach:

  1. Computing SVD requires knowing the full matrix R. If we fill the unknown ratings with imputed values from average ratings, the predicted ratings are not that good.
  2. The complexity of SVD computation is O(min(N^2M, NM^2)). Given that N (the number of users) could be in the millions and M (the number of movies) in the several thousands, computing the SVD of the full imputed matrix is a waste of space and running time.
  3. The rank of the full matrix R' could be in the several thousands (20-50K), but the number of "useful" latent factors could be much smaller (~500), so we don't need to compute the latent factor matrices at full rank, only at a reduced rank k = 500.
    • The optimal value of the reduced rank k << r can be chosen as the smallest k for which \frac{{\sum}_{i=1}^kS^2_{ii}}{{\sum}_{i=1}^rS^2_{ii}} \geq 0.95, and then A_{Truncated} = U_kS_kV_k^T
    • This is known as Truncated SVD.

Python functions for getting a matrix of reduced rank after SVD.

Note that although the rank of the new matrix is less than r, i.e. less than min(N, M), the dimensions of the new matrix are the same as those of the original matrix.

def get_truncated_rank(s, acceptance_ratio=0.95):
    # Smallest k such that the top-k singular values explain at least
    # 'acceptance_ratio' of the total squared singular value mass.
    s_sum_sq = np.sum(s**2)
    r = len(s)
    for k in range(1, r + 1):
        sk_sum_sq = np.sum(s[:k]**2)
        if float(sk_sum_sq)/s_sum_sq >= acceptance_ratio:
            return k
    return r

def get_reduced_rank_matrix(a):
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    k = get_truncated_rank(s)
    print k
    uk, sk, vkt = u[:,:k], np.diag(s[:k]), vt[:k,:]
    return np.dot(np.dot(uk, sk), vkt)
Original 10x5 matrix of rank 5
[[-0.49068061 -0.9511035  -0.29747854  0.38704866 -0.5231572 ]
[ 0.12211881  0.70603132 -1.10839293 -0.311787    0.11332492]
[ 1.56766408 -0.57486759  1.25176635 -0.29809648  0.47678418]
[-1.09934274  0.43489352  0.11940306 -0.16246338 -0.75328974]
[-0.29756537 -0.98728997 -0.29510109  0.93565654 -1.39858189]
[-0.67094911 -0.91362783  1.14857474  0.12041214  0.45741434]
[ 0.34530484 -1.23295696  0.45495699  0.04401526 -0.91377875]
[-0.30555599 -0.64147676 -0.16349685  0.70730611 -1.22413623]
[-1.57833146 -0.76579262  0.63032095  0.11872791  0.82668229]
[-0.540997    0.35148116 -0.80480644  0.97053772 -0.78488335]]
Reduced rank matrix of rank 4
[[-0.46150266 -0.83624914 -0.19300256  0.62281513 -0.48798781]
[ 0.15803602  0.84741373 -0.97978602 -0.02156528  0.15661741]
[ 1.547334   -0.65489369  1.1789715  -0.4623695   0.4522795 ]
[-1.08705211  0.48327364  0.1634115  -0.06315143 -0.73847533]
[-0.3041663  -1.0132735  -0.31873672  0.88231904 -1.40653827]
[-0.68520739 -0.96975328  1.09752085  0.00520101  0.44022825]
[ 0.38303915 -1.08442184  0.59007028  0.34891964 -0.86829603]
[-0.31413073 -0.67522989 -0.1942      0.63801962 -1.23447172]
[-1.57437555 -0.75022082  0.64448566  0.15069282  0.83145051]
[-0.59155926  0.15245084 -0.98585213  0.56197966 -0.84582813]]

Python function for getting a dimension reduced matrix (similar to PCA dimensionality reduction).

def reduce_dimension(a, k=10):
    u, s, _ = np.linalg.svd(a, full_matrices=False)
    uk, sk = u[:,:k], np.diag(s[:k])
    return np.dot(uk, sk)
print reduce_dimension(a, 2)
Dimensionality Reduced matrix. (of dimensions 10x2)
[[-1.12417244 -0.41135465]
[ 0.23754468  1.26301679]
[ 1.29629456 -1.35157959]
[-0.78276953  0.33160769]
[-1.84564669 -0.35711333]
[-0.10744961 -1.48758873]
[-0.74887017 -1.09900671]
[-1.49468627 -0.21703858]
[-0.3753158  -1.04722398]
[-1.25498373  0.88466669]]

With missing ratings, instead of doing SVD with imputation, we can work directly with only the observed ratings. As with finding solutions to an underdetermined linear system of equations, we will use the least squares fitting method to find the unknowns.

i.e. for the observed true ratings r_{ij}, find the values of p_{ik} and q_{kj} such that:

L = \sum_{ij}(r_{ij}-{\sum}_{k=1}^Kp_{ik}*q_{kj})^2 is minimized.

Note that in the above loss function, the summation \sum_{ij} is over only those pairs (i, j) for which there is an observed rating.

This is the loss function that we want to minimize. As with most machine learning systems, in order to reduce overfitting, we must add regularization to the above loss:

L = \sum_{ij}(r_{ij}-{\sum}_{k=1}^Kp_{ik}*q_{kj})^2 + {\lambda} * ({\sum}_{k=1}^Kp^2_{ik} + {\sum}_{k=1}^Kq^2_{kj}), after adding L2 regularization.

where {\lambda} is a constant for the regularization term.
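As a small illustrative sketch (not the training code used later in this post), the regularized loss over only the observed ratings can be computed as below; the (i, j, r_ij) triplet format and the per-observation regularization are assumptions chosen to match the per-rating gradients that follow:

import numpy as np

def regularized_loss(p, q, observed, lam=0.1):
    # observed: list of (i, j, r_ij) triplets for the known ratings only.
    loss = 0.0
    for i, j, r in observed:
        err = r - np.dot(p[i], q[j])     # r_ij - sum_k p_ik * q_kj
        loss += err**2 + lam * (np.sum(p[i]**2) + np.sum(q[j]**2))
    return loss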

One can solve the above loss minimization problem using the Gradient Descent technique:

p^{(t+1)}_{ik}=p^{(t)}_{ik} - {\eta}*\frac{{\partial}L}{{\partial}p_{ik}}   and   q^{(t+1)}_{kj}=q^{(t)}_{kj} - {\eta}*\frac{{\partial}L}{{\partial}q_{kj}}

Define e_{ij}=r_{ij}-{\sum}_{w=1}^Kp_{iw}*q_{wj}

The partial derivatives of the loss w.r.t. the unknowns (the constant 2 has been omitted):

\frac{{\partial}L}{{\partial}p_{ik}}=-\sum_{j}e_{ij}*q_{kj} + {\lambda}*p_{ik}   and   \frac{{\partial}L}{{\partial}q_{kj}}=-\sum_{i}e_{ij}*p_{ik} + {\lambda}*q_{kj}

In the first equation, the summation \sum_{j} is over all movies rated by user i and in the second equation, the summation \sum_{i} is over all users who have rated movie j.
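These derivative formulas can be sanity-checked numerically with a finite-difference approximation on a single observed rating. A small sketch (the constant 2 is kept here, since the numerical gradient does not drop it):

import numpy as np

K, lam, h = 8, 0.1, 1e-6
p_i, q_j = np.random.randn(K), np.random.randn(K)
r_ij, k = 4.0, 3

def sample_loss(pi):
    # Per-sample loss (r_ij - p_i . q_j)^2 + lam * (|p_i|^2 + |q_j|^2)
    e = r_ij - np.dot(pi, q_j)
    return e**2 + lam * (np.sum(pi**2) + np.sum(q_j**2))

e_ij = r_ij - np.dot(p_i, q_j)
analytic = 2 * (-e_ij * q_j[k] + lam * p_i[k])   # dL/dp_ik (with the constant 2)

p_plus, p_minus = p_i.copy(), p_i.copy()
p_plus[k] += h
p_minus[k] -= h
numeric = (sample_loss(p_plus) - sample_loss(p_minus)) / (2 * h)

print(abs(analytic - numeric))   # should be very small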

In Stochastic Gradient Descent (SGD), instead of iterating over all movie ratings in each epoch, we randomly sample a single rating and then update the corresponding row p_i of P and row q_j of Q (i.e. column j of Q^T) that produce that rating.

r_{ij}={\sum}_{k=1}^Kp_{ik}*q_{kj}

The final update equations are as follows (the superscript t denotes the epoch number):

p^{(t+1)}_{ik}=p^{(t)}_{ik} + {\eta}*(e^{(t)}_{ij}*q^{(t)}_{kj} - {\lambda}*p^{(t)}_{ik})   and   q^{(t+1)}_{kj}=q^{(t)}_{kj} + {\eta}*(e^{(t)}_{ij}*p^{(t)}_{ik} - {\lambda}*q^{(t)}_{kj})
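For reference, a bare-bones vanilla-SGD version of this per-rating update might look like the sketch below; the (i, j, r_ij) triplet format is an assumption, and the actual training code later in this post uses mini-batches and Adam instead:

import numpy as np
import random

def sgd_step(p, q, observed, eta=0.01, lam=0.1):
    # Sample one observed rating and update row p_i of P and row q_j of Q.
    i, j, r_ij = random.choice(observed)
    e_ij = r_ij - np.dot(p[i], q[j])
    p_i_old = p[i].copy()                       # use the old p_i in the q_j update
    p[i] += eta * (e_ij * q[j] - lam * p[i])
    q[j] += eta * (e_ij * p_i_old - lam * q[j])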

Instead of sampling only a single rating in each step, we can randomly sample a batch of ratings for efficiency. This is known as mini-batch gradient descent (full batch gradient descent would use all the ratings in every step).

With our 20 million ratings from the MovieLens database, we first prepare the training and validation data. We randomly select 5000 ratings for validation and keep the rest for training.

from collections import defaultdict
import numpy as np
import os, random
import pandas as pd
dataset_path = "ml-20m"
ratings_df = pd.read_csv(os.path.join(dataset_path, "ratings.csv"), encoding="utf-8", sep=",")
user_id, movie_id, ratings = list(ratings_df[u'userId']), list(ratings_df[u'movieId']), list(ratings_df[u'rating'])
uid_mid_pairs = zip(user_id, movie_id, ratings)
random.shuffle(uid_mid_pairs)
validation_data, training_data = uid_mid_pairs[:5000], uid_mid_pairs[5000:]

For our P and Q matrices, we need a way to map the user ids and the movie ids to corresponding rows in the matrices P and Q respectively.

uid_map, mid_map = dict(), dict()
user_id, movie_id = sorted(list(set(user_id))), sorted(list(set(movie_id)))

for idx in range(len(user_id)):
    uid_map[user_id[idx]] = idx

for idx in range(len(movie_id)):
    mid_map[movie_id[idx]] = idx

Next we define the function for computing the errors e_{ij} of the current model, i.e. given the matrices P and Q and the set of ratings for which we want to compute the error, the function computes the predicted rating \hat{r}_{ij} as:

\hat{r}_{ij}={\sum}_{k=1}^Kp_{ik}*q_{kj}

Then we simply subtract this from the true rating to get the error: e_{ij}=r_{ij}-\hat{r}_{ij}.

def get_ratings_errors(p, q, uid_map, mid_map, mydata):
    # mydata is a list of (userId, movieId, rating) triplets.
    p_idx, q_idx, true_ratings = map(list, zip(*mydata))
    p_idx = [uid_map[x] for x in p_idx]
    q_idx = [mid_map[x] for x in q_idx]
    preds = np.sum(p[p_idx,:] * q[q_idx,:], axis=1)
    return true_ratings - preds

Note that we use the maps to convert the user and movie ids to the corresponding row indices of P and Q respectively.

Now we come to the training part.

Since we do not initially know P and Q, we randomly initialize them from the N(0,1) distribution. From multiple experiments and observations, I found that vanilla SGD (the standard version) converges quite slowly, so I used momentum to speed up convergence a bit. Finally, I went with the Adam algorithm for optimization.

def init_matrices(num_users, num_movies, latent_dim=128):
    p, q = np.random.randn(num_users, latent_dim), np.random.randn(num_movies, latent_dim)
    pm, qm = np.zeros((num_users, latent_dim)), np.zeros((num_movies, latent_dim))
    pv, qv = np.zeros((num_users, latent_dim)), np.zeros((num_movies, latent_dim))
    return p, q, pm, qm, pv, qv

def train_sgd_adam(training_data, validation_data, uid_map, mid_map, batch_size=32, max_tol_loss=0.5):
    eta, lambdas = 0.001, 0.1
    beta1, beta2 = 0.9, 0.999
    eps = 1e-8

    p, q, pm, qm, pv, qv = init_matrices(len(uid_map), len(mid_map), 128)
    num_iter = 0

    while True:
        num_iter += 1

        if num_iter % 1000 == 0:
            errs_validation = get_ratings_errors(p, q, uid_map, mid_map, validation_data)
            rmse_loss = np.sqrt(np.mean(errs_validation**2))
            print rmse_loss

            if rmse_loss < max_tol_loss:
                break

        selected_pairs = random.sample(training_data, batch_size)

        errs_train = get_ratings_errors(p, q, uid_map, mid_map, selected_pairs)
        errs_train = errs_train.reshape((errs_train.shape[0], -1))

        p_idx, q_idx, true_ratings = map(list, zip(*selected_pairs))
        p_idx = [uid_map[x] for x in p_idx]
        q_idx = [mid_map[x] for x in q_idx]

        x, y = p[p_idx,:], q[q_idx,:]
        grad1, grad2 = -(errs_train * y - lambdas * x), -(errs_train * x - lambdas * y)

        pm[p_idx,:] = beta1 * pm[p_idx,:] + (1 - beta1) * grad1
        pv[p_idx,:] = beta2 * pv[p_idx,:] + (1 - beta2) * (grad1**2)
        p[p_idx,:] += -eta * pm[p_idx,:]/(np.sqrt(pv[p_idx,:]) + eps)

        qm[q_idx,:] = beta1 * qm[q_idx,:] + (1 - beta1) * grad2
        qv[q_idx,:] = beta2 * qv[q_idx,:] + (1 - beta2) * (grad2**2)
        q[q_idx,:] += -eta * qm[q_idx,:]/(np.sqrt(qv[q_idx,:]) + eps)

    return p, q

Note that the Adam optimizer requires additional auxiliary matrices "pm", "pv", "qm", "qv", which are initialized with all zeros.

In each iteration, we randomly select 32 ratings from the training data, compute their prediction errors, compute the gradients and then update the variables accordingly. To assess how the model is learning, every 1000 iterations I compute the RMSE error on the validation data.

We stop iterating when the RMSE loss on the validation data falls below the maximum tolerable loss.

Since the validation data is out of sample with respect to the training data, we need a good regularization trade-off for the model to learn well. If the regularization constant is too small, then the matrices P and Q perform well on the training data but generalize poorly on the validation data.

Recall that in the CF approach, we had normalized the ratings for each user (by subtracting the mean and dividing by the standard deviation) due to inconsistent ratings. In the Matrix Factorization approach, we can incorporate these factors into the model itself, so the normalization factors are learnt by the model.

Each rating can be thought of being composed as:

mean rating + bias of user + bias of movie + latent factors common to both user and movie.

The updated ratings equation:

\hat{r}_{ij}={\mu}+b_i+b_j+{\sum}_{k=1}^Kp_{ik}*q_{kj}

where {\mu} is the mean rating on the dataset, which is around 3.52; b_i is the bias of a user, i.e. some users tend to give high ratings (3,4,5) while some users tend to give low ratings (1,2,3); and b_j is the bias of a movie, i.e. some movies tend to get very high ratings due to promotion and marketing.

One can remove the constant 'mean rating' component from the equation and let the other three components learn it, but that might require some additional iterations to converge.
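For example, with made-up illustrative numbers: for a user who tends to rate 0.4 above the mean, a movie that is usually rated 0.3 below the mean, and a latent-factor dot product of 0.25, the predicted rating would be

\hat{r}_{ij} = 3.52 + 0.4 + (-0.3) + 0.25 = 3.87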

In order to incorporate the bias terms in our code, we modify the 'get_ratings_errors' function to add the additional factors.

def get_ratings_errors(p, q, uid_map, mid_map, bias_u, bias_m, mean_rating, mydata):
    p_idx, q_idx, true_ratings = map(list, zip(*mydata))
    p_idx = [uid_map[x] for x in p_idx]
    q_idx = [mid_map[x] for x in q_idx]
    preds = np.sum(p[p_idx,:] * q[q_idx,:], axis=1)
    preds += bias_u[p_idx] + bias_m[q_idx] + mean_rating
    return true_ratings - preds

We also need to initialize the bias vectors for both users and movies. The biases are initialized to all zeros.

def init_biases(num_users, num_movies):
    bias_u, bias_m = np.zeros((num_users)), np.zeros((num_movies))
    b1m, b2m = np.zeros((num_users)), np.zeros((num_movies))
    b1v, b2v = np.zeros((num_users)), np.zeros((num_movies))
    return bias_u, bias_m, b1m, b2m, b1v, b2v

The additional moment terms (b1m, b1v, b2m, b2v) are required for the Adam optimizer, just as with the latent matrices P and Q.

We also need to update the bias terms as follows:

selected_pairs = random.sample(training_data, batch_size)
errs_train = get_ratings_errors(p, q, uid_map, mid_map, bias_u, bias_m, mean_rating, selected_pairs)

p_idx, q_idx, true_ratings = map(list, zip(*selected_pairs))
p_idx = [uid_map[x] for x in p_idx]
q_idx = [mid_map[x] for x in q_idx]

# Adam updates for the user and movie bias terms.
u, v = bias_u[p_idx], bias_m[q_idx]
gradb1, gradb2 = -(errs_train - lambdas * u), -(errs_train - lambdas * v)
b1m[p_idx] = beta1 * b1m[p_idx] + (1 - beta1) * gradb1
b1v[p_idx] = beta2 * b1v[p_idx] + (1 - beta2) * (gradb1**2)
bias_u[p_idx] += -eta * b1m[p_idx]/(np.sqrt(b1v[p_idx]) + eps)
b2m[q_idx] = beta1 * b2m[q_idx] + (1 - beta1) * gradb2
b2v[q_idx] = beta2 * b2v[q_idx] + (1 - beta2) * (gradb2**2)
bias_m[q_idx] += -eta * b2m[q_idx]/(np.sqrt(b2v[q_idx]) + eps)

# Adam updates for the latent factor matrices P and Q, as before.
errs_train = errs_train.reshape((errs_train.shape[0], -1))
x, y = p[p_idx,:], q[q_idx,:]
grad1, grad2 = -(errs_train * y - lambdas * x), -(errs_train * x - lambdas * y)
pm[p_idx,:] = beta1 * pm[p_idx,:] + (1 - beta1) * grad1
pv[p_idx,:] = beta2 * pv[p_idx,:] + (1 - beta2) * (grad1**2)
p[p_idx,:] += -eta * pm[p_idx,:]/(np.sqrt(pv[p_idx,:]) + eps)
qm[q_idx,:] = beta1 * qm[q_idx,:] + (1 - beta1) * grad2
qv[q_idx,:] = beta2 * qv[q_idx,:] + (1 - beta2) * (grad2**2)
q[q_idx,:] += -eta * qm[q_idx,:]/(np.sqrt(qv[q_idx,:]) + eps)

With 64 latent factors, the RMSE loss on the validation data set varies between 0.78 and 0.85.

Note: One can try to reduce the validation error by decreasing the regularization constant whenever the RMSE error gets 'stuck', but that might only work for a particular instance of the validation data.

There is another popular method for solving for the unknowns, known as Alternating Least Squares (ALS).

Instead of updating both p_{ik} and q_{kj} simultaneously as in SGD, in ALS we alternately hold one of them constant, update the other, and repeat until convergence.

For example, in round 1 we fill the matrix Q with random values sampled from a normal distribution N(0,1).

Then the values q_{kj} in the loss function become constants for us.

L = \sum_{ij}(r_{ij}-{\sum}_{k=1}^Kp_{ik}*q_{kj})^2 + {\lambda} * {\sum}_{k=1}^Kp^2_{ik}

We only need to find the values for p_{ik}.

Note that this is the same as finding the coefficients in a (ridge) linear regression problem. One could use SGD here too, but we can directly compute the values p_{ik}, i.e. the latent factor vector for user i, as follows:

Compute X=(Q^TQ+{\lambda}I)^{-1}Q^T, X is a KxM matrix.

p_{ik}=\sum_{w=1}^Mr_{iw}*X^T_{wk}

where the summation is over all movies user i has rated.

After we have updated the matrix P, in the next round, we fix the values of p_{ik} to the ones computed in the last iteration and proceed to update the unknowns for Q, i.e.

Compute Y=(P^TP+{\lambda}I)^{-1}P^T, Y is a KxN matrix.

q_{kj}=\sum_{w=1}^NY_{kw}*r_{wj}

where the summation is over all users who have rated movie j.
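A compact numpy sketch of one ALS round following these two closed-form updates is given below. For simplicity it assumes a fully observed (or imputed) ratings matrix R; with missing ratings, each user's and movie's linear system should be restricted to its observed entries:

import numpy as np

def als_round(R, P, Q, lam=0.1):
    # One ALS round; assumes R is fully observed (or imputed).
    K = P.shape[1]

    # Fix Q, solve for P:  X = (Q^T Q + lam*I)^{-1} Q^T  (K x M),  P = R X^T
    X = np.dot(np.linalg.inv(np.dot(Q.T, Q) + lam * np.eye(K)), Q.T)
    P = np.dot(R, X.T)

    # Fix P, solve for Q:  Y = (P^T P + lam*I)^{-1} P^T  (K x N),  Q^T = Y R
    Y = np.dot(np.linalg.inv(np.dot(P.T, P) + lam * np.eye(K)), P.T)
    Q = np.dot(Y, R).T

    return P, Q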

Moore-Penrose Pseudoinverse.

We repeat this process until the values of P and Q stop changing by more than some threshold.

With limited resources and for scalability reasons, it is advisable to compute p_{ik} and q_{kj} using the SGD technique instead of costly matrix operations like Q^TQ, P^TP or their inverses. For 1 million users, computing P^TP requires on the order of NK^2 multiplications, i.e. roughly 10^11 to 10^12 for a few hundred latent factors.

The advantage of the ALS method over SGD is that the calculations of the p_{ik} and q_{kj} can be parallelized. In our SGD method, the current update of p_{ik} depends on the last update of q_{kj}, which in turn depends on the update of p_{ik} before that, so the individual p_{ik} cannot be updated independently.

Practical Recommendations for ALS.

In ALS, on the other hand, since we hold one of p_{ik} or q_{kj} constant in each round, each p_{ik} (or q_{kj}) can be computed independently in alternating rounds.

Before ending this post, I would like to point out that MF should be the preferred approach for a recommendation engine, for the following reasons:

  1. Scalability
    • In the Content Based approach, we require a model for each user. Thus, with a million users, we would have a million models.
    • In the CF approach, we need to compute pairwise similarities between every pair of users or movies, which is O(N^2).
    • In the MF approach, we only need to compute the two matrices P and Q. We have seen that even with a small number of latent factors (64) we can generate good enough RMSE scores on the validation data.
    • With SGD optimization (+Adam), we can easily train with 20 million ratings in under 3-4 hours and compute predicted ratings for all unrated movies.
    • If we use ALS, then we can parallelize the computations also. We cannot parallelize user or item based CF effectively.
  2. Speed of Prediction
    • In the CF based approach, we cannot produce predictions at run time, because we would need to perform pairwise computations on the fly to find similar users or movies, which is not feasible.
    • In the MF approach, if we keep track of all the movies a user has not rated yet, then we can produce predictions at run time very quickly.

Edit:

I realized that the number of iterations to convergence of the Adam method could be reduced by selecting the initial random matrices P and Q well.

If we use the model without the bias terms i.e.

\hat{r}_{ij}={\sum}_{k=1}^Kp_{ik}*q_{kj}

Then we can choose the values of P and Q such that their dot products are close to the mean rating of 3.52. For example, if I set all entries of P and Q to some value 'x', then any dot product would be Kx^2, where K is the number of latent dimensions.

Thus if I choose the random values for P and Q from the following normal distribution:

N(\sqrt{\frac{3.52}{K}}, 0.001)

Then all dot products would be approximately 3.52, and the starting validation error would already be quite low.

def init_matrices(num_users, num_movies, mean_rating, latent_dim=128):
    mu = np.sqrt(mean_rating/latent_dim)
    p = np.random.normal(mu, 0.001, num_users * latent_dim).reshape((num_users, latent_dim))
    q = np.random.normal(mu, 0.001, num_movies * latent_dim).reshape((num_movies, latent_dim))
    pm, qm = np.zeros((num_users, latent_dim)), np.zeros((num_movies, latent_dim))
    pv, qv = np.zeros((num_users, latent_dim)), np.zeros((num_movies, latent_dim))
    return p, q, pm, qm, pv, qv

In the case of the model with the mean rating and bias terms, i.e.

\hat{r}_{ij}={\mu}+b_i+b_j+{\sum}_{k=1}^Kp_{ik}*q_{kj}

the values of P and Q should be chosen as close to 0 as possible, to keep the initial errors as low as possible.

def init_matrices(num_users, num_movies, mean_rating, latent_dim=128):
    p = np.random.normal(0.0, 0.001, num_users * latent_dim).reshape((num_users, latent_dim))
    q = np.random.normal(0.0, 0.001, num_movies * latent_dim).reshape((num_movies, latent_dim))
    pm, qm = np.zeros((num_users, latent_dim)), np.zeros((num_movies, latent_dim))
    pv, qv = np.zeros((num_users, latent_dim)), np.zeros((num_movies, latent_dim))
    return p, q, pm, qm, pv, qv

Go to Part 3 to learn about the SVD++ technique.


Categories: AI, DESIGN, MACHINE LEARNING, PROBLEM SOLVING
