In the last two parts of this series, we have been looking at how to design and implement a movie recommendations engine using the MovieLens 20M ratings dataset. We have looked at some of the most common and standard techniques out there, namely content-based recommendations, collaborative filtering (CF) and latent factor based matrix factorization (MF). The CF and MF approaches clearly emerged as the winners due to their accuracy and ease of implementation.

In this post we will integrate the best of both the neighborhood-based CF approach and the latent factor approach into a common SVD++ framework, and see how that impacts the RMSE scores on a hold-out validation dataset. This post is inspired by the following papers:

- Factorization Meets the Neighborhood.
- Advances in Collaborative Filtering.
- SVD++ and Probabilistic Matrix Factorization.

The idea is to create a single ratings model or framework which captures the following:

- Baseline estimates for users and movies, i.e. based only on the bias of a user U and the bias of a movie M, what rating does U give to M.
- The rating can be expressed as $\hat{r}_{ij} = \mu + b_i + b_j$, where $\hat{r}_{ij}$ is the rating given by user i to movie j, $\mu$ is the average of all ratings, $b_i$ is the bias of user i and $b_j$ is the bias of movie j.
- Note that the bias of a user is independent of the movie and vice versa.
- For example, consider the newly released movie Avengers: Infinity War, and assume that the average rating for all movies on IMDB is 3.5.
- Given that it is an Avengers movie, which generally receives a lot of hype and publicity, and given the suspense created around it, the movie would have a high positive bias.
- Let's assume that the bias for Infinity War is 0.7.
- Similarly, if the user rating the movie is already a Marvel fan, then he has a positive bias too. Let's assume the positive bias for the user is 0.4. Then the baseline estimate for the rating is 3.5 + 0.7 + 0.4 = 4.6.
- On the other hand, if the user is not a Marvel fan or has not watched any of the prequels to Infinity War, then he would not appreciate the movie as much and would have a negative bias. If the negative bias for the user is -0.3, then the baseline estimate is 3.5 + 0.7 - 0.3 = 3.9.
- The values for $b_i$ and $b_j$ can be estimated by minimizing the following loss function:
- $L = \sum_{(i,j)} (r_{ij} - \mu - b_i - b_j)^2 + \lambda \left( \sum_i b_i^2 + \sum_j b_j^2 \right)$, where the sum runs over all observed ratings and $\lambda$ is the regularization constant.
- One can use gradient descent to find the unknowns.
- If there were only one set of unknowns, i.e. either the $b_i$ or the $b_j$, then the loss would be a convex function and the unknowns could be found by setting the partial derivatives of L w.r.t. the unknowns to 0.
- For example, if the $b_j$ were held constant, then we can show that $b_i = \frac{\sum_{j \in R(i)} (r_{ij} - \mu - b_j)}{\lambda + N_i}$, where $R(i)$ is the set of movies rated by user i and $N_i = |R(i)|$ is the number of movies rated by user i.
- But the biases alone are not sufficient to capture all the complexities of the ratings; the models below add further terms on top of this baseline estimate.
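To make the baseline estimation concrete, here is a minimal sketch of the alternating closed-form updates described above: with one set of biases held fixed, the loss is convex in the other and the minimizer is exact. The ratings matrix here is a hypothetical toy example, not the MovieLens data.

```python
import numpy as np

# Toy ratings: rows = users, columns = movies, 0 = missing (hypothetical data)
R = np.array([[4.0, 0.0, 3.5],
              [5.0, 4.0, 0.0],
              [0.0, 2.0, 1.0]])
mask = R > 0
mu = R[mask].mean()          # global mean over observed ratings
lam = 0.1                    # regularization constant

bias_u = np.zeros(R.shape[0])
bias_m = np.zeros(R.shape[1])

# Alternate the exact per-bias minimizers a few times
for _ in range(10):
    for i in range(R.shape[0]):
        rated = mask[i]
        bias_u[i] = (R[i, rated] - mu - bias_m[rated]).sum() / (lam + rated.sum())
    for j in range(R.shape[1]):
        raters = mask[:, j]
        bias_m[j] = (R[raters, j] - mu - bias_u[raters]).sum() / (lam + raters.sum())

# Baseline estimate mu + b_i + b_j for every (user, movie) pair
baseline = mu + bias_u[:, None] + bias_m[None, :]
```

On this toy matrix, the fitted baselines track the observed ratings much more closely than the global mean alone, which is the whole point of the bias terms.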

- Neighborhood Model
- The user-based and item-based CF approaches are not model-based approaches but heuristics similar to the K-Nearest Neighbors algorithm.
- Instead of explicitly computing similarities between movies rated by a user, one can define a neighborhood model: $\hat{r}_{ij} = \mu + b_i + b_j + \sum_{k \in R(i)} w_{jk} (r_{ik} - b_{ik})$,
- where $b_{ik} = \mu + b_i + b_k$ is the baseline estimate for movie k and the summation is over all the movies $R(i)$ rated by user i.
- The weight $w_{jk}$ measures the relevancy of movies j and k to each other; it is not computed using cosine or Euclidean similarity but is learnt by the model.
- The equation is similar to the PageRank formulation. If we denote the residual $d_{ik} = r_{ik} - b_{ik}$, then the equation becomes: $\hat{r}_{ij} = \mu + b_i + b_j + \sum_{k \in R(i)} w_{jk} d_{ik}$.
- The weight $w_{jk}$ is analogous to the fraction of the out-going links from k to j. Unlike PageRank, this factor is learnt from the data.
- Intuitively this means that a user will rate a movie highly if he has rated similar movies highly.
- For example, take LOTR parts 1 and 2. If the user has rated part 1 well above its baseline estimate, then he is expected to rate part 2 above its baseline estimate as well.
- The loss function in this case would be:
- $L = \sum_{(i,j)} \left( r_{ij} - \mu - b_i - b_j - \sum_{k \in R(i)} w_{jk} (r_{ik} - b_{ik}) \right)^2 + \lambda \sum_{j,k} w_{jk}^2$.
- Note that the baseline estimates are assumed to be fixed and only the weights $w_{jk}$ are assumed to be unknown.
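A minimal sketch of how such a neighborhood prediction is computed, using hypothetical toy values; the matrix `W` stands in for the learnt weights $w_{jk}$ (random values here, learnt from data in practice):

```python
import numpy as np

# Hypothetical toy setup: 4 movies; the user has rated movies 0 and 2.
mu = 3.5                                   # global mean rating
b_u = 0.2                                  # bias of the user
b_m = np.array([0.1, -0.3, 0.4, 0.0])      # movie biases
r_i = np.array([4.5, 0.0, 4.0, 0.0])       # the user's ratings (0 = unrated)
rated = r_i > 0

# Stand-in for the learnt relevancy weights w_jk
W = np.random.RandomState(0).normal(0.0, 0.01, (4, 4))

def predict(j):
    # Residuals of the rated movies from their baselines b_ik = mu + b_u + b_m[k]
    resid = r_i[rated] - (mu + b_u + b_m[rated])
    # Neighborhood model: baseline for movie j plus the weighted residuals
    return mu + b_u + b_m[j] + W[j, rated].dot(resid)

pred = predict(1)
```

Only the residuals of the movies the user actually rated contribute, which is exactly why a sparse rating history weakens this model, as the next section discusses.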

- Implicit Feedback
- So far our model has only been considering explicit feedback in the form of ratings given by users to movies.
- This is problematic for a new user or a user who has rated only a few movies. In that case, the neighborhood model would not be able to capture the expected ratings for similar movies.
- For example, if the user has watched but not rated LOTR part 1, then although the weight between LOTR parts 1 and 2 is very high, the term for LOTR part 1 would be 0 since the rating is missing.
- In such cases, if we have data about whether users have shown some kind of interest towards a movie, in the form of watching it or renting the DVD (pre-Netflix-and-Chill era), then we can incorporate such implicit feedback into the model.
- Thus we need a separate factor that assigns weights to pairs of movies but without the explicit ratings term. Let's call these weights $c_{jk'}$ and extend the model to: $\hat{r}_{ij} = \mu + b_i + b_j + \sum_{k \in R(i)} w_{jk} (r_{ik} - b_{ik}) + \sum_{k' \in N(i)} c_{jk'}$.
- The second summation is over the set $N(i)$ of all movies k' in which user i has shown either explicit or implicit interest. The user may not have rated movie k', but has just watched it. Thus $|N(i)| \geq |R(i)|$.
- Since we do not have such implicit feedback for our data, we can use the ratings themselves as implicit feedback or use the tags information file, i.e. the tags given by users to movies.
- Assume that we have a binary matrix X for user vs. implicit feedback, where $x_{ij} = 1$ if user i has rated movie j or given a tag for movie j, and $x_{ij} = 0$ otherwise. Thus $N(i) = \{ j : x_{ij} = 1 \}$.
- The new loss function is as follows:
- $L = \sum_{(i,j)} \left( r_{ij} - \mu - b_i - b_j - \sum_{k \in R(i)} w_{jk} (r_{ik} - b_{ik}) - \sum_{k' \in N(i)} c_{jk'} \right)^2 + \lambda \left( \sum_{j,k} w_{jk}^2 + \sum_{j,k} c_{jk}^2 \right)$.
- We need to learn the weights $c_{jk}$ in addition to $b_i$, $b_j$ and $w_{jk}$.

- Normalization
- Some users might watch or rate many movies while others rate very few. The 'heavy viewers' receive quality recommendations while the 'casual watchers' receive poor recommendations.
- While this seems to be the natural way to go, the idea of a recommendation engine is to engage more of the 'casual watchers' on the site.
- One obvious way to do this is to normalize the neighborhood factors.
- Standardize the terms $w_{jk} (r_{ik} - b_{ik})$ and $c_{jk'}$ to have mean 0 and standard deviation 1.0.
- Then the sum of these terms can also be standardized (assuming normal distributions) to have mean 0 and std. dev. 1.0 by the following transformations: $|R(i)|^{-1/2} \sum_{k \in R(i)} w_{jk} (r_{ik} - b_{ik})$ and $|N(i)|^{-1/2} \sum_{k' \in N(i)} c_{jk'}$,
- where $|R(i)|$ is the number of explicit ratings given by user i and $|N(i)|$ is the number of implicit feedbacks given by user i.
- This normalization constant arises because the variance of the sum of N independent normal distributions, each with std. dev. 1.0, is N; thus the std. dev. of the sum is $\sqrt{N}$.
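The per-user normalization constants $|R(i)|^{-1/2}$ can be read straight off a sparse ratings matrix; a small sketch on a hypothetical toy matrix (the 0.001 guard against empty rows mirrors what the implementation uses later):

```python
import numpy as np
from scipy import sparse

# Hypothetical toy ratings matrix: 3 users x 4 movies, 0 = missing
R = sparse.csr_matrix(np.array([[4.0, 0.0, 3.0, 0.0],
                                [5.0, 4.0, 0.0, 2.0],
                                [0.0, 0.0, 1.0, 0.0]]))

# |R(i)|: number of explicit ratings per user (non-zeros per CSR row)
num_ratings = np.diff(R.indptr)

# Normalization constants |R(i)|^(-1/2); the small constant guards against empty rows
normalizations = 1.0 / np.sqrt(num_ratings + 0.001)
```

Users with more ratings get smaller constants, so their neighborhood sums are scaled down proportionally.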

- Latent Factor Model
- Recall that in the last post we had explored how to factorize the ratings matrix R into two latent factor matrices P and Q: $\hat{r}_{ij} = p_i^T q_j$,
- where P is of dimensions NxK, Q is of dimensions MxK and K is the number of latent factors.
- The difference between the above neighborhood model and the latent factor model is that in the neighborhood model we do not consider the interaction between the users and the movies.
- The latent factor model trains only with the explicit ratings; thus, in order to introduce implicit feedback into the latent factor model, introduce a weights matrix Y of dimensions MxK: $\hat{r}_{ij} = q_j^T \left( p_i + |N(i)|^{-1/2} \sum_{k' \in N(i)} y_{k'} \right)$,
- where again the summation is over the movies in which user i has implicitly shown interest.

- SVD++
- The SVD++ method is an extension of the latent factor SVD algorithm that also incorporates the neighborhood information as well as implicit feedback from users.
- Finally, we can integrate the neighborhood model with the latent factor model as follows: $\hat{r}_{ij} = \mu + b_i + b_j + q_j^T \left( p_i + |N(i)|^{-1/2} \sum_{k' \in N(i)} y_{k'} \right) + |R(i)|^{-1/2} \sum_{k \in R(i)} w_{jk} (r_{ik} - b_{ik}) + |N(i)|^{-1/2} \sum_{k' \in N(i)} c_{jk'}$
- The final loss function with regularization is as follows:
- $L = \sum_{(i,j)} (r_{ij} - \hat{r}_{ij})^2 + \lambda \left( \sum_i b_i^2 + \sum_j b_j^2 + \lVert P \rVert^2 + \lVert Q \rVert^2 + \lVert Y \rVert^2 + \lVert W \rVert^2 + \lVert C \rVert^2 \right)$.
- The unknowns can be found using gradient descent techniques as seen in the last post.
- For speed and efficiency we will be using mini-batch gradient descent along with the Adam optimization technique.
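Since Adam drives all the parameter updates in the implementation, here is a minimal self-contained sketch of a single Adam update on a toy convex loss. The `adam_step` helper is purely illustrative (not from the post's code); note that it includes the bias-correction terms, which the training loop in the implementation omits for simplicity.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter theta (illustrative helper)."""
    m = beta1 * m + (1 - beta1) * grad          # first moment estimate
    v = beta2 * v + (1 - beta2) * grad**2       # second moment estimate
    m_hat = m / (1 - beta1**t)                  # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize the toy convex loss f(x) = (x - 3)^2 with Adam
x, m, v = np.zeros(1), np.zeros(1), np.zeros(1)
for t in range(1, 5001):
    grad = 2.0 * (x - 3.0)
    x, m, v = adam_step(x, grad, m, v, t)
```

The per-parameter moment estimates `m` and `v` are exactly the pairs of auxiliary arrays the implementation keeps alongside every weight matrix below.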

Implementation:

Let's read the ratings data and create a list of rating tuples (user_id, movie_id, rating):

```python
import os, random

import numpy as np
import pandas as pd
from scipy import sparse

dataset_path = "ml-20m"
ratings_df = pd.read_csv(os.path.join(dataset_path, "ratings.csv"), encoding="utf-8", sep=",")

user_id = list(ratings_df[u'userId'])
movie_id = list(ratings_df[u'movieId'])
ratings = list(ratings_df[u'rating'])

uid_mid_pairs = list(zip(user_id, movie_id, ratings))
```

Create a dict that maps from a user_id to a row index in the ratings matrix:

```python
uid_map = dict()
user_ids = sorted(set(user_id))
for idx, uid in enumerate(user_ids):
    uid_map[uid] = idx
```

Create another dict that maps from a movie_id to a column index in the ratings matrix:

```python
movies_df = pd.read_csv(os.path.join(dataset_path, "movies.csv"), encoding="utf-8", sep=",")
movie_ids = list(movies_df[u'movieId'])

# mid_map: movie_id -> column index; mid_reverse_map: column index -> movie_id
mid_map, mid_reverse_map = dict(), dict()
for idx, mid in enumerate(movie_ids):
    mid_reverse_map[idx] = mid
    mid_map[mid] = idx
```

Replace the user_id and movie_id values in the 'uid_mid_pairs' list of tuples with the corresponding row and column indices of the ratings matrix:

```python
for idx in range(len(uid_mid_pairs)):
    uid, mid, rating = uid_mid_pairs[idx]
    uid_mid_pairs[idx] = (uid_map[uid], mid_map[mid], rating)
```

Create a sparse CSR matrix from the ratings data:

```python
uids, mids, ratings = map(list, zip(*uid_mid_pairs))
ratings_matrix = sparse.csr_matrix((ratings, (uids, mids)), shape=(len(uid_map), len(mid_map)))

# Global mean rating, used by the baseline estimates below
mean_rating = np.mean(ratings)
```

Next we initialize the unknowns.

The weights 'weights1' is a matrix of dimensions MxM corresponding to the neighborhood weights $w_{jk}$ in the above equations. Similarly, the weights 'weights2' is also a matrix of dimensions MxM corresponding to the implicit weights $c_{jk}$, and the weights 'weights_y' is a matrix of dimensions MxK (K being the number of latent factors) corresponding to the weights $y_k$ in the above equations.

```python
n, m = ratings_matrix.shape
latent_dim = 64

weights1 = np.full((m, m), 1.0/m)
weights1_m, weights1_v = np.zeros((m, m)), np.zeros((m, m))

weights2 = np.full((m, m), 1.0/m)
weights2_m, weights2_v = np.zeros((m, m)), np.zeros((m, m))

weights_y = np.random.normal(0.0, 0.001, m * latent_dim).reshape((m, latent_dim))
weights_ym, weights_yv = np.zeros((m, latent_dim)), np.zeros((m, latent_dim))
```

For each weight, there are two corresponding auxiliary arrays (the first and second moment estimates) required for the Adam optimization.

Next we initialize the latent factor matrices P and Q:

```python
p = np.random.normal(0.0, 0.001, n * latent_dim).reshape((n, latent_dim))
q = np.random.normal(0.0, 0.001, m * latent_dim).reshape((m, latent_dim))

pm, qm = np.zeros((n, latent_dim)), np.zeros((m, latent_dim))
pv, qv = np.zeros((n, latent_dim)), np.zeros((m, latent_dim))
```

and then come the biases $b_i$ and $b_j$:

```python
bias_u, bias_m = np.zeros(n), np.zeros(m)
b1m, b2m = np.zeros(n), np.zeros(m)
b1v, b2v = np.zeros(n), np.zeros(m)
```

We need to split the data into training and validation. So we randomly shuffle the ratings and choose the first 5000 for validation and use the remaining for training our model.

```python
random.shuffle(uid_mid_pairs)
validation_data, training_data = uid_mid_pairs[:5000], uid_mid_pairs[5000:]
```

For our implementation purposes we will assume that the implicit feedback comes only from the ratings and will not use the tags as such. Thus the normalization constants $|R(i)|^{-1/2}$ and $|N(i)|^{-1/2}$ are the same. These are just the inverse square roots of the number of ratings for each user.

```python
num_ratings = np.bincount(ratings_matrix.nonzero()[0])
normalizations = 1.0/np.sqrt(num_ratings + 0.001)
```

We define another variable 'implicit', which is nothing but the ratings converted into binary (non-zero ratings replaced by 1) and then multiplied row-wise by the normalizations above.

```python
implicit = sparse.csr_matrix((ratings_matrix != 0).astype(int).T.multiply(normalizations).T)
```

Let's break down the SVD++ equation:

$\hat{r}_{ij} = \mu + b_i + b_j + n_{ij} + l_{ij}$

into the following components, where $n_{ij}$ is the neighborhood score:

$n_{ij} = |R(i)|^{-1/2} \sum_{k \in R(i)} w_{jk} (r_{ik} - b_{ik}) + |N(i)|^{-1/2} \sum_{k' \in N(i)} c_{jk'}$

and $l_{ij}$ is the latent factor score:

$l_{ij} = q_j^T \left( p_i + |N(i)|^{-1/2} \sum_{k' \in N(i)} y_{k'} \right)$

First we define the function for computing the neighborhood scores $n_{ij}$ for a batch of ratings data:

```python
def get_neighborhood_scores(ratings_matrix, weights1, weights2, bias_u, bias_m, mean_rating, mydata, normalizations, implicit):
    u_idx, m_idx, _ = map(list, zip(*mydata))
    diff1, diff2 = get_neighborhood_diffs(ratings_matrix, bias_u, bias_m, mean_rating, mydata, normalizations, implicit)

    # Explicit neighborhood term: weights1 (w_jk) applied to the normalized rating residuals
    a = diff1.multiply(weights1[m_idx]).sum(axis=1)
    # Implicit neighborhood term: weights2 (c_jk) applied to the normalized implicit feedback
    b = diff2.multiply(weights2[m_idx]).sum(axis=1)

    return np.squeeze(np.asarray(a + b))

def get_neighborhood_diffs(ratings_matrix, bias_u, bias_m, mean_rating, mydata, normalizations, implicit):
    u_idx, m_idx, _ = map(list, zip(*mydata))

    ratings = ratings_matrix[u_idx]
    baselines = np.add.outer(bias_u[u_idx], bias_m) + mean_rating

    # Residuals (r_ik - b_ik) for the rated movies only, normalized by |R(i)|^(-1/2)
    diff1 = sparse.csr_matrix((ratings.data - baselines[ratings.nonzero()], ratings.nonzero()), shape=baselines.shape)
    diff1 = diff1.T.multiply(normalizations[u_idx]).T

    # Normalized implicit feedback indicators for the selected users
    diff2 = implicit[u_idx]

    return diff1, diff2
```

The method 'get_neighborhood_scores' returns the neighborhood scores $n_{ij}$ for a batch of ratings tuples in the 'mydata' variable. Note that we rely heavily on matrix operations in order to optimize the run-time of building the model. We also use sparse matrices wherever possible so as to reduce the memory requirements.

Next we define the function for obtaining the latent factor scores $l_{ij}$ for a batch of ratings data:

```python
def get_latent_scores(ratings_matrix, p, q, weights_y, mydata, normalizations, implicit):
    u_idx, m_idx, _ = map(list, zip(*mydata))
    latent_nscores = get_latent_neighborhood_scores(ratings_matrix, weights_y, mydata, normalizations, implicit)
    return np.sum((p[u_idx] + latent_nscores) * q[m_idx], axis=1)

def get_latent_neighborhood_scores(ratings_matrix, weights_y, mydata, normalizations, implicit):
    u_idx, m_idx, _ = map(list, zip(*mydata))
    # Normalized sum of the implicit factor vectors y_k for each user in the batch
    return implicit[u_idx].dot(weights_y)
```

The method 'get_latent_scores' returns the latent factor scores $l_{ij}$ for a batch of ratings tuples in the 'mydata' variable.

Then we combine all the scores to obtain the final predicted ratings and the difference from the true ratings as follows:

```python
def get_ratings_errors(ratings_matrix, p, q, weights1, weights2, weights_y, bias_u, bias_m, mean_rating, mydata, normalizations, implicit):
    u_idx, m_idx, true_ratings = map(list, zip(*mydata))

    neighbor_scores = get_neighborhood_scores(ratings_matrix, weights1, weights2, bias_u, bias_m, mean_rating, mydata, normalizations, implicit)
    latent_scores = get_latent_scores(ratings_matrix, p, q, weights_y, mydata, normalizations, implicit)

    preds = bias_u[u_idx] + bias_m[m_idx] + mean_rating + latent_scores + neighbor_scores
    return true_ratings - preds
```

Once we have all the necessary functions in-place, we run our Adam optimizer to update the values of the unknowns as follows:

```python
eta, lambdas = 0.001, 0.1
beta1, beta2 = 0.9, 0.999
eps = 1e-8

batch_size = 32
num_iter, losses = 0, []

while True:
    num_iter += 1

    if num_iter % 1000 == 0:
        errs_validation = get_ratings_errors(ratings_matrix, p, q, weights1, weights2, weights_y, bias_u, bias_m, mean_rating, validation_data, normalizations, implicit)
        rmse_loss = np.sqrt(np.mean(errs_validation**2))
        losses.append(rmse_loss)
        print(rmse_loss)

        if rmse_loss < 0.5:
            break

    selected_pairs = random.sample(training_data, batch_size)
    errs_train = get_ratings_errors(ratings_matrix, p, q, weights1, weights2, weights_y, bias_u, bias_m, mean_rating, selected_pairs, normalizations, implicit)
    u_idx, m_idx, _ = map(list, zip(*selected_pairs))

    # Updates for the bias terms
    x, y = bias_u[u_idx], bias_m[m_idx]
    u1, v1 = b1m[u_idx], b1v[u_idx]
    u2, v2 = b2m[m_idx], b2v[m_idx]

    grad1, grad2 = -(errs_train - lambdas * x), -(errs_train - lambdas * y)

    u1 = beta1 * u1 + (1 - beta1) * grad1
    v1 = beta2 * v1 + (1 - beta2) * (grad1**2)
    x += -eta * u1/(np.sqrt(v1) + eps)

    u2 = beta1 * u2 + (1 - beta1) * grad2
    v2 = beta2 * v2 + (1 - beta2) * (grad2**2)
    y += -eta * u2/(np.sqrt(v2) + eps)

    bias_u[u_idx], bias_m[m_idx], b1m[u_idx], b1v[u_idx], b2m[m_idx], b2v[m_idx] = x, y, u1, v1, u2, v2

    # Updates for the neighborhood terms 'weights1' and 'weights2'
    diff1, diff2 = get_neighborhood_diffs(ratings_matrix, bias_u, bias_m, mean_rating, selected_pairs, normalizations, implicit)

    x, y = weights1[m_idx], weights2[m_idx]
    u1, v1 = weights1_m[m_idx], weights1_v[m_idx]
    u2, v2 = weights2_m[m_idx], weights2_v[m_idx]

    z1, z2 = np.array(diff1.T.multiply(errs_train).T.todense()), np.array(diff2.T.multiply(errs_train).T.todense())
    grad1, grad2 = -(z1 - lambdas * x), -(z2 - lambdas * y)

    u1 = beta1 * u1 + (1 - beta1) * grad1
    v1 = beta2 * v1 + (1 - beta2) * (grad1**2)
    x += -eta * u1/(np.sqrt(v1) + eps)

    u2 = beta1 * u2 + (1 - beta1) * grad2
    v2 = beta2 * v2 + (1 - beta2) * (grad2**2)
    y += -eta * u2/(np.sqrt(v2) + eps)

    weights1[m_idx], weights2[m_idx], weights1_m[m_idx], weights1_v[m_idx], weights2_m[m_idx], weights2_v[m_idx] = x, y, u1, v1, u2, v2

    # Updates for the latent factor terms P and Q
    x, y = p[u_idx], q[m_idx]
    neighborhood_scores = get_latent_neighborhood_scores(ratings_matrix, weights_y, selected_pairs, normalizations, implicit)

    u1, v1 = pm[u_idx], pv[u_idx]
    u2, v2 = qm[m_idx], qv[m_idx]

    z1, z2 = np.multiply(y.T, errs_train).T, np.multiply((x + neighborhood_scores).T, errs_train).T
    grad1, grad2 = -(z1 - lambdas * x), -(z2 - lambdas * y)

    u1 = beta1 * u1 + (1 - beta1) * grad1
    v1 = beta2 * v1 + (1 - beta2) * (grad1**2)
    x += -eta * u1/(np.sqrt(v1) + eps)

    u2 = beta1 * u2 + (1 - beta1) * grad2
    v2 = beta2 * v2 + (1 - beta2) * (grad2**2)
    y += -eta * u2/(np.sqrt(v2) + eps)

    p[u_idx], q[m_idx], pm[u_idx], pv[u_idx], qm[m_idx], qv[m_idx] = x, y, u1, v1, u2, v2

    # Updates for the latent implicit weights 'weights_y'
    x = weights_y[m_idx]
    u1, v1 = weights_ym[m_idx], weights_yv[m_idx]

    latents = np.multiply(q[m_idx].T, normalizations[u_idx]).T
    z1 = np.multiply(latents.T, errs_train).T
    grad1 = -(z1 - lambdas * x)

    u1 = beta1 * u1 + (1 - beta1) * grad1
    v1 = beta2 * v1 + (1 - beta2) * (grad1**2)
    x += -eta * u1/(np.sqrt(v1) + eps)

    weights_y[m_idx], weights_ym[m_idx], weights_yv[m_idx] = x, u1, v1
```

We use a constant learning rate of 0.001 and a regularization constant of 0.1. The batch size for each iteration is set to 32.

After every 1000 iterations, we compute the RMSE loss on the validation data to track the progress of our model. I have been running the above code in a Jupyter notebook, which has the advantage that I can stop and resume the training whenever I want. On resuming after a stop, the model picks up the last updated values of the unknowns and thus can learn continuously.

To give a glimpse of why the SVD++ algorithm does better than a plain baseline estimate or just the latent factor model from the last post, we compare the evolution of their validation losses below.

The following is the evolution of the RMSE loss on the validation data (1000 iteration interval) for just the baseline model:

After about 350x1000 iterations, the RMSE loss is around 0.91. Around convergence the RMSE loss was 0.87.

The following is the evolution of the RMSE loss on the validation data (1000 iteration interval) for the latent factor model (without implicit feedback):

After about 350x1000 iterations, the RMSE loss is around 0.875. Around convergence the RMSE loss was 0.855.

The following is the evolution of the RMSE loss on the validation data (1000 iteration interval) for the SVD++ model:

After about 350x1000 iterations, the RMSE loss is around 0.83. Around convergence the RMSE loss was 0.828.

Clearly the progression of the SVD++ model is better as compared to the baseline as well as to the latent factor model.
