
Designing Movie Recommendation Engines - Part III

In the last two parts of this series, we looked at how to design and implement a movie recommendation engine using the MovieLens 20 million ratings dataset. We covered some of the most common and standard techniques out there, namely content based recommendations, collaborative filtering (CF) and latent factor based matrix factorization (MF). CF and MF clearly emerged as the winners due to their accuracy and ease of implementation.

In this post we will integrate the best of both the neighborhood based CF approach and the latent factor approach into a common SVD++ framework and see how that impacts the RMSE score on a held-out validation dataset. This post is inspired by the following papers:

  1. Factorization Meets the Neighborhood.
  2. Advances in Collaborative Filtering.
  3. SVD++ and Probabilistic Matrix Factorization.

The idea is to create a single ratings model or framework which captures the following:

  1. Baseline estimates for users and movies, i.e. given only the bias of a user U and the bias of a movie M, what is the rating given to M by U?
    • The rating can be expressed as r_{ij}=\mu+b_i+b_j, where r_{ij} is the rating given by user i to movie j, \mu is the average of all ratings, b_i is the bias of user i and b_j is the bias of movie j.
    • Note that the bias of a user is independent of the movie and vice versa.
    • For example, consider the newly released movie Avengers: Infinity War, and assume that the average rating for all movies on IMDB is 3.5.
    • Given that it is an Avengers movie, which generally receives a lot of hype and publicity, and given the suspense created around it, the movie would have a high positive bias.
    • Let's assume that the bias for Infinity War is 0.7.
    • Similarly, if the user rating the movie is already a Marvel fan, then he has a positive bias too. Let's assume the positive bias for the user is 0.4. Then the baseline estimate for the rating is 3.5 + 0.7 + 0.4 = 4.6.
    • Similarly, if the user is not a Marvel fan or has not watched any of the prequels to Infinity War, then he would not appreciate the movie as much and would have a negative bias. If the bias for the user is -0.3, then the baseline estimate is 3.5 + 0.7 - 0.3 = 3.9.
    • The values for b_i and b_j can be estimated by minimizing the following loss function:
    • L=\sum_{ij}(r_{ij}-\mu-b_i-b_j)^2+\lambda*({\sum_i}b^2_i+{\sum_j}b^2_j).
    • One can use gradient descent to find the unknowns.
    • If there were only one unknown, i.e. either b_i or b_j, then the loss is a convex function and the unknown can be found by setting the partial derivative of L w.r.t. the unknown to 0.
    • For example, if b_j is held constant, then we can show that: b_i=\frac{\sum_j(r_{ij}-\mu-b_j)}{\lambda+|R_i|}, where |R_i| is the number of movies rated by user i (a minimal sketch of this alternating update is shown right after this list).
    • But the biases alone are not sufficient to capture all the complexities of the ratings. Let's denote the baseline estimate b_{ij}=\mu+b_i+b_j for all future improvements.
  2. Neighborhood Model
    • The user-based or item-based CF approaches are not model based approaches but heuristics similar to the K-Nearest Neighbors algorithm.
    • Instead of explicitly computing similarities between movies rated by a user, one can define a neighborhood model:
    • \hat{r}_{ij}=b_{ij}+\sum_k(r_{ik}-b_{ik})*w_{jk}
    • where the summation \sum_k is over all the movies rated by user i.
    • The weight w_{jk} measures the relevance of movies j and k to each other; it is not computed using cosine or Euclidean similarity but is learnt by the model.
    • The equation is similar to the PageRank formulation. If we denote P_j=r_{ij}-b_{ij}, then the equation becomes: P_j={\sum_k}w_{jk}*P_k
    • The weight w_{jk} is analogous to the fraction of out-going links from k to j. Unlike PageRank, this factor is learnt from the data.
    • Intuitively this means that the user will rate a movie high if he has rated similar movies highly.
    • For example, take LOTR parts 1 and 2. If the user has rated part 1 well above its baseline estimate, then he is expected to rate part 2 well above its baseline estimate too.
    • The loss function in this case would be:
    • L=\sum_{ij}(r_{ij}-b_{ij}-\sum_k(r_{ik}-b_{ik})*w_{jk})^2+\lambda*({\sum_i}b^2_i+{\sum_j}b^2_j+\sum_{jk}w^2_{jk}).
    • Note that the baseline estimates b_{ik} are assumed to be fixed and only b_{ij} is assumed to be unknown.
  3. Implicit Feedback
    • So far our model has only considered explicit feedback in the form of ratings given by users to movies.
    • This is problematic for a new user or a user who has rated only a few movies. In that case, the neighborhood model would not be able to capture the expected ratings for similar movies.
    • For example, if the user has watched but not rated LOTR part 1, then although the weight w_{jk} between LOTR parts 1 and 2 is very high, the rating is missing and hence the term (r_{ik}-b_{ik})*w_{jk} for LOTR part 1 would be 0.
    • In such cases, if we have data about whether users have shown some kind of interest in a movie, for example by watching it or renting the DVD (in the pre-Netflix-and-chill era), then we can incorporate such implicit feedback into the model.
    • Thus we need a separate factor that assigns weights to pairs of movies but without the explicit ratings term. Let's call these weights w'_{jk'}.
    • \hat{r}_{ij}=b_{ij}+\sum_k(r_{ik}-b_{ik})*w_{jk}+\sum_{k'}w'_{jk'}
    • The summation \sum_{k'} is over all movies k' in which user i has shown either explicit or implicit interest. The user may not have rated movie k', but may have just watched it. Thus the implicit set is at least as large as the explicit set, i.e. |{k'}| >= |{k}|.
    • Since we do not have such implicit feedback for our data, we can use the ratings themselves as implicit feedback or use the tags information file, i.e. the tags given by users to movies.
    • Assuming that we have a binary matrix X of users vs. implicit feedback, where x_{ij}=1 if user i has rated movie j or given a tag to movie j, and x_{ij}=0 otherwise, we get:
    • \hat{r}_{ij}=b_{ij}+\sum_k(r_{ik}-b_{ik})*w_{jk}+\sum_{k'}w'_{jk'}
    • The new loss function is as follows:
    • L=\sum_{ij}(r_{ij}-\hat{r}_{ij})^2+\lambda*({\sum_i}b^2_i+{\sum_j}b^2_j+\sum_{jk}w^2_{jk}+\sum_{jk'}w'^2_{jk'}).
    • We need to learn the weights w'_{jk'} in addition to b_i, b_j and w_{jk}.
  4. Normalization
    • Some users might watch or rate a lot of movies while others rate very few. The 'heavy viewers' receive quality recommendations while the 'casual watchers' receive poor recommendations.
    • While this seems to be the natural way to go, the idea of a recommendation engine is to engage more of the 'casual watchers' on the site.
    • One obvious way to do it is to normalize the neighborhood factors.
    • Standardize the terms (r_{ik}-b_{ik})*w_{jk} and w'_{jk'} to have mean 0 and standard deviation 1.0.
    • Then the sum of these terms can also be standardized (assuming normal distributions) to have mean 0 and std. dev. 1.0 using the following transformation:
    • \hat{r}_{ij}=b_{ij}+|H_i|^{-0.5}\sum_k(r_{ik}-b_{ik})*w_{jk}+|G_i|^{-0.5}\sum_{k'}w'_{jk'}
    • where |H_i| is the number of explicit ratings given by user i and |G_i| is the number of implicit feedbacks given by user i.
    • This normalization constant arises because the variance of the sum of N independent normal variables, each with std. dev. 1.0, is N; thus the std. dev. of the sum is \sqrt{N}.
  5. Latent Factor Model
    • Recall that in the last post we had explored how to factorize the ratings matrix R into two latent factor matrices P and Q.
    • \hat{r}_{ij}=b_{ij}+{\sum_{z=1}^Z}p_{iz}*q_{jz}
    • where Z is the number of latent factors.
    • The difference between the above neighborhood model and the latent factor model is that the neighborhood model does not consider the interaction between the users and movies.
    • The latent factor model trains only on the explicit ratings; thus, in order to introduce implicit feedback into the latent factor model, we introduce a weight matrix Y of dimensions MxZ (M movies by Z latent factors):
    • \hat{r}_{ij}=b_{ij}+{\sum_{z=1}^Z}p_{iz}*q_{jz}+|G_i|^{-0.5}\sum_{k'}{\sum_{z=1}^Z}q_{jz}*y_{k'z}
    • \hat{r}_{ij}=b_{ij}+{\sum_{z=1}^Z}q_{jz}*(p_{iz}+|G_i|^{-0.5}\sum_{k'}y_{k'z})
    • where again \sum_{k'} is over the movies in which user i has shown implicit interest.
  6. SVD++
    • The SVD++ method is an extension of the latent factor SVD algorithm that also incorporates the neighborhood information as well as implicit feedback from users.
    • Finally we can integrate the neighborhood model with the latent factor model as follows:
    • \hat{r}_{ij}=b_{ij}+{\sum_{z=1}^Z}q_{jz}*(p_{iz}+|G_i|^{-0.5}\sum_{k'}y_{k'z})+|H_i|^{-0.5}\sum_k(r_{ik}-b_{ik})*w_{jk}+|G_i|^{-0.5}\sum_{k'}w'_{jk'}
    • The final loss function with regularization is as follows:
    • L=\sum_{ij}(r_{ij}-\hat{r}_{ij})^2+\lambda*({\sum_i}b^2_i+{\sum_j}b^2_j+\sum_{jk}w^2_{jk}+\sum_{jk'}w'^2_{jk'}+\sum_{iz}p^2_{iz}+\sum_{jz}q^2_{jz}+\sum_{jz}y^2_{jz}).
    • The unknowns can be found using gradient descent techniques as seen in the last post.
    • For speed and efficiency we will be using mini-batch gradient descent along with the Adam optimization technique.
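
To make the baseline estimates of item 1 concrete, here is a minimal sketch of the alternating closed-form updates for b_i and b_j (this is only an illustration with an arbitrary \lambda and number of sweeps; the actual model below learns the biases jointly with everything else via gradient descent):

import numpy as np
from scipy import sparse

def fit_baseline_biases(R, lam=10.0, num_sweeps=10):
    # R: sparse user x movie ratings matrix, 0 means 'not rated'
    mu = R.data.mean()  # global mean over the observed ratings only
    b_user, b_movie = np.zeros(R.shape[0]), np.zeros(R.shape[1])
    rows, cols = R.nonzero()
    vals = np.asarray(R[rows, cols]).ravel()
    for _ in range(num_sweeps):
        # b_i = sum_j (r_ij - mu - b_j) / (lambda + |R_i|), holding the movie biases fixed
        resid = vals - mu - b_movie[cols]
        b_user = np.bincount(rows, weights=resid, minlength=R.shape[0]) / (lam + np.bincount(rows, minlength=R.shape[0]))
        # b_j = sum_i (r_ij - mu - b_i) / (lambda + |R_j|), holding the user biases fixed
        resid = vals - mu - b_user[rows]
        b_movie = np.bincount(cols, weights=resid, minlength=R.shape[1]) / (lam + np.bincount(cols, minlength=R.shape[1]))
    return mu, b_user, b_movie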

Implementation:

Let's read the ratings data and create a list of rating tuples (user_id, movie_id, rating):

from collections import defaultdict
import numpy as np
import os, random
import pandas as pd
import scipy
from scipy import sparse
dataset_path = "ml-20m"
ratings_df = pd.read_csv(os.path.join(dataset_path, "ratings.csv"), encoding="utf-8", sep=",")
user_id, movie_id, ratings = list(ratings_df[u'userId']), list(ratings_df[u'movieId']), list(ratings_df[u'rating'])
uid_mid_pairs = list(zip(user_id, movie_id, ratings))

Create a dict that maps from a user_id to a row index in the ratings matrix:

uid_map = dict()
user_ids = sorted(list(set(user_id)))
for idx in range(len(user_ids)):
    uid_map[user_ids[idx]] = idx

Create another dict that maps from a movie_id to a column index in the ratings matrix:

movies_df = pd.read_csv(os.path.join(dataset_path, "movies.csv"), encoding="utf-8", sep=",")
movie_ids = list(movies_df[u'movieId'])
mid_map, mid_reverse_map = dict(), dict()
for idx in range(len(movie_ids)):
    mid_reverse_map[idx] = movie_ids[idx]
    mid_map[movie_ids[idx]] = idx

Replace the user_id and movie_id values in the 'uid_mid_pairs' list of tuples with the corresponding row and column indices of the ratings matrix:

for idx in range(len(uid_mid_pairs)):
    uid, mid, rating = uid_mid_pairs[idx]
    uid_mid_pairs[idx] = (uid_map[uid], mid_map[mid], rating)

Create a sparse CSR matrix from the ratings data:

uids, mids, ratings = map(list, zip(*uid_mid_pairs))
ratings_matrix = sparse.csr_matrix((ratings, (uids, mids)), shape=(len(uid_map), len(mid_map)))
# global mean of all ratings, i.e. \mu in the baseline estimates
mean_rating = np.mean(ratings)

Next we initialize the unknowns.

The matrix 'weights1' has dimensions MxM (M being the number of movies) and corresponds to the neighborhood weights w_{jk} in the above equations. Similarly, 'weights2' is also an MxM matrix corresponding to the implicit weights w'_{jk'}, and 'weights_y' is a matrix of dimensions MxZ (Z being the number of latent factors, 'latent_dim' in the code) corresponding to the weights y_{k'z} in the above equations.

n, m = ratings_matrix.shape
latent_dim = 64
weights1 = np.full((m, m), 1.0/m)
weights1_m, weights1_v = np.zeros((m, m)), np.zeros((m, m))
weights2 = np.full((m, m), 1.0/m)
weights2_m, weights2_v = np.zeros((m, m)), np.zeros((m, m))
weights_y = np.random.normal(0.0, 0.001, m * latent_dim).reshape((m, latent_dim))
weights_ym, weights_yv = np.zeros((m, latent_dim)), np.zeros((m, latent_dim))

For each weight matrix, there are two corresponding auxiliary arrays holding the first and second moment estimates required by the Adam optimizer.
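
For reference, these are the Adam updates we will apply to every unknown \theta given a mini-batch gradient g_t (the training loop below keeps the raw moment estimates and skips Adam's bias-correction step):

m_t=\beta_1*m_{t-1}+(1-\beta_1)*g_t

v_t=\beta_2*v_{t-1}+(1-\beta_2)*g^2_t

\theta_t=\theta_{t-1}-\eta*\frac{m_t}{\sqrt{v_t}+\epsilon}

In the code, the arrays ending in 'm' (e.g. 'b1m', 'weights1_m') hold m_t, the arrays ending in 'v' hold v_t, \eta is 'eta' and \epsilon is 'eps'.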

Next we initialize the latent factor matrices P and Q:

p = np.random.normal(0.0, 0.001, n * latent_dim).reshape((n, latent_dim))
q = np.random.normal(0.0, 0.001, m * latent_dim).reshape((m, latent_dim))
pm, qm = np.zeros((n, latent_dim)), np.zeros((m, latent_dim))
pv, qv = np.zeros((n, latent_dim)), np.zeros((m, latent_dim))

and then come the biases b_i and b_j:

bias_u, bias_m = np.zeros(n), np.zeros(m)
b1m, b2m = np.zeros(n), np.zeros(m)
b1v, b2v = np.zeros(n), np.zeros(m)

We need to split the data into training and validation sets. We randomly shuffle the ratings, choose the first 5000 for validation and use the remaining ones for training our model.

random.shuffle(uid_mid_pairs)
validation_data, training_data = uid_mid_pairs[:5000], uid_mid_pairs[5000:]

For our implementation we will assume that the implicit feedback comes only from the ratings and we will not use the tags. Thus the normalization constants |H_i| and |G_i| are the same. The 'normalizations' array below holds the inverse square root of the number of ratings for each user.

num_ratings = np.bincount(ratings_matrix.nonzero()[0])
normalizations = 1.0/np.sqrt(num_ratings + 0.001)

We define another variable 'implicit', which is simply the ratings matrix converted into a binary indicator (1 where a rating exists, 0 otherwise) and then multiplied row-wise by the normalizations above.

implicit = sparse.csr_matrix((ratings_matrix != 0).astype(int).T.multiply(normalizations).T)
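
As a quick sanity check (illustrative only), the sum of row i of 'implicit' should be approximately \sqrt{|G_i|}, since the row contains |G_i| entries each equal to roughly |G_i|^{-0.5}:

print(implicit[0].sum(), np.sqrt(num_ratings[0]))  # the two values should be close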

Let's break down the SVD++ equation:

\hat{r}_{ij}=b_{ij}+{\sum_{z=1}^Z}q_{jz}*(p_{iz}+|G_i|^{-0.5}\sum_{k'}y_{k'z})+|H_i|^{-0.5}\sum_k(r_{ik}-b_{ik})*w_{jk}+|G_i|^{-0.5}\sum_{k'}w'_{jk'}

into following components:

\hat{r}_{ij}=b_{ij}+X_{ij}+Y_{ij}

where

X_{ij}={\sum_{z=1}^Z}q_{jz}*(p_{iz}+|G_i|^{-0.5}\sum_{k'}y_{k'z}) and

Y_{ij}=|H_i|^{-0.5}\sum_k(r_{ik}-b_{ik})*w_{jk}+|G_i|^{-0.5}\sum_{k'}w'_{jk'}

First we define the function for getting Y_{ij} from a batch of ratings data:

def get_neighborhood_scores(ratings_matrix, weights1, weights2, bias_u, bias_m, mean_rating, mydata, normalizations, implicit):
    u_idx, m_idx, _ = map(list, zip(*mydata))
    diff1, diff2 = get_neighborhood_diffs(ratings_matrix, bias_u, bias_m, mean_rating, mydata, normalizations, implicit)
    # explicit neighborhood term: |H_i|^-0.5 * sum_k (r_ik - b_ik) * w_jk
    a = diff1.multiply(weights1[m_idx]).sum(axis=1)
    # implicit neighborhood term: |G_i|^-0.5 * sum_k' w'_jk'
    b = diff2.multiply(weights2[m_idx]).sum(axis=1)
    return np.squeeze(np.asarray(a + b))

def get_neighborhood_diffs(ratings_matrix, bias_u, bias_m, mean_rating, mydata, normalizations, implicit):
    u_idx, m_idx, _ = map(list, zip(*mydata))
    ratings = ratings_matrix[u_idx]
    # baseline estimates b_ik = mu + b_i + b_k for every (user in batch, movie) pair
    baselines = np.add.outer(bias_u[u_idx], bias_m) + mean_rating
    # normalized residuals (r_ik - b_ik), kept sparse on the rated entries only
    diff1 = sparse.csr_matrix((ratings.data - baselines[ratings.nonzero()], ratings.nonzero()), shape=baselines.shape)
    diff1 = diff1.T.multiply(normalizations[u_idx]).T
    # normalized implicit feedback indicators for the users in the batch
    diff2 = implicit[u_idx]
    return diff1, diff2

The method 'get_neighborhood_scores' returns the values Y_{ij} for a batch of rating tuples in the 'mydata' variable. Note that we rely heavily on matrix operations in order to optimize the run time of building the model. We also use sparse matrices wherever possible so as to reduce the memory requirements.
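
For example, assuming the variables defined so far, the neighborhood component for a handful of validation tuples can be computed as a quick sanity check (this call is illustrative and not part of the training loop):

sample_scores = get_neighborhood_scores(ratings_matrix, weights1, weights2, bias_u, bias_m, mean_rating, validation_data[:10], normalizations, implicit)
print(sample_scores.shape)  # expected: (10,)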

Next we define the function for obtaining X_{ij} from a batch of ratings data:

def get_latent_scores(ratings_matrix, p, q, weights_y, mydata, normalizations, implicit):
    u_idx, m_idx, true_ratings = map(list, zip(*mydata))
    # |G_i|^-0.5 * sum_k' y_k'z : one Z-dimensional implicit vector per user in the batch
    latent_nscores = get_latent_neighborhood_scores(ratings_matrix, weights_y, mydata, normalizations, implicit)
    # dot product of q_j with (p_i + implicit latent vector)
    return np.sum((p[u_idx] + latent_nscores) * q[m_idx], axis=1)

def get_latent_neighborhood_scores(ratings_matrix, weights_y, mydata, normalizations, implicit):
    u_idx, m_idx, _ = map(list, zip(*mydata))
    return implicit[u_idx].dot(weights_y)

The method 'get_latent_scores' returns the values X_{ij} for a batch of ratings tuples in the 'mydata' variable.

Then we combine all the scores to obtain the final predicted ratings and the difference from the true ratings as follows:

def get_ratings_errors(ratings_matrix, p, q, weights1, weights2, weights_y, bias_u, bias_m, mean_rating, mydata, normalizations, implicit):
    u_idx, m_idx, true_ratings = map(list, zip(*mydata))
    # Y_ij : neighborhood component (explicit + implicit)
    neighbor_scores = get_neighborhood_scores(ratings_matrix, weights1, weights2, bias_u, bias_m, mean_rating, mydata, normalizations, implicit)
    # X_ij : latent factor component (with implicit feedback folded in)
    latent_scores = get_latent_scores(ratings_matrix, p, q, weights_y, mydata, normalizations, implicit)
    preds = bias_u[u_idx] + bias_m[m_idx] + mean_rating + latent_scores + neighbor_scores
    return np.array(true_ratings) - preds

Once we have all the necessary functions in-place, we run our Adam optimizer to update the values of the unknowns as follows:

eta, lambdas = 0.001, 0.1
beta1, beta2 = 0.9, 0.999
eps = 1e-8
batch_size = 32
num_iter, losses = 0, []

while True:
    num_iter += 1

    # every 1000 iterations, track the RMSE on the validation data
    if num_iter % 1000 == 0:
        errs_validation = get_ratings_errors(ratings_matrix, p, q, weights1, weights2, weights_y, bias_u, bias_m, mean_rating, validation_data, normalizations, implicit)
        rmse_loss = np.sqrt(np.mean(errs_validation**2))
        losses.append(rmse_loss)
        print(rmse_loss)

        if rmse_loss < 0.5:
            break

    # sample a mini-batch and compute the prediction errors
    selected_pairs = random.sample(training_data, batch_size)
    errs_train = get_ratings_errors(ratings_matrix, p, q, weights1, weights2, weights_y, bias_u, bias_m, mean_rating, selected_pairs, normalizations, implicit)
    u_idx, m_idx, _ = map(list, zip(*selected_pairs))

    # Updates for the bias terms
    x, y = bias_u[u_idx], bias_m[m_idx]
    u1, v1 = b1m[u_idx], b1v[u_idx]
    u2, v2 = b2m[m_idx], b2v[m_idx]
    grad1, grad2 = -(errs_train - lambdas * x), -(errs_train - lambdas * y)
    u1 = beta1 * u1 + (1 - beta1) * grad1
    v1 = beta2 * v1 + (1 - beta2) * (grad1**2)
    x += -eta * u1/(np.sqrt(v1) + eps)
    u2 = beta1 * u2 + (1 - beta1) * grad2
    v2 = beta2 * v2 + (1 - beta2) * (grad2**2)
    y += -eta * u2/(np.sqrt(v2) + eps)
    bias_u[u_idx], bias_m[m_idx], b1m[u_idx], b1v[u_idx], b2m[m_idx], b2v[m_idx] = x, y, u1, v1, u2, v2

    # Updates for the neighborhood terms 'weights1' and 'weights2'
    diff1, diff2 = get_neighborhood_diffs(ratings_matrix, bias_u, bias_m, mean_rating, selected_pairs, normalizations, implicit)
    x, y = weights1[m_idx], weights2[m_idx]
    u1, v1 = weights1_m[m_idx], weights1_v[m_idx]
    u2, v2 = weights2_m[m_idx], weights2_v[m_idx]
    z1, z2 = np.array(diff1.T.multiply(errs_train).T.todense()), np.array(diff2.T.multiply(errs_train).T.todense())
    grad1, grad2 = -(z1 - lambdas * x), -(z2 - lambdas * y)
    u1 = beta1 * u1 + (1 - beta1) * grad1
    v1 = beta2 * v1 + (1 - beta2) * (grad1**2)
    x += -eta * u1/(np.sqrt(v1) + eps)
    u2 = beta1 * u2 + (1 - beta1) * grad2
    v2 = beta2 * v2 + (1 - beta2) * (grad2**2)
    y += -eta * u2/(np.sqrt(v2) + eps)
    weights1[m_idx], weights2[m_idx], weights1_m[m_idx], weights1_v[m_idx], weights2_m[m_idx], weights2_v[m_idx] = x, y, u1, v1, u2, v2

    # Updates for the latent factor terms P and Q
    x, y = p[u_idx], q[m_idx]
    neighborhood_scores = get_latent_neighborhood_scores(ratings_matrix, weights_y, selected_pairs, normalizations, implicit)
    u1, v1 = pm[u_idx], pv[u_idx]
    u2, v2 = qm[m_idx], qv[m_idx]
    z1, z2 = np.multiply(y.T, errs_train).T, np.multiply((x + neighborhood_scores).T, errs_train).T
    grad1, grad2 = -(z1 - lambdas * x), -(z2 - lambdas * y)
    u1 = beta1 * u1 + (1 - beta1) * grad1
    v1 = beta2 * v1 + (1 - beta2) * (grad1**2)
    x += -eta * u1/(np.sqrt(v1) + eps)
    u2 = beta1 * u2 + (1 - beta1) * grad2
    v2 = beta2 * v2 + (1 - beta2) * (grad2**2)
    y += -eta * u2/(np.sqrt(v2) + eps)
    p[u_idx], q[m_idx], pm[u_idx], pv[u_idx], qm[m_idx], qv[m_idx] = x, y, u1, v1, u2, v2

    # Updates for the latent implicit weights 'weights_y'
    x = weights_y[m_idx]
    u1, v1 = weights_ym[m_idx], weights_yv[m_idx]
    latents = np.multiply(q[m_idx].T, normalizations[u_idx]).T
    z1 = np.multiply(latents.T, errs_train).T
    grad1 = -(z1 - lambdas * x)
    u1 = beta1 * u1 + (1 - beta1) * grad1
    v1 = beta2 * v1 + (1 - beta2) * (grad1**2)
    x += -eta * u1/(np.sqrt(v1) + eps)
    weights_y[m_idx], weights_ym[m_idx], weights_yv[m_idx] = x, u1, v1

We fix the constant learning rate to 0.001 and the regularization constant as 0.1. The batch size for each iteration is set to 32.

Every 1000 iterations, we compute the RMSE loss on the validation data to track the progress of our model. I have been running the above code in a Jupyter notebook, which gives the advantage that I can stop and restart the training whenever I want. On restarting the training after a stop, the model picks up the last updated values of the unknowns and thus can keep learning continuously.
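
To be extra safe across kernel restarts, one can also persist the learnt parameters to disk; here is a minimal sketch using np.savez (the file name and the choice of saving only the parameters, not the Adam moment arrays, are just illustrative):

def save_checkpoint(path='svdpp_checkpoint.npz'):
    # persist the learnt parameters so that training can resume after a kernel restart
    np.savez(path, p=p, q=q, weights1=weights1, weights2=weights2, weights_y=weights_y, bias_u=bias_u, bias_m=bias_m)

def load_checkpoint(path='svdpp_checkpoint.npz'):
    data = np.load(path)
    return data['p'], data['q'], data['weights1'], data['weights2'], data['weights_y'], data['bias_u'], data['bias_m']

Note that 'weights1' and 'weights2' are dense MxM matrices, so the checkpoint file can get quite large.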

To give a glimpse of why the SVD++ algorithm does better than a plain baseline estimate or just the latent factor model seen in the last post, we compare the validation RMSE curves of the three models below.

The following is the evolution of the RMSE loss on the validation data (1000 iteration interval) for just the baseline model:

\hat{r}_{ij}=b_{ij}

Evolution of RMSE loss with number of iterations. (Baseline model)

After about 350x1000 iterations, the RMSE loss is around 0.91. Around convergence the RMSE loss was 0.87.

The following is the evolution of the RMSE loss on the validation data (1000 iteration interval) for the latent factor model (without implicit feedback):

\hat{r}_{ij}=b_{ij}+{\sum_{z=1}^Z}q_{jz}*p_{iz}

Evolution of RMSE loss with number of iterations. (Latent Factor model)

After about 350x1000 iterations, the RMSE loss is around 0.875. Around convergence the RMSE loss was 0.855.

The following is the evolution of the RMSE loss on the validation data (1000 iteration interval) for the SVD++ model:

\hat{r}_{ij}=b_{ij}+{\sum_{z=1}^Z}q_{jz}*(p_{iz}+|G_i|^{-0.5}\sum_{k'}y_{k'z})+|H_i|^{-0.5}\sum_k(r_{ik}-b_{ik})*w_{jk}+|G_i|^{-0.5}\sum_{k'}w'_{jk'}

Evolution of RMSE loss with number of iterations. (SVD++ model)

After about 350x1000 iterations, the RMSE loss is around 0.83. Around convergence the RMSE loss was 0.828.

Clearly the progression of the SVD++ model is better than both the baseline and the latent factor model.
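
Once the model is trained, the same scoring functions can be reused to rank movies for a user. Below is a rough, illustrative sketch (the helper 'score_movies_for_user' is not part of the original code; it feeds dummy ratings of 0 through 'get_ratings_errors' so that the negated errors are exactly the predicted ratings, and it should be called on a modest list of candidate movies since the neighborhood step densifies one row per candidate):

def score_movies_for_user(user_idx, candidate_movie_idxs):
    # build (user, movie, dummy rating) tuples and score them with the trained parameters
    batch = [(user_idx, m_idx, 0.0) for m_idx in candidate_movie_idxs]
    errs = get_ratings_errors(ratings_matrix, p, q, weights1, weights2, weights_y, bias_u, bias_m, mean_rating, batch, normalizations, implicit)
    return -errs  # errors are (true - predicted) and the dummy true ratings are 0

For example, scoring the first 100 movie columns for user 0 and mapping the top 10 back to MovieLens movie ids:

preds = score_movies_for_user(0, range(100))
top_movie_ids = [mid_reverse_map[m] for m in np.argsort(-preds)[:10]]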
