Machine Learning, AI and Programming

Designing Movie Recommendation Engines - Part I

In this post, we would be looking to design a movie recommendation engine with the MovieLens dataset. We will not be designing the architecture of such a system, but will be looking at different methods by which one can recommend movies to users that minimizes the root mean squared error of the predicted ratings from the actual ratings on a hold out validation dataset.

The intuitive way to recommend movies to an user is by predicting which movies the user would like to watch. Thus it is a binary classification problem where for each user, I have a model that predicts 1, if user would like a movie and 0 if user would not like it. But it does not end there. There could be thousands of movies for which the model predicts 1, but if we recommend all the thousand movies in a random order then the chances of the user being interested in the recommendations would be low, because the one movie that the user is most likely to watch could be located at 758th or 999th position in the recommended list.

We need to rank the movies. If we use a soft-max function at the output, then the predicted probability of 1 i.e. user liking a movie can be used to rank the recommendations.

But there are challenges with this approach :

  1. Algorithm requires a model for each user. Feasible with a few thousand users but for million users, poses a challenge for not only versioning and storing of models but updating models regularly.
  2. The data that we have, has only ratings. We need to infer whether an user likes (1) or dislikes (0) a movie based on the rating.
    • One simple way is to take the average rating an user has given. If the user has rated a movie above the average, then label it as 1 else 0.
    • c_{ij} = 1 if 0" /> else c_{ij} = 0
    • where c_{ij} is the class label for user i and movie j, r_{ij} is the rating given by user i to movie j and \hat{r_i} is the average rating given by user i.
  3. Algorithm requires knowing attributes or features of the movies in-order to build a classifier for each user.
    • Recall that in-order to build a binary classification model, we need a feature matrix (preferably TF-IDF weighted) and set of labels for each row of the matrix.
    • If for each movie, we had the movie plot, summary, actors/directors, year of release, genre etc. then for all movies that an user has rated, we can obtain a feature matrix out of these data and a label 1 or 0 based on the rating the user has given to the movie.
    • With the feature matrix for the un-rated movies, the binary classification model can predict 1 (like) or 0 (dislike) or can predict the probabilities for each row.
  4. Sparsity of ratings matrix.
    • The 20 millions ratings dataset has ratings for about 27K movies rated by 138K users. The sparsity is around 99.5% i.e. 1 - (20 / (138*27))
    • The average number of ratings given by any user is for about 144 movies, while the median is only 68 movies, i.e. 50% of the movies receives a rating from less than 68 users.
    • The distribution of number of ratings is very skewed. Very few users have given very large number of ratings. Maximum ratings any movie has received is around 10K.
    • Any binary classification model with only 144 training examples (would be lesser, if we also split it into training + validation data), would not be good enough, assuming that there could be lots of dimensionalities around which an user prefers to watch a movie.

Distribution of the Number of Ratings.

Note that instead of a binary classification problem, we can also use the following strategies :

  • Multi-class classifier, where the classes are 1-5 corresponding to the number of stars the user has given.
  • Regression problem. Minimize the RMSE loss between the actual rating and the predicted rating. The predicted rating in this case would be a real number and not an integer.

The above strategy of building a recommendation engine is known as Content Based Recommendation Algorithm.

We do not have sufficient data about features of movies, but we have data about how users have tagged movies they have watched. Based on all the tags for a movie along with the genre information, we create the TF-IDF weighted movie-tag matrix.

import os
import pandas as pd
from collections import defaultdict
import numpy as np
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

dataset_path = "ml-20m"

ratings_df = pd.read_csv(os.path.join(dataset_path, "ratings.csv"), encoding="utf-8", sep=",")

user_id, movie_id, ratings = list(ratings_df[u'userId']), list(ratings_df[u'movieId']), list(ratings_df[u'rating'])

user_ratings_map = defaultdict(dict)

#Create a map for storing the ratings against an user id and a movie id as the key.
for idx in range(len(user_id)):
    user_ratings_map[user_id[idx]][movie_id[idx]] = float(ratings[idx])

#Read csv file for tags
tags_df = pd.read_csv(os.path.join(dataset_path, "tags.csv"), encoding="utf-8", sep=",")

#Read csv file for genres
genres_df = pd.read_csv(os.path.join(dataset_path, "movies.csv"), encoding="utf-8", sep=",")

stop_words = set(stopwords.words('english'))

movie_id, tags = list(tags_df[u'movieId']), list(tags_df[u'tag'])

tags = [str(tag) for tag in tags]

movie_tag_map = defaultdict(list)

for idx in range(len(movie_id)):
    tag = tags[idx].lower()
    tag = re.sub("[^a-zA-Z0-9 ]", " ", tag)
    tag = tag.strip()
    tag = re.sub("\s+", " ", tag)
    if len(tag) > 0:
        tag_words = tag.split(" ")
        tag = " ".join([x for x in tag_words if x not in stop_words])
movie_id, genres = list(genres_df[u'movieId']), list(genres_df[u'genres'])

for idx in range(len(movie_id)):
    genre = genres[idx].lower()
    all_genres = genre.split("|")
    for gen in all_genres:
movie_tags = []
movie_ids_index = defaultdict(int)

movie_ids = [m_id for m_id, _ in movie_tag_map.items()]

for idx in range(len(movie_ids)):
    m_id = movie_ids[idx]
    movie_ids_index[m_id] = idx

#Create the TF-IDF weighted movie-tag matrix
vectorizer = TfidfVectorizer(tokenizer=lambda sent: sent.split("###"), ngram_range=(1,1), stop_words='english')
movie_tag_mat = vectorizer.fit_transform(movie_tags)

Instead of solving the classification problem, we will be solving the regression problem, with the user ratings as is, because the regression already gives predicted rating, on which we can sort.

For training a model with each user id, we do a 5-fold cross validation with shuffling, to get the estimated validation error from the model.

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

def get_user_model(user_id, user_ratings_map, movie_tag_mat, movie_ids_index):
    movie_ids = [m_id for m_id, rating in user_ratings_map[user_id].items() if m_id in movie_ids_index]
    movie_ids_rows = [movie_ids_index[m_id] for m_id in movie_ids]
    labels = np.array([rating for m_id, rating in user_ratings_map[user_id].items() if m_id in movie_ids_index])

    train_data = movie_tag_mat[movie_ids_rows,:]
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    errors = []
    for train_index, test_index in kf.split(train_data):
        X_train, X_test = train_data[train_index], train_data[test_index]
        y_train, y_test = labels[train_index], labels[test_index]

        model = LinearRegression(), y_train)
        preds = model.predict(X_test)
        errors.append(mean_squared_error(y_test, preds))
    model = LinearRegression(), labels)
    return model, np.mean(errors)

To get the recommendations for a particular user id, we can use the following function :

def get_recommendations(model, user_id, user_ratings_map, movie_tag_mat, movie_ids_index, num_rec=10):
    rated_movie_ids = set([m_id for m_id, _ in user_ratings_map[user_id].items()])
    unrated_movie_ids = [m_id for m_id, idx in movie_ids_index.items() if m_id not in rated_movie_ids and m_id in movie_ids_index]
    movie_ids_rows = [movie_ids_index[m_id] for m_id in movie_ids]
    test_data = movie_tag_mat[movie_ids_rows,:]
    preds = model.predict(test_data)
    preds = sorted(zip(unrated_movie_ids, preds), key=lambda k:-k[1])
    return preds[:min(len(preds), num_rec)]

It returns the top 10 recommended movie ids, sorted by the recommendation scores highest to lowest.

In order to evaluate our user models, for each user, we explicitly separate out 1% of the ratings for validation. We train the model on the remaining 99% of the ratings and also report the mean cross-validation error on the 99% data for each user. Lastly we predict the ratings for the 1% validation data using the corresponding user models and report the RMSE loss.

We only consider users who has rated 500 movies or more. Then we select the first 100 of such users.

import random

selected_user_ids = [user_id for user_id, rate_map in user_ratings_map.items() if len(rate_map) > 500]
selected_user_ids = selected_user_ids[:100]

validation_data = []

for user_id in selected_user_ids:
    movie_ratings_map = user_ratings_map[user_id]
    movie_ids = [m_id for m_id, rating in movie_ratings_map.items()]
    sampled_movie_ids = random.sample(movie_ids, int(0.01 * len(movie_ids)))
    for s_id in sampled_movie_ids:
        validation_data.append((user_id, s_id, user_ratings_map[user_id][s_id]))

Validation data contains around 800 user, movie pairs.

Train a model for each of the 100 selected user ids.

models = dict()

#Remove the ratings that are present in the validation data before modelling.
for user_id, movie_id, rating in validation_data:

for user_id in selected_user_ids:
    model, err = get_user_model(user_id, user_ratings_map, movie_tag_mat, movie_ids_index)
    models[user_id] = model
    print user_id, err

And then compute the RMSE loss on the validation data.

sums = 0.0

for user_id, movie_id, actual_rating in validation_data:
    model = models[user_id]
    pred_rating = get_predicted_rating(model, movie_id, movie_tag_mat, movie_ids_index)
    sums += (pred_rating - actual_rating) ** 2.0
print math.sqrt(float(sums)/len(validation_data))

The predicted rating function is defined as follows.

def get_predicted_rating(model, movie_id, movie_tag_mat, movie_ids_index):
    test_data = movie_tag_mat[movie_ids_index[movie_id],:]
    preds = model.predict(test_data)
    return preds[0]

The RMSE loss on the validation data comes to around 1.02, which implies that our model can predict a rating for an unrated movie for the same user, which can be off by 1 star or if the the user had given a rating of 4, the model can predict a rating of 3 or 5.

Since the number of training examples per user is very limited, as well as information about users and movies, we will not pursue this approach but we will pursue two other different approaches to recommendations :

  1. Collaborative Filtering. Recommending movies based on how similar users have rated (user based CF) or what similar movies the user has rated (item based CF).
  2. Matrix Factorisation. Since there is very less explicit data about features of a movie or preferences for those features by an user, we compute implicit latent features of movies from the given ratings matrix.

The difference between collaborative filtering and our earlier content based approach is that, in content based approach each user and each movie was independent, whereas in CF, we predict the rating by an user for a movie based on how similar users have rated that movie or how the same user have rated similar movies.

Collaborative Filtering. Source : Towards Data Science.

The CF approach is somewhat similar to clustering. Given the ratings matrix R, one can either cluster on the rows i.e. users, or the columns i.e. movies.

If for example, we cluster on the rows, i.e. the users, then after clustering, find the cluster in which our user lies. For any unrated movie by the user, fill it with the corresponding mean rating from all users in that cluster, who have rated that movie. Computing the mean with all rows and not only the rows with ratings, assumes that the user has given a 0 rating for that movie.

The way recommendation works is that, whenever an user logs into the site, show him/her the top 10-15 recommendations. Computing the mean ratings for all movies the user have not watched or rated, in real time would take time complexity of O(MN + NlogN), where M is the number of users in that cluster and N is the number of movies. The NlogN term is due to sorting the ratings highest to lowest.

Assuming that there are N=50K movies and each cluster holds around M=1000 users, then the number of computations is quite large to be accomplished quickly in real time. Better way is to pre-compute the recommendations for each user and store it in some Redis cluster.

With real-time computations the user would have got most up-to-date recommendations but it is of no use if the recommendations take 5-10 minutes to load.

There are several challenges also with this approach :

  • As seen in our earlier blog posts on Question-Answering, that there is trade-off in using centroid based clustering over hierarchical clustering and vice-versa.
    • With large number of users (a million), hierarchical clustering would require O(M2) = 1012 number of pairwise computations, where M is the total number of users.
    • On the other hand centroid based clustering such as K-Means is O(M), but it suffers from local optimas - also it is not stable, as the clusters depend on the choice of initial centroids.
    • Better way is again to go with multi-layer clustering i.e. first use K-Means to reduce the the sample space of each user, then use hierarchical clustering on each K-Means cluster.
  • The ratings given by an user does not follow any standard. For some user, a rating of 3 maybe actually bad if he generally gives ratings in the higher range (3-5).
    • Normalize each row of the ratings matrix by subtracting the mean rating from each individual rating and dividing by the standard deviation.
    • r'_{ij} = \frac{r_{ij}-\hat{r_i}}{\sigma_i}, where \hat{r_i} is the average rating and \sigma_i is the std. dev. of the ratings given by user i.
  • The predicted ratings obtained are just the mean of all the actual ratings.
    • Weight the ratings with how similar the users in a cluster are to the current user.
    • Let the similarity between user i and user k (who have rated movie j) be denoted as s_{ik}, then the predicted rating for movie j by user i is :
    • r_{ij} = {\sigma_i} * \frac{{\sum_k}s_{ik}*r'_{kj}}{{\sum_k}s_{ik}} + \hat{r_i}
    • We are multiplying with the std. dev. and adding the mean rating for user i, because the ratings r'_{kj} are the normalized ratings.
  • Clusters constrain the similar users to be from the same cluster.
    • While computing weighted mean of ratings, we consider an user in the same cluster "located at the opposite end of the cluster" to be similar but do not consider an user from a different cluster but "located near".
    • Instead of evaluating similar users only from same cluster, fetch the top 100 similar users from own cluster as well as users from 5 other nearest clusters (based on distance from cluster centroid).
  • Similarity measures.
    • Cannot use euclidean distances for weights, because, euclidean distances would consider each unrated movie to have rating of 0 which is not true.
    • Cosine similarity is better measure as it only considers similarities based on movies rated by both users.
    • But there are other problems with cosine similarity.
      • If two users have rated only one common movie between them, then their cosine similarity is 1.0 irrespective of what they have rated.
      • If user A gives a rating of 1 to movie X and user B gives a rating of 5 to X, their cosine similarity if (1*5)/(1*5) = 1.0.
      • Similarly if users A and B rated 'n' number of movies in common, but A rated all 'n' of them as 'a' while B rated all of these 'n' movies as 'b', then their cosine similarity is again 1.0.
      • Dot product = n*a*b, Norm A = a*n0.5, Norm B = b*n0.5, thus cosine similarity = (n*a*b)/(n*a*b) = 1.0
      • In mathematical terms, A and B are collinear vectors of different lengths.
    • We can modify euclidean metric, by computing the root mean square only with the common ratings, and then convert the distance metric to similarity using the formula :
    • similarity(A, B) = 1 - tanh(rms(A, B))
  • Cluster size.
    • There is a trade-off between predicting the number of unknown ratings vs. the quality of the predicted ratings due to the cluster size.
    • If a cluster size is large, then it implies that there could be at-least 50-100 similar users for an user present in the cluster.
      • The upside side to this is, chances of any movie receiving at-least one rating would be high, thus we can predict most of the unknown ratings.
      • The downside to this is that, for any movie which is rated by most or all users in the cluster, the mean rating would be "diluted" even if the ratings are weighted by the similarity scores. All such predicted ratings would be very close to each other (tending towards the mean rating 3.0-3.5).
      • Although we are not concerned about the absolute ratings since we only need them for ranking, but if predicted ratings are very close to each other, using a cut-off threshold to select recommendations becomes difficult.
    • If a cluster size is small, then it implies that only the most 3-4 similar users for each user are there in that cluster.
      • The upside to this is that predicted ratings would actually be well separated.
      • The downside is that most of the unknown ratings would still remain unpredicted.

Instead of clustering, I will do the following:

  • For each user_id U, movie_id M pair in the validation data:
    1. Find out all the user ids, who have given a rating for M. Filter only these user ids.
    2. Then find the 10 most similar user ids from within the filtered user ids in the last step based on our modified euclidean similarity metric.
    3. Calculate the weighted mean of the ratings (weighted by similarity) only for M, from the 10 user ids selected in step 2.
    4. Calculate the RMS loss between predicted rating in step 3 and the actual rating for U, M pair and then sum these RMSE losses.

Python methods for getting similar users (The number of similar users is by default set to 5, but can be varied to check how the RMS score changes):

import math

def user_similarity(user_ratings_1, user_ratings_2):
    sum1, sum2, sums = 0.0, 0.0, 0.0
    dist = 0.0
    movies_rated_1 = set([movie_id for movie_id, rating in user_ratings_1.items()])
    movies_rated_2 = set([movie_id for movie_id, rating in user_ratings_2.items()])
    common_rated = movies_rated_1.intersection(movies_rated_2)
    if len(common_rated) > 0:
        for movie_id in common_rated:
            rating_1, rating_2 = user_ratings_1[movie_id], user_ratings_2[movie_id]
            dist += (rating_1 - rating_2) ** 2
        dist /= float(len(common_rated))
        return 1.0 - np.tanh(math.sqrt(2*dist))
    return 0.0

def get_similar_users(user_id, all_user_ids, normalized_ratings, num_sim=5):
    sims = []
    for j in range(len(all_user_ids)):
        user_id_1 = all_user_ids[j]
        if user_id_1 != user_id:
            sim = user_similarity(normalized_ratings[user_id], normalized_ratings[user_id_1])
            sims.append((user_id_1, sim))
    sims = sorted(sims, key=lambda k:-k[1])
    sims = sims[:min(len(sims), num_sim)]
    return sims

Python method for getting predicted rating for an user id, movie id pair:

def get_predicted_rating(user_id, movie_id, normalized_ratings, user_mean_sd_ratings, similar_user_ids):
    sims = [sim for similar_uid, sim in similar_user_ids]
    sim_sum = np.sum(sims)
    pred = 0.0
    for similar_uid, sim in similar_user_ids:
        if movie_id in normalized_ratings[similar_uid]:
            rating = normalized_ratings[similar_uid][movie_id]
            pred += (sim * rating)/float(sim_sum)
    return user_mean_sd_ratings[user_id][1] * pred + user_mean_sd_ratings[user_id][0]

The RMSE validation error on the 800 user, movie pairs with the above user based CF algorithm is 0.56 (as compared to 1.02 with linear regression model). Which implies that the predicted rating can be off by 0.5 stars.

The performance of the user based CF approach is much better as compared to the linear regression model and it can be due to several reasons.

  • The tags provided by users to movies may not be reliable or according to consensus.
  • Tags can have several variations. For e.g. "thriller" and "suspense" could be counted the same.
  • Insufficient tagged data.
  • Influence of social network. Movies which are popular with viewers, tend to get more views and better ratings over the time due to social network effect. Collaborative filtering actually tries to capture this.
    • Two users located at two different parts of the world, are similar not because by some co-incident, but maybe because they have a central hub in their social network, that influences their similarity. E.g. both follow same celebrities on twitter.

Similar to user based CF, one can also do item based CF, where instead of finding similar users, we find similar movies to the movie being recommended. The implementation of that is pretty easy as we only need to change the order of the keys in the 'user_ratings_map' dictionary. First key should be the movie id and the second key, the user id.

Once we find the similar movies, we compute the weighted mean of all the ratings (weighted by the similarity).

Go to Part 2 to know more about Matrix Factorization techniques.

Online References:


Tags: , , , , ,