Machine Learning, AI and Programming

Factorization Machines for Movie Recommendations

In the last series of posts we have looked at how to recommend movies to users based on the historical ratings. The two most promising approaches were Collaborative Filtering and Matrix Factorization. Both these approaches learns the user-movie preferences only from the ratings matrix.

Recall that in the first post of the series, we had started with an approach known as the Content Based Recommendation, where we created a regression model for each user. Based on the movies an user has rated and the tags received on each of these movies, we created a TF-IDF matrix for each user with the ratings as the output labels. Then fitted a Neural Network regression model to predict the unknown ratings.

The problem with this approach is that there has to be a model for each user and having a model for each user is not scalable.

This problem arises due to the fact that if we had a single global model for all ratings with the movie tags as feature vectors, then for the same feature vector we would be having different output labels. Thus we need additional features for each rating to make these vectors non-unique.

Few 'easy to include' features could be:

  • One hot encoding for the user of that rating. For e.g. if there are 100 users, then for each rating we would be having a feature vector of size 100, with all values set to 0 except for the current user which is 1. This 100 length vector can be concatenated with the tag features.
  • One hot encoding for the movie.
  • Genre of the movie.
  • User features if available, for e.g. age group, country, occupation, user preferences for particular genres etc.
  • What other movies the user has rated. Basically concatenating each row of the ratings matrix with each feature vector. The values could be the ratings or just 1 if rated else 0.
  • Count of 1, 2, 3, 4 and 5 stars received by the movie. User might bias his rating based on these counts.
  • Count of 1, 2, 3, 4 and 5 stars given by the user.
  • Average of the last 5 ratings received for the movie and average of the last 5 ratings given by the user.
  • ...and so on.

What we have seen in the matrix factorization approaches so far is that, instead of explicitly using the above features, the algorithms tries to implicitly learn them, for e.g. latent factors, user and movie biases, movie similarity  etc.

While on the other hand, the feature based modeling approaches seen so far does not include user-movie interactions. There is a more generic technique where feature based modeling can be combined with factorization, known as Factorization Machines. The word 'Machines' is derived from Support Vector Machines, we will see why.

Original paper of Factorization Machines.

Let's assume that we have the following data available (from MovieLens 20 million dataset):

  1. The ratings matrix
  2. Tags given to movies.
  3. Genres of movies.

For each rating y_k=r_{ij} given by user i for movie j, we create feature vectors x_k. The feature vectors x_k is the concatenation of the following feature vectors:

x_k=x^{(1)}_k\;||\;x^{(2)}_k\;||\;x^{(3)}_k\;||\;x^{(4)}_k\;||\;x^{(5)}_k, where || is concatenation operator.

x^{(1)}_k: one hot encoding for the user i, i.e. it is a vector of size 27k, where x^{(1)}_k=0 except for x^{(1)}_{ki}=1.

x^{(2)}_k: one hot encoding for the movie j, i.e. it is a vector of size 138k, where x^{(2)}_k=0 except for x^{(2)}_{kj}=1.

x^{(3)}_k: TF-IDF vector for all tags for movie j.

x^{(4)}_k: TF-IDF vector for genres of movie j.

x^{(5)}_k: Row r_i of the ratings matrix binarized.

In Factorization Machines, we model the ratings y as following:


Essentially this is a non-linear regression equation.

The input features x can be represented as a matrix N of size KxZ, where K is the number of available ratings and Z is the number of possible features.

The value w_p represents the weight for the features x_p.

V is a matrix of H dimensional representation for each feature. Thus V is of dimensions ZxH.

v_p is the factor vector corresponding to the p-th feature. The term v^T_pv_q represents the dot product between the p-th feature and q-th feature. Think of these vectors as similar to embeddings for word features in NLP.

Note that now each rating is expressed in terms of individual features of both user and movie as well as pair-wise interactions between these features. Unlike in matrix factorization, we do not have separate latent factor matrices for user and movie since we do not separate the user and movie features.

The unknown parameters to learn are w_0, w_p and the factor matrix V. Thus the total number of unknowns to learn is O(ZH).

One can learn these parameters using gradient descent techniques. The simplest gradient descent technique computes the partial derivatives of the RMSE loss w.r.t. each unknown. For e.g. the update equations for w_p are as follows:

w_p=w_p-\eta*\frac{{\partial}L}{{\partial}w_p}, where \eta is the learning rate


L=\sum_{k=1}^N(y_k-\hat{y}_k)^2+\lambda*(w^2_0+\sum_{p=1}^Zw^2_p+\sum_{p=1}^Z||v_p||^2), \lambda is the regularization constant

The partial derivatives are computed as:


where e_k=y_k-\hat{y}_k.

In stochastic gradient descent we take only one of the ratings, i.e.

w_p=w_p-\eta*(e_k*x_{kp}-{\lambda}*w_p)  and  w_0=w_0-\eta*(e_k-{\lambda}*w_0)

Similarly, the SGD update equation for the factors v_p are as follows:

v_p=v_p-\eta*(e_k*x_{kp}*(\sum_{q=1}^Zv_q*x_{kq}-v_p*x_{kp})-{\lambda}*v_p), note that v_p is a vector whereas w_p is a scalar.

But how are FM similar to SVM ?

Recall that in a SVM classifier, the predicted labels are calculated as:


where \text{sgn}(x) is the sign function, i.e. \text{sgn}(x)=1 if x > 0 else -1, and u_0 and u_p are the weights to be learnt by the SVM model.

x_p are the features for the classifier.

Let's denote f(x_k)=u_0+\sum_{p=1}^Zu_p*x_{kp}.

SVM and similar models like Logistic Regression and Linear Regression can be 'kernelized', i.e. instead of using the features x_p, transform these features into a higher dimensional space through the transformation g(x)


Without getting into the details of Lagrange multipliers, the weights u_p are calculated as:

u_p=\sum_{s=1}^S{\alpha_s}*y_s*g(x_{sp}), where \alpha_s are the Lagrange multipliers and y_s are the true labels.

For SVM's, the summation \sum_{s=1}^S is over the support vectors only, whereas in Kernel Logistic Regression or Kernel Ridge Regression, the summation is over all input examples.




where the term {g(x_s)}^T{g(x_k)} represents the dot product between an example 'k' with one of the support vectors 's'.

In theory, there are certain class of functions K(x, x'), which can be expressed in terms of the dot product:


Thus if we choose functions K(x, x') appropriately, we do not need to compute g(x) explicitly, because we can solve the equations with K(x, x') only, i.e.

f(x_k)=u_0+\sum_{s=1}^S{\alpha_s}*y_s*K(x_k, x_s)

The functions K(x, x') are known as kernel functions.

In SVM's the kernel functions can be linear, polynomial, radial basis and so on.

For a linear kernel, K(x, x')=x^Tx' is simply the dot product of the original feature vectors. In such a case, the transformation g(x) is simply the original features, i.e.

g(x)=[x_1, x_2, ..., x_Z]

Then the original primal equation is just:


Similarly, if the kernel is a polynomial function of degree d=2, then:

K(x, x')=(1+x^Tx')^2

If we try to obtain the transformation g(x) for this kernel, we get the following:

g(x)=[1, \sqrt{2}x_1, \sqrt{2}x_2, ..., x^2_1, x^2_2, ..., \sqrt{2}x_1x_2, \sqrt{2}x_1x_3, ..., \sqrt{2}x_ix_j,...]

One can verify that {g(x)}^T{g(x')} gives K(x, x').

Then the original primal equation is:


where I have broken down the weights u into u' and u''. Note that although u' is a vector of size Z (one for each feature x_p), whereas u'' is a weights matrix, where u''_{p,q} represents the weight of the interaction between the features x_p and x_q. Thus u'' is of dimensions ZxZ.

Notice how this equation is similar to the equation for FM:


except for the self interaction terms and one more important difference is that, the weights for the pairwise cross interaction terms.

In SVM polynomial kernel of d=2, the weights for the terms x_{kp}*x_{kq} are u''_{p,q} which are independent weights whereas in FM, the weights are v^T_pv_q, which are factorized weights.

The advantages of using factorized weights are:

  1. In case of SVM, each weight u''_{p,q} is learnt independently, thus if the feature matrix is too sparse, the number of parameters to estimate will be too many as compared to the number of data points. Thus the weights will not be learnt well.
    • That is why in case of highly sparse data linear kernels perform much better than higher degree polynomial kernels or RBF kernels.
    • Thus for problems with very sparse data like recommendation engines or text based classification, SVM with linear kernel or Factorization machines should perform better than SVM with higher degree polynomials.
  2. With factorized weights, the factor vectors are shared, for e.g. the feature vector v_p is used to compute the weights for all the interaction terms of p with all other features q.
    • Thus if an user X, has watched and rated 'Avengers' very high and if he has not watched 'Justice League (JL)', then although v^T_{\text{user=X}}v_{\text{movie=Avengers}} would be high, we can infer that v^T_{\text{user=X}}v_{\text{movie=JL}} should also be high from the following:
    • Due to similarity of 'Avengers' and 'JL', the term v^T_{\text{movie=Avengers}}v_{\text{movie=JL}} would be high.
    • For all users X' who are similar to X i.e. v^T_{\text{user=X}}v_{\text{user=X'}} is high, who have rated both 'Avengers' and 'JL'.
  3. Number of unknowns to estimate is O(Z2) with SVM(d=2), whereas with FM(d=2), the number of unknowns to estimate is O(ZH), and in most cases H << Z.

Similar to SVM with polynomial kernels of degree > 2, we can also define FM's with degree > 2.

Whereas the number of unknowns to estimate in case of a SVM kernel with degree d=L, is O(ZL), due to the use of factorized weights, the unknowns to estimate in case of FM of degree=L, is always O(ZH).

If we define " /> as the dot product of L vectors, then we can write the equation for an FM of degree d=L as:

*x_{kp}*x_{kq}*...*x_{kt}" />

The summation term contains interaction of 1, 2, 3, ... to L number of features together.

Let's do some implementation in python and see how the results compare. To begin with, the MovieLens dataset has a total of around 20million ratings for 138k users and 27k movies.

Thus if we create the feature matrix on the entire dataset, we will have for each rating, 138k (user encoding vector) + 27k (movie encoding vector) + 27k (binarized ratings vector for user) + 34k (tagging features) = 226k features. Thus we would have a sparse matrix of size 20m x 226k.

Due to limitations on the CPU memory on my mac-book, I will use only the first 10k ratings only. Also I will combine the tags and genres features for the movies into a single feature vector since the genres of a movie also appears as tags in most cases.

Also we will be working with sparse matrices and sparse matrix operations as much as possible in order to reduce memory usage.

The following code creates the user and movie encoding matrix as sparse matrices.

from scipy.sparse import csr_matrix

uid_mat = csr_matrix(([], ([], [])), shape=(num_ratings, num_users))
mid_mat = csr_matrix(([], ([], [])), shape=(num_ratings, num_movies))

uid_mat[range(num_ratings), uids] = 1
mid_mat[range(num_ratings), mids] = 1

Next we read the tags and genres data from respective files and create a TF-IDF matrix out of the features.

import re
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

tags_df = pd.read_csv(os.path.join(dataset_path, "tags.csv"), encoding="utf-8", sep=",")

genres_df = pd.read_csv(os.path.join(dataset_path, "movies.csv"), encoding="utf-8", sep=",")

stop_words = set(stopwords.words('english'))

movie_id, tags = list(tags_df[u'movieId']), list(tags_df[u'tag'])

tags = [str(tag) for tag in tags]

movie_tag_map = defaultdict(list)

for idx in range(len(movie_id)):
    tag = tags[idx].lower()
    tag = re.sub("[^a-zA-Z0-9 ]", " ", tag)
    tag = tag.strip()
    tag = re.sub("\s+", " ", tag)
    if len(tag) > 0:
        tag_words = tag.split(" ")
        tag = " ".join([x for x in tag_words if x not in stop_words])
movie_id, genres = list(genres_df[u'movieId']), list(genres_df[u'genres'])

for idx in range(len(movie_id)):
    genre = genres[idx].lower()
    all_genres = genre.split("|")
    for gen in all_genres:

movie_tags = []

for mid in range(n_movies):

vectorizer = TfidfVectorizer(tokenizer=lambda sent: sent.split("###"), ngram_range=(1,1), stop_words='english')
movie_tag_mat = vectorizer.fit_transform(movie_tags)

tag_mat = movie_tag_mat[mids,:]

Then we create the binarized ratings matrix for each rating.

implicit = sparse.csr_matrix((ratings_matrix != 0).astype(int))
imp_mat = implicit[uids,:]

Lastly, we combine all the above matrices into a single feature matrix.

from scipy.sparse import hstack
mat = hstack((uid_mat, mid_mat, tag_mat, imp_mat))

Now we have a feature matrix corresponding to each of the ratings. We can try to fit a regression model on top of this data and then use the model to predict unknown ratings. To evaluate our model, we split the dataset into training (80%) and testing (20%).

X_train, X_test, y_train, y_test = train_test_split(mat, ratings, test_size=0.2)

In order to illustrate our points that:

  1. FM of degree=2 performs better than SVMs of polynomial degree=2
  2. SVMs of 'linear' kernel performs better than 'polynomial' kernels of higher degree.

Let's fit the training data to :

  1. A Support Vector Regression model of polynomial kernel degree=2
  2. A Support Vector Regression model of linear kernel.

With polynomial kernel of degree=2.

from sklearn.svm import SVR

model = SVR(kernel='poly', degree=2), y_train)

With linear kernel.

model = SVR(kernel='linear'), y_train)

Then use the model to predict the ratings on the test data and then compute the RMSE scores of the predictions with the actual ratings for the test data.

from sklearn.metrics import mean_squared_error

preds = model.predict(X_test)
print mean_squared_error(y_test, preds)

The RMSE loss on the test ratings for polynomial kernel of degree=2 is 1.18, while the RMSE loss for the linear kernel is 0.98. Clearly the linear kernel SVM performs better than polynomial kernel.

Then we come to implementing our FM of degree=2.

Let's initialize the weights and factor matrices first. Since we are going to use Adam optimization technique to learn the weights and the factor vectors, we initialize auxiliary weights.

weight_0 = 0.0
weight_0_m, weight_0_v = 0.0, 0.0
weight_k = np.zeros(mat.shape[1])
weight_k_m, weight_k_v = np.zeros(mat.shape[1]), np.zeros(mat.shape[1])

We keep each factor vector to be of size 64.

k = 64
factors = np.zeros((mat.shape[1], k))
factors_m, factors_v = np.zeros((mat.shape[1], k)), np.zeros((mat.shape[1], k))

Next we define the functions for getting the predictions and the errors in predictions (not the RMSE loss).

def get_predictions(selected_data, weight_0, weight_k, factors):
    x =
    y = selected_data.power(2).dot((factors**2))
    z = 0.5 * ((x**2) - y)
    return weight_0 + np.squeeze(np.asarray(selected_data.multiply(weight_k).sum(axis=1))) + z.sum(axis=1)

def get_errors(selected_data, weight_0, weight_k, factors, true_labels):
    preds = get_predictions(selected_data, weight_0, weight_k, factors)
    return true_labels - preds

Note that in order to optimize the computation of the predicted ratings, we use the equation suggested in the original paper by Rendell.

Then we set the learning rate parameters, the regularization constant 'lambda', the batch size etc.

eta, lambdas = 0.001, 0.1
beta1, beta2 = 0.9, 0.999
eps = 1e-8

batch_size = 128

num_iter, losses = 0, []

Lastly we train the FM model using Adam optimization. We evaluate the performance of the learnt model on the testing data after every 10 iterations. Since the feature matrix is huge and we have the term for the interactions of the features, the learning is a quite slow, but it has been optimized with heavy usage of scipy sparse matrix operations.

I am not explaining the matrix operations here.

while True:
    num_iter += 1

    if num_iter % 10 == 0:
        errs_validation = get_errors(X_test, weight_0, weight_k, factors, y_test)
        rmse_loss = np.sqrt(np.mean(errs_validation**2))


        print rmse_loss
    selected_rows = random.sample(range(X_train.shape[0]), batch_size)
    selected_data = X_train[selected_rows,:]
    selected_labels = np.asarray(y_train)[selected_rows]

    errs_train = get_errors(selected_data, weight_0, weight_k, factors, selected_labels)

    #Updates for the weights weight_0
    x, u1, v1 = weight_0, weight_0_m, weight_0_v
    grad = -(np.sum(errs_train) - lambdas * x)
    u1 = beta1 * u1 + (1 - beta1) * grad
    v1 = beta2 * v1 + (1 - beta2) * (grad**2)
    x += -eta * u1/(np.sqrt(v1) + eps)
    weight_0, weight_0_m, weight_0_v = x, u1, v1
    #Updates for the weights weight_k
    x, u1, v1 = weight_k, weight_k_m, weight_k_v
    grad = -( - lambdas * x)
    u1 = beta1 * u1 + (1 - beta1) * grad
    v1 = beta2 * v1 + (1 - beta2) * (grad**2)
    x += -eta * u1/(np.sqrt(v1) + eps)
    weight_k, weight_k_m, weight_k_v = x, u1, v1
    #Updates for the factor vectors
    x, u1, v1 = factors, factors_m, factors_v
    a, b =, selected_data.T.multiply(errs_train).T
    c = selected_data.power(2).T.multiply(errs_train).T.tocsc()
    f =
    g = csr_matrix(([], ([], [])), shape=(x.shape[0], x.shape[1]))
    for k in range(batch_size):
        g += c[k,:].T.multiply(x)
    h = f - g.toarray()
    grad = -(h - lambdas * x)
    u1 = beta1 * u1 + (1 - beta1) * grad
    v1 = beta2 * v1 + (1 - beta2) * (grad**2)
    x += -eta * u1/(np.sqrt(v1) + eps)
    factors, factors_m, factors_v = x, u1, v1

In each iteration, we sample a batch of ratings (batch gradient descent).

Interestingly the learning and the RMSE loss heavily depends on the batch size. This is primarily due to the fact that we are using very few ratings (8k) for training the model and thus when the batch size is low, the sparsity is very high and the model is not able to learn and update the parameters well

Earlier we have seen that with SVD++ algorithm, that even with a batch size of 32, it is able to learn well, but with FM on 8k ratings, a good batch size is 128.

With FM of degree=2, we obtain an RMSE loss of 0.92 on the test data with the above parameter settings. Thus we see that FM's do actually perform much better than SVM, due to the reasons mentioned above.


Tags: , , , , ,