In the last post we had used Conditional Random Fields (CRF) to extract attributes from e-commerce product titles and description. CRFs are linear models just like Logistic Regression. The drawback with linear models is that they do not take feature-feature interaction or higher order feature terms into account while building model. Linear models can under-fit on the data while too much non-linearity can lead to over-fitting. Non-linear models such as Neural […]
In this post we are going to look into how one can use product title and description on e-commerce websites to extract different attributes of the product. This is a very fundamental problem in e-commerce which has widespread implications for Product Search (search filters), Product Matching (matching same items from different sellers), Product Grouping (grouping items by variants such as size and color), Product Graph (relationship between products based on […]
In the last series of posts we have looked at how to recommend movies to users based on the historical ratings. The two most promising approaches were Collaborative Filtering and Matrix Factorization. Both these approaches learns the user-movie preferences only from the ratings matrix. Recall that in the first post of the series, we had started with an approach known as the Content Based Recommendation, where we created a regression […]
In the last two parts of this series, we have been looking at how to design and implement a movie recommendations engine using the MovieLens' 20 million ratings dataset. We have looked at some of the most common and standard techniques out there namely Content based recommendations, Collaborative Filtering and Latent Factor based Matrix Factorization strategy. Clearly CF and MF approaches emerged as the winners due to their accuracy and […]
In the last post, we had started to design a movie recommendation engine using the 20 million ratings dataset available from MovieLens. We started with a Content Based Recommendation approach, where we built a classification/regression model for each user based on the tags and genres assigned to each movie he has rated. The assumption behind this approach is that, the rating that an user has given to a movie depends […]
In this post, we would be looking to design a movie recommendation engine with the MovieLens dataset. We will not be designing the architecture of such a system, but will be looking at different methods by which one can recommend movies to users that minimizes the root mean squared error of the predicted ratings from the actual ratings on a hold out validation dataset.
Problem Statement Solution : This one looks like a very simple problem at first glance but I found it to be quite tricky during implementation. The straightforward solution is to pre-compute the prefix sums S(i), i.e. the sum of all integers from 0 to i-th index for all possible i, and then compute all possible range sums S(i, j), which is the sum of all integers from index i to […]
In the second post of this series we had listed down different vectorization algorithms used in our experiments for representing questions. Representations form the core of our intent clusters, because the assumption is that if a representation algorithm can capture syntactic as well as semantic meaning of the questions well, then if two questions which actually speak of the same intent, will have representations that are very close to each […]
Problem Statement Solution : This is an interesting problem, not because it is difficult or tricky to find an efficient solution, but there are multiple approaches and each approach is dependent on the problem definition and problem test cases. Identifying words where the space between them has been mistakenly omitted is a classic problem is text processing because this is one way by which search engines such as Google provide […]
Problem Statement Solution : Let's try to build a 'bad' solution first. By 'bad', I mean the approach may not be the most optimal but will return correct results every time. One such approach is to list down all possible substrings and count the unique letters in each of them and then take their sum. This approach is perfectly reasonable approach but why it is not optimal ?