Stokastik

Machine Learning, AI and Programming

LeetCode: Three Equal Parts

Problem Statement Solution: The problem can be approached in several ways. One trivial approach is to compute the decimal value of every binary subsequence [i, j], where j >= i. Once we have the decimal values for all [i, j], we can then find the indices i' and j' such that the decimal values of [0, i'], [i'+1, j'-1] and [j', N-1] are equal, where N is the length of […]
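As a rough illustration of the trivial approach described in this excerpt, here is a minimal brute-force sketch. The function name `three_equal_parts_bruteforce` and the helper `value` are hypothetical, and the return convention ([i', j'], or [-1, -1] when no split exists) is an assumption; the full post may well use a different, faster method.

```python
from typing import List


def three_equal_parts_bruteforce(bits: List[int]) -> List[int]:
    """Try every split (i, j) and compare the values of the three parts.

    The parts are bits[0..i], bits[i+1..j-1] and bits[j..N-1], so each
    part is non-empty when j >= i + 2. Leading zeros do not change a value.
    """
    n = len(bits)

    def value(lo: int, hi: int) -> int:
        # Decimal value of the binary sequence bits[lo..hi], inclusive.
        v = 0
        for b in bits[lo:hi + 1]:
            v = v * 2 + b
        return v

    for i in range(n - 2):
        first = value(0, i)
        for j in range(i + 2, n):
            if value(i + 1, j - 1) == first and value(j, n - 1) == first:
                return [i, j]
    return [-1, -1]  # assumed convention for "no valid split"
```

This recomputes values on every iteration, so it is only practical for short inputs; it is meant to make the index ranges concrete, not to be efficient.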


LeetCode: Minimum Refuel Stops

Problem Statement Solution: At first, this looks like a classic graph traversal problem. For example, I might be tempted to use a BFS approach to find the minimum number of steps to reach the target. But one important distinction between this problem and a standard BFS is that in a standard BFS problem, either the car can directly reach station Y from station X or it cannot, depending on whether there […]


Interfacing C++ with Cython

In the last post we saw how we can use the Cython programming language to boost the speed of Python programs. For a simple program like finding primes up to a certain N, we obtained a gain of around 50-60x with Cython compared to a naive Python implementation. This is significant when we are going to deploy our code in production. A complex system will have multiple such programs calling each other […]


Speeding up with Cython

While I was working with the R programming language, coming from a Java/C++ background, I always found it to be slow. Then I discovered the "Rcpp" package, which allowed me to write C++ code and call it from R. It improved my program speed by a factor of 50-100x. Most of the coding that I did was in C++ while the interface remained in R. Since I moved […]


BiLSTM-CRF Sequence Tagging for E-Commerce Attribute Extraction

In the last post we used Conditional Random Fields (CRF) to extract attributes from e-commerce product titles and descriptions. CRFs are linear models just like Logistic Regression. The drawback of linear models is that they do not take feature-feature interactions or higher-order feature terms into account while building the model. Linear models can under-fit the data, while too much non-linearity can lead to over-fitting. Non-linear models such as Neural […]


Attribute Extraction from E-Commerce Product Description

In this post we are going to look at how one can use product titles and descriptions on e-commerce websites to extract different attributes of a product. This is a fundamental problem in e-commerce with widespread implications for Product Search (search filters), Product Matching (matching the same items from different sellers), Product Grouping (grouping items by variants such as size and color), Product Graph (relationship between products based on […]


Factorization Machines for Movie Recommendations

In the last series of posts we looked at how to recommend movies to users based on their historical ratings. The two most promising approaches were Collaborative Filtering and Matrix Factorization. Both of these approaches learn the user-movie preferences only from the ratings matrix. Recall that in the first post of the series, we started with an approach known as Content Based Recommendation, where we created a regression […]


Designing Movie Recommendation Engines - Part III

In the last two parts of this series, we have been looking at how to design and implement a movie recommendation engine using the MovieLens 20 million ratings dataset. We have looked at some of the most common and standard techniques out there, namely Content Based Recommendation, Collaborative Filtering, and the Latent Factor based Matrix Factorization strategy. Clearly, the CF and MF approaches emerged as the winners due to their accuracy and […]


Designing Movie Recommendation Engines - Part II

In the last post, we started to design a movie recommendation engine using the 20 million ratings dataset available from MovieLens. We started with a Content Based Recommendation approach, where we built a classification/regression model for each user based on the tags and genres assigned to each movie they have rated. The assumption behind this approach is that the rating that a user has given to a movie depends […]


Designing Movie Recommendation Engines - Part I

In this post, we will design a movie recommendation engine with the MovieLens dataset. We will not be designing the architecture of such a system, but will be looking at different methods for recommending movies to users that minimize the root mean squared error between the predicted ratings and the actual ratings on a hold-out validation dataset.
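To make the evaluation metric concrete, the root mean squared error over a hold-out set can be sketched as follows. This is only an illustration: the function name `rmse` and the toy ratings are made up for the example and are not taken from the post or from the MovieLens data.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Root mean squared error between predicted and actual ratings."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean of the squared differences, then the square root.
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared_error / len(actual))


# Hypothetical hold-out ratings, for illustration only.
predicted_ratings = [4.0, 3.0]
actual_ratings = [5.0, 1.0]
print(rmse(predicted_ratings, actual_ratings))
```

A lower value means the predicted ratings are closer to the actual ones on average, with larger individual errors penalized more heavily because of the squaring.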
