Stokastik

Machine Learning, AI and Programming

Category: DESIGN

Fast Nearest Neighbour Search - Product Quantization

In this on-going series of fast nearest neighbor search algorithms, we are going to look at Product Quantization technique in this post. In the last post, we had looked at KD-Trees, which are effecient data structures for low dimensional embeddings and also in higher dimensions provided that the nearest neighbor search radius is small enough to prevent backtracking. Product Quantization or PQ does not create any tree indexing data structure […]

Continue Reading →

Designing large scale similarity models using deep learning

Finding similar texts or images is a very common problem in machine learning used extensively for search and recommendation. Although the problem is very common and has high business value to some organisations, but still this has remained one of the most challenging problems when the database size is very large such as >50GB and we do not want to lose on precision and recall much by retrieving only 'approximately' […]

Continue Reading →

Building a classification pipeline with C++11, Cython and Scikit-Learn

We have earlier seen how using Cython increases the performance of Python code 50-60x, mostly due to static typing as compared to dynamic typing in pure Python. But we have also seen how one can wrap pure C++ classes and functions with Cython and export them as Python packages with improved speed. The codes we have dealt with so far using Cython were mostly generic modules like generating primes using […]

Continue Reading →

Interfacing C++ with Cython

In the last post we saw how we can use Cython programming language to boost speed of Python programs. For a simple program like finding primes upto a certain N, we obtained a gain of around 50-60x with Cython as compared to a naive Python implementation. This is significant when we are going to deploy our codes in production. A complex system will have multiple such programs calling each other […]

Continue Reading →

Designing Movie Recommendation Engines - Part III

In the last two parts of this series, we have been looking at how to design and implement a movie recommendations engine using the MovieLens' 20 million ratings dataset. We have looked at some of the most common and standard techniques out there namely Content based recommendations, Collaborative Filtering and Latent Factor based Matrix Factorization strategy. Clearly CF and MF approaches emerged as the winners due to their accuracy and […]

Continue Reading →

Designing Movie Recommendation Engines - Part II

In the last post, we had started to design a movie recommendation engine using the 20 million ratings dataset available from MovieLens. We started with a Content Based Recommendation approach, where we built a classification/regression model for each user based on the tags and genres assigned to each movie he has rated. The assumption behind this approach is that, the rating that an user has given to a movie depends […]

Continue Reading →

Designing Movie Recommendation Engines - Part I

In this post, we would be looking to design a movie recommendation engine with the MovieLens dataset. We will not be designing the architecture of such a system, but will be looking at different methods by which one can recommend movies to users that minimizes the root mean squared error of the predicted ratings from the actual ratings on a hold out validation dataset.

Continue Reading →

Designing an automated Question-Answering System - Part IV

In the second post of this series we had listed down different vectorization algorithms used in our experiments for representing questions. Representations form the core of our intent clusters, because the assumption is that if a representation algorithm can capture syntactic as well as semantic meaning of the questions well, then if two questions which actually speak of the same intent, will have representations that are very close to each […]

Continue Reading →

Designing a Social Network Site like Twitter

In this post we would be looking at designing a social networking site similar to Twitter. ¬†Quite obviously we would not be designing every other feature on the site, but the important ones only. The most important feature on Twitter is the Feed (home timeline and profile timeline). The feeds on twitter drives user engagement and thus it needs to be designed in a scalable way such that it can […]

Continue Reading →

Designing a Cab Hailing Service like Uber

In this series of posts we will be looking to design a cab hailing service similar to Uber or Ola (in India). We will be mainly concerned about the technical design and challenges and not get into the logistics such as signup and recruitment of drivers, training drivers for customer satisfaction, number of cabs on street and so on. Even for the technical design, we will omit some of the […]

Continue Reading →