Stokastik

Machine Learning, AI and Programming

Tag: K-Means

Fast Nearest Neighbour Search - Product Quantization

In this ongoing series on fast nearest neighbor search algorithms, we are going to look at the Product Quantization technique. In the last post, we looked at KD-Trees, which are efficient data structures for low-dimensional embeddings, and also for higher dimensions provided that the nearest neighbor search radius is small enough to prevent backtracking. Product Quantization, or PQ, does not create any tree indexing data structure […]

Continue Reading →

Designing an Automated Question-Answering System - Part II

In this post we will look at the offline implementation architecture. Assume that there are currently about 100 manual agents, each serving somewhere around 60-80 customers (non-unique) a day, i.e. a total of up to about 8K customer queries each day for our agents. Each customer session has an average of 5 question-answer rounds, including statements, greetings, and contextual and personal questions. Thus we generate about 40K client-agent response pairs […]

Continue Reading →

Initializing cluster centers with K-Means++

The K-Means algorithm is not guaranteed to find a global minimum, since it converges only to a local minimum. Which local minimum is reached, and the number of iterations required to reach it, depend on the selection of the initial set of random centroids. There are many proposed methods for selecting the initial set of centroids for K-Means clustering, such as the Scatter and Gather methods, […]

Continue Reading →

Selecting the optimum number of clusters

Clustering algorithms come with lots of challenges. For centroid-based clustering algorithms like K-Means, the primary challenges are: initialising the cluster centroids; choosing the optimum number of clusters; evaluating clustering quality in the absence of labels; and reducing the dimensionality of the data. In this post we will focus on different ways of choosing the optimum number of clusters. The basic idea is to minimize the sum of the within-cluster sum […]

Continue Reading →