Stokastik

Machine Learning, AI and Programming

Dimensionality Reduction using Restricted Boltzmann Machines

Restricted Boltzmann Machines (RBMs) are unsupervised machine learning models useful for dimensionality reduction, classification (as building blocks of deep neural networks), regression, collaborative filtering (for recommendation engines), topic modeling, etc. RBMs are functionally somewhat similar to PCA (SVD), PLSA, LDA, etc., which transform the features of the input data into a lower-dimensional space, capturing the dependencies between different features. RBMs have also been used successfully in problems involving missing/unobserved data. For […]

Continue Reading →
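As a quick illustration of the dimensionality-reduction use case, the sketch below projects the 64-pixel digits data onto 32 RBM hidden units using scikit-learn's BernoulliRBM. The dataset, component count, and learning rate are arbitrary choices for the example, not values from the post.

```python
# A minimal sketch of RBM-based dimensionality reduction with
# scikit-learn's BernoulliRBM (dataset and parameters are illustrative).
from sklearn.datasets import load_digits
from sklearn.neural_network import BernoulliRBM

X, _ = load_digits(return_X_y=True)
X = X / 16.0  # pixel values 0..16 scaled to [0, 1], as BernoulliRBM expects

# Learn 32 hidden units; the hidden activations are the reduced features.
rbm = BernoulliRBM(n_components=32, learning_rate=0.05, n_iter=20, random_state=0)
X_reduced = rbm.fit_transform(X)
print(X.shape, "->", X_reduced.shape)  # (1797, 64) -> (1797, 32)
```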

Classification with Imbalanced Data Sets

In credit card fraud analysis, most datasets are highly skewed, since the number of valid transactions far outweighs the number of fraudulent ones (in many cases, the ratio of valid to fraudulent transactions can be as skewed as 98% to 2%). Without fitting any classification model to the training data, if we simply predicted every unknown transaction to be valid, we would be correct 98% of the time. Even if we fit a […]

Continue Reading →
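To make the 98%-accuracy trap concrete, here is a small sketch on synthetic data: a baseline that always predicts the majority "valid" class scores about 98% accuracy yet catches zero fraudulent transactions. The class ratio and features are made up for illustration.

```python
# The accuracy trap on a 98/2 class split: always predicting "valid"
# scores ~98% accuracy but has zero recall on the fraud class.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
y = rng.choice([0, 1], size=10_000, p=[0.98, 0.02])  # 1 = fraud
X = rng.normal(size=(10_000, 5))                     # synthetic features

clf = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = clf.predict(X)
print("accuracy:", accuracy_score(y, pred))     # ~0.98
print("fraud recall:", recall_score(y, pred))   # 0.0
```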

OCR error correction techniques

There are two aspects to OCR error correction. One is that the OCR mostly confuses certain pairs of characters across document images. Assuming that the document images seen at run time in production will be almost similar to those used for training, there is probably no need for OCR correction, since the OCR will almost certainly confuse the same characters in the testing data, and hence the incorrectly […]

Continue Reading →
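One common correction technique (not necessarily the one this post describes) is to match each OCR output token against a lexicon of expected words. The toy sketch below does this with edit-distance-style matching via Python's difflib; the lexicon, cutoff, and example tokens are all invented for illustration.

```python
# A toy sketch of lexicon-based OCR correction: for each OCR token,
# pick the closest dictionary word by similarity ratio. The lexicon
# and the sample tokens below are made up for illustration.
from difflib import get_close_matches

LEXICON = ["invoice", "total", "amount", "date", "payment"]

def correct_token(token, lexicon=LEXICON, cutoff=0.7):
    """Return the best lexicon match for a token, or the token itself."""
    matches = get_close_matches(token.lower(), lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else token

print([correct_token(t) for t in ["lnvoice", "t0tal", "datc"]])
# -> ['invoice', 'total', 'date']
```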

Examples of Expectation Maximization

In an earlier post, we introduced the concept of expectation maximization (EM). This tool enables us to compute the unknown parameters of a probability distribution in the absence of complete observations. For example, the observations might carry incomplete information, as we saw in the coin toss example, or there could be missing values, such as missing class labels for classification, and so on. EM is a difficult to understand […]

Continue Reading →
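As a concrete instance of the coin toss example mentioned above, the sketch below runs EM for the classic two-coin problem: we observe heads counts per trial but not which coin was tossed, and estimate each coin's heads probability. The counts and initial guesses are illustrative, not the post's numbers.

```python
# A compact EM sketch for the two-coin problem: alternate between
# soft-assigning trials to coins (E-step) and re-estimating each
# coin's bias from the weighted counts (M-step).
import numpy as np
from scipy.stats import binom

heads = np.array([5, 9, 8, 4, 7]); n = 10   # heads out of 10 tosses per trial
theta = np.array([0.6, 0.5])                 # initial guesses for coins A, B

for _ in range(50):
    # E-step: posterior probability each trial came from coin A vs B
    like = np.stack([binom.pmf(heads, n, t) for t in theta])  # (2, trials)
    resp = like / like.sum(axis=0)
    # M-step: weighted maximum-likelihood update of each coin's bias
    theta = (resp @ heads) / (resp.sum(axis=1) * n)

print(theta)  # converged estimates for the two coins
```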

Sampling training documents for multi-class classification

Having huge training data poses several challenges for training a supervised learner: longer training times, higher space and memory requirements, overfitting concerns, and so on. The first step towards structuring data from unstructured examples is creating a document-term matrix. The most common axis along which to reduce the dimensionality of the document-term matrix is the feature dimension, since the number of features far outnumbers the number of documents. We […]

Continue Reading →
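Here is a minimal sketch of the document-term matrix step on a made-up three-document corpus, using scikit-learn's CountVectorizer; the min_df and max_features settings stand in for whatever feature-pruning rule one actually applies to shrink the feature dimension.

```python
# Building a document-term matrix and trimming the feature dimension
# (the tiny corpus and pruning thresholds are purely illustrative).
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "dogs and cats make good pets"]

# min_df / max_features prune rare or excess terms, shrinking the columns.
vec = CountVectorizer(min_df=1, max_features=8)
dtm = vec.fit_transform(docs)            # sparse matrix: docs x terms
print(dtm.shape, vec.get_feature_names_out())
```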

Building a multi-class text classifier from scratch using Neural Networks

In this post we are going to build an artificial neural network from scratch to solve a multi-class text classification problem. Although there exist many advanced neural network libraries for multi-class classification, written in a variety of programming languages, the idea is not to re-invent the wheel but to understand how neural networks work in practice. We will start by explaining the architecture of a neural network model, […]

Continue Reading →
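As a taste of the from-scratch approach, here is a bare-bones NumPy sketch of a single-hidden-layer network with a softmax output, trained by plain gradient descent on random data. The architecture, learning rate, and data are placeholders, not the post's actual model.

```python
# One hidden layer + softmax output, trained from scratch with NumPy.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))                # 100 "documents", 20 features
y = rng.integers(0, 3, size=100)              # 3 classes
W1 = rng.normal(scale=0.1, size=(20, 16)); b1 = np.zeros(16)
W2 = rng.normal(scale=0.1, size=(16, 3));  b2 = np.zeros(3)

for _ in range(200):
    H = np.maximum(0, X @ W1 + b1)            # ReLU hidden layer
    Z = H @ W2 + b2
    P = np.exp(Z - Z.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)         # softmax probabilities
    G = P.copy(); G[np.arange(len(y)), y] -= 1; G /= len(y)  # dLoss/dZ
    dW2 = H.T @ G; db2 = G.sum(axis=0)
    dH = (G @ W2.T) * (H > 0)                 # backprop through ReLU
    dW1 = X.T @ dH; db1 = dH.sum(axis=0)
    for p, g in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        p -= 0.5 * g                          # gradient descent step

print("train accuracy:", (P.argmax(axis=1) == y).mean())
```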

Understanding Convolution for Deep Learning

With the recent advancements in Deep Learning and Artificial Intelligence, there has been continuous interest among machine learning enthusiasts and data scientists in exploring the frontiers of artificial intelligence in small to medium scale applications that were probably the realm of high-speed supercomputers owned by a few tech giants only a few years ago. A few such applications are Image and Speech Recognition, Language Translators, Automated Image Descriptions, Detecting Phrases […]

Continue Reading →

Dynamic K-Nearest Neighbors for class imbalanced data

K-Nearest Neighbors is one of the more popular supervised machine learning algorithms for classification tasks. It is popular because it is easy to implement; in fact, anyone with a basic knowledge of programming could implement it from scratch without using any external libraries. The idea is simple: to predict the class label for any unknown document, compute the distances of the unknown document with […]

Continue Reading →
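A minimal sketch of that idea follows, assuming Euclidean distance and a plain majority vote (the post's dynamic variant presumably adjusts this for class imbalance); all data below is synthetic.

```python
# Vanilla k-nearest-neighbours: label a query point by majority vote
# among its k closest training points (Euclidean distance).
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every point
    nearest = np.argsort(dists)[:k]               # indices of k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(knn_predict(X, y, np.array([3.5, 3.5])))    # likely class 1
```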

Improving unsupervised clustering with pLSA

In the last post we explored how to improve unsupervised clustering with the K-Means algorithm by selecting the initial set of centroids using the K-Means++ algorithm. The sparse implementation in Rcpp outperforms the 'kmeans' method in R in terms of both speed and accuracy, the latter computed using the weighted entropy measure of each cluster. But the K-Means++ implementation suffers from the curse of dimensionality. Although we […]

Continue Reading →
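One plausible reading of the weighted entropy measure mentioned above is sketched below: the entropy of the true class labels within each cluster, weighted by cluster size, so that lower values mean purer clusters. The exact formulation used in the post may differ.

```python
# Weighted-entropy cluster quality: per-cluster label entropy,
# weighted by cluster size (0.0 would mean perfectly pure clusters).
import numpy as np

def weighted_entropy(cluster_ids, true_labels):
    total, score = len(cluster_ids), 0.0
    for c in np.unique(cluster_ids):
        labels = true_labels[cluster_ids == c]
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        score += (len(labels) / total) * -(p * np.log2(p)).sum()
    return score

clusters = np.array([0, 0, 0, 1, 1, 1])   # illustrative assignments
labels   = np.array([0, 0, 1, 1, 1, 1])   # illustrative true labels
print(weighted_entropy(clusters, labels))  # ~0.459
```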

Improving clustering performance with K-Means++ in R and CPP

K-means clustering is a widely used method for unsupervised learning. Given a set of N unlabelled data points in a d-dimensional space, the objective is to group these points into 'k' clusters, such that the sum of distances of all data points from their cluster centroids is minimized. The following objective is to be minimized (assuming that Euclidean distance is used as the distance metric):

J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2

where C denotes the […]

Continue Reading →
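For reference, here is a compact sketch of the K-Means++ seeding rule in Python (the post's implementations are in R and Rcpp): the first centroid is chosen uniformly at random, and each subsequent centroid is sampled with probability proportional to its squared distance from the nearest centroid chosen so far.

```python
# K-Means++ initialization: bias centroid selection toward points
# far from the centroids already chosen.
import numpy as np

def kmeans_pp_init(X, k, rng=np.random.default_rng(0)):
    centroids = [X[rng.integers(len(X))]]          # first centroid: uniform
    for _ in range(k - 1):
        # squared distance of each point to its nearest chosen centroid
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

# three well-separated synthetic blobs for illustration
X = np.vstack([np.random.default_rng(1).normal(m, 0.5, (50, 2)) for m in (0, 5, 10)])
print(kmeans_pp_init(X, k=3))
```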