Stokastik

Machine Learning, AI and Programming

Building an Incremental Named Entity Recognizer System

In the last post, we saw how to train a system to identify Part Of Speech tags for words in sentences. In essence, we found that discriminative models such as Neural Networks and Conditional Random Fields outperform other methods by 5-6% in prediction accuracy. In this post, we will look at another common problem in Natural Language Processing, known as Named Entity Recognition (NER for short). The problem […]
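The full post is only excerpted here, but for quick orientation, here is a minimal NER sketch using NLTK's pre-trained chunker; note the post itself builds an incremental recognizer rather than using this off-the-shelf model, so this is just an illustration of the task.

```python
import nltk

# One-time resource downloads (standard NLTK resource ids):
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
# nltk.download('maxent_ne_chunker'); nltk.download('words')

sentence = "Barack Obama was born in Hawaii."
tokens = nltk.word_tokenize(sentence)   # split the sentence into word tokens
tagged = nltk.pos_tag(tokens)           # POS tags feed the named entity chunker
tree = nltk.ne_chunk(tagged)            # labels spans as PERSON, GPE, ORGANIZATION, ...
print(tree)
```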

Continue Reading →

Building a POS Tagger with Python NLTK and Scikit-Learn

In this post we are going to learn about Part-Of-Speech taggers for the English language and look at multiple methods of building a POS tagger with the help of the Python NLTK and scikit-learn libraries. The available methods range from simple regular-expression-based taggers to classifier-based ones (Naive Bayes, Neural Networks and Decision Trees) and then sequence-model-based ones (Hidden Markov Model, Maximum Entropy Markov Model and Conditional Random […]
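As a taste of the simpler tagger families the post covers, here is a minimal sketch of an NLTK backoff chain trained on the Penn Treebank sample; the regex rules and split sizes are arbitrary choices for illustration.

```python
import nltk
from nltk.corpus import treebank
from nltk.tag import DefaultTagger, RegexpTagger, UnigramTagger

# nltk.download('treebank')  # one-time corpus download
train, test = treebank.tagged_sents()[:3000], treebank.tagged_sents()[3000:]

t0 = DefaultTagger('NN')                    # fall back to the most common tag
t1 = RegexpTagger([(r'.*ing$', 'VBG'),      # simple suffix-based rules
                   (r'.*ed$', 'VBD'),
                   (r'^\d+$', 'CD')], backoff=t0)
t2 = UnigramTagger(train, backoff=t1)       # per-word lookup learned from data
print(t2.evaluate(test))                    # accuracy(); evaluate() is deprecated in newer NLTK
```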

Continue Reading →

Understanding Conditional Random Fields

Sequence modeling and prediction is a very common data mining task. The most common use case for sequence modeling is in NLP (POS tagging and Named Entity Recognition). NER can have many possible variants depending on the named entities: identifying city names, country names, PIN codes, etc. in addresses; identifying movie names, actors, genres, etc. in movie reviews; and so on. Apart from NER, some other use cases of sequence […]
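For reference (standard notation, not taken from the post itself), a linear-chain CRF models the conditional probability of a label sequence $y$ given an observation sequence $x$ as

$$p(y \mid x) = \frac{1}{Z(x)} \exp\left(\sum_{t=1}^{T} \sum_{k} \lambda_k\, f_k(y_{t-1}, y_t, x, t)\right), \qquad Z(x) = \sum_{y'} \exp\left(\sum_{t=1}^{T} \sum_{k} \lambda_k\, f_k(y'_{t-1}, y'_t, x, t)\right),$$

where the $f_k$ are feature functions over adjacent labels and the observations, and the $\lambda_k$ are weights learned from data.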

Continue Reading →

Common and not so common Machine Learning Questions and Answers (Part III)

Which loss function is better for neural network training, the logistic loss or the squared error loss, and why? The loss function depends mostly on the type of problem we are solving and on the activation function. In the case of regression, where the values from the output units are normally distributed, the squared error is the preferred loss function, whereas in a classification problem, where the output units follow a multinomial distribution, the […]
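The standard argument, sketched here for a single sigmoid output $\hat{y} = \sigma(z)$ with target $y$: under the squared error $L = \frac{1}{2}(\hat{y} - y)^2$,

$$\frac{\partial L}{\partial z} = (\hat{y} - y)\,\sigma(z)\,(1 - \sigma(z)),$$

which is tiny whenever the unit saturates, even if the prediction is badly wrong. Under the logistic loss $L = -\left[y \log \hat{y} + (1 - y)\log(1 - \hat{y})\right]$, the $\sigma'(z)$ factor cancels, leaving

$$\frac{\partial L}{\partial z} = \hat{y} - y,$$

so learning does not stall on confident mistakes.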

Continue Reading →

Optimization Methods for Deep Learning

In this post I am going to give a brief overview of a few of the common optimization techniques used in training a neural network, from simple classification problems to deep learning. As we know, the critical part of a classification algorithm is to optimize the loss (objective) function in order to learn the correct parameters of the model. The type of the objective function (convex, non-convex, constrained, unconstrained etc.) along with […]
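As one concrete example of the techniques such a survey typically includes, a minimal NumPy sketch of SGD with classical momentum; the objective, learning rate and momentum constant are illustrative choices, not values from the post.

```python
import numpy as np

def sgd_momentum(grad_fn, w, lr=0.01, mu=0.9, steps=100):
    """Plain SGD with classical momentum; grad_fn returns the gradient at w."""
    v = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        v = mu * v - lr * g   # decaying moving average of past gradients
        w = w + v             # step along the accumulated velocity
    return w

# Example: minimize the convex quadratic f(w) = ||w||^2, whose gradient is 2w.
w_opt = sgd_momentum(lambda w: 2 * w, np.array([3.0, -4.0]))
print(w_opt)  # approaches [0, 0]
```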

Continue Reading →

Common and not so common Machine Learning Questions and Answers (Part II)

Why does the negative sampling strategy work during training of word vectors? In word2vec training, the objective is to have semantically and syntactically similar words close to each other in terms of the cosine distance between their word vectors. In the skip-gram architecture, the probability of a word 'c' being predicted as a context word at the output node, given the target word 'w' and the input and output weights […]
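For context (standard word2vec notation, not from the excerpt): with input vector $v_w$ for the target word and output vector $u_c$ for a context word, the full softmax is

$$P(c \mid w) = \frac{\exp(u_c^\top v_w)}{\sum_{c' \in V} \exp(u_{c'}^\top v_w)},$$

whose denominator sums over the entire vocabulary $V$. Negative sampling sidesteps this cost by instead maximizing, per (target, context) pair,

$$\log \sigma(u_c^\top v_w) + \sum_{i=1}^{k} \mathbb{E}_{n_i \sim P_n}\left[\log \sigma(-u_{n_i}^\top v_w)\right],$$

i.e. distinguishing the true context word from $k$ noise words drawn from a noise distribution $P_n$ (the unigram distribution raised to the 3/4 power in the original word2vec paper).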

Continue Reading →

Generative vs. Discriminative Spell Corrector

We have earlier seen two approaches to spelling correction in text documents. Most of the spelling errors encountered are in either user-generated content or OCR outputs of document images. The presence of spelling errors introduces noise into the data, and as a result the impact of important features gets diluted. Although the methods explained differ in how they are implemented, theoretically both of them work on the same principle. […]
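On the generative side, a compact noisy-channel-style corrector in the spirit of Norvig's classic sketch; the toy corpus string stands in for the large corpus a real language model would be counted from.

```python
import re
from collections import Counter

# Toy language model: in practice WORDS is counted from a large corpus.
TEXT = "spelling correction requires counting words in a large corpus of words"
WORDS = Counter(re.findall(r'[a-z]+', TEXT.lower()))

def edits1(word):
    """All strings one edit away: deletes, transposes, replaces, inserts."""
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Generative step: pick the known candidate with the highest corpus count."""
    candidates = ({word} & WORDS.keys()) or (edits1(word) & WORDS.keys()) or {word}
    return max(candidates, key=WORDS.get)

print(correct('speling'))  # -> 'spelling'
```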

Continue Reading →

Common and not so common Machine Learning Questions and Answers (Part I)

What is the role of the activation function in neural networks? The role of the activation function in a neural network is to produce a non-linear decision boundary via non-linear combinations of the weighted inputs. Without its hidden layers, a neural network classifier is essentially a logistic regression classifier. Non-linearity is added to a neural network by the hidden layers, using a sigmoid or similar activation function.
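A tiny NumPy demonstration of the point (toy shapes, not from the post): stacking linear layers without an activation collapses to a single linear map, while a sigmoid in between does not.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x = rng.normal(size=3)

# Two linear layers with no activation equal a single linear map W2 @ W1.
linear_stack = W2 @ (W1 @ x)
collapsed = (W2 @ W1) @ x
print(np.allclose(linear_stack, collapsed))  # True: no extra expressive power

# Inserting a sigmoid between the layers breaks this equivalence.
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
nonlinear_stack = W2 @ sigmoid(W1 @ x)
print(np.allclose(nonlinear_stack, collapsed))  # False
```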

Continue Reading →

Using Word Vectors in Multi-Class Text Classification

Earlier we have seen how, instead of representing words in a text document as isolated features (or as N-grams), we can encode them into multidimensional vectors where each dimension of the vector represents some kind of semantic or relational similarity with other words in the corpus. Machine learning problems such as classification or clustering require documents to be represented as a document-feature matrix (with TF or TF-IDF weighting), thus we need some […]
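One common way to bridge word vectors and the document-feature matrix mentioned above is to average the vectors of each document's words; a toy sketch, with made-up two-dimensional embeddings standing in for pre-trained word2vec or GloVe vectors.

```python
import numpy as np

# Toy embeddings; in practice these come from pre-trained word vectors.
word_vecs = {'good': np.array([0.9, 0.1]), 'movie': np.array([0.2, 0.8])}

def doc_vector(tokens, word_vecs, dim=2):
    """Average the vectors of in-vocabulary words into one fixed-length feature."""
    vecs = [word_vecs[t] for t in tokens if t in word_vecs]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

docs = ['good movie', 'good']
X = np.vstack([doc_vector(d.split(), word_vecs) for d in docs])
print(X.shape)  # (n_documents, embedding_dim) matrix for any multi-class classifier
```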

Continue Reading →

Designing a Contextual Graphical Model for Words

I have been reading about word embedding methods that encode words found in text data into multi-dimensional vectors. The purpose of encoding into vectors is to give "meaning" to words or phrases in a context. Traditional methods of classification treat each word in isolation or at most use an N-gram approach, i.e. in vector space the words are represented as one-hot vectors, which are sparse and do not convey any meaning, whereas learning vector […]
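A small illustration of the one-hot limitation mentioned above (toy vectors, not from the post): distinct one-hot vectors are always orthogonal, so no notion of relatedness survives, while dense vectors can encode it.

```python
import numpy as np

# One-hot vectors: every pair of distinct words is orthogonal.
cat, dog = np.eye(5)[0], np.eye(5)[1]
print(cat @ dog)  # 0.0 regardless of how related the words are

# Dense (toy) embeddings can place related words close together.
cat_d, dog_d = np.array([0.8, 0.6]), np.array([0.7, 0.7])
cos = cat_d @ dog_d / (np.linalg.norm(cat_d) * np.linalg.norm(dog_d))
print(round(cos, 3))  # high cosine similarity
```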

Continue Reading →