Machine Learning, AI and Programming

Generative vs. Discriminative Spell Corrector

We have earlier seen two approaches of doing spelling corrections in text documents. Most of the spelling errors encountered are in either user generated contents or OCR outputs of document images. Presence of spelling errors introduce noise in data and as a result impact of important features gets diluted. Although the methods explained are  different in how they are implemented but theoretically both of them work on the same principle. In both the approaches, we model the joint probability distribution of words. In the first approach, we model the joint probability distribution of consecutive words whereas in the second approach we do not limit ourselves to only consecutive words but model joint probability distribution of multiple words in context.

While I was learning Machine Learning, I encountered a very nice tutorial on Generative vs. Discriminative classifier from the CS229 course material. Even after 2-3 readings of the paper, I still had only a vague understanding of the concept but gradually started to develop a more intuitive understanding. In generative classifiers, we model the joint probability of the input X and the class labels y, P(X, y), i.e. we assume that the data is generated from this distribution and during testing, we use the same probability distribution to infer the class probability distribution using bayesian rule.

Training step : Compute

P(X|y) and P(y), thus P(X, y)=P(X|y)P(y),

i.e. the joint probability of the instance X and class label y.

Testing step : Compute

P(y|X)=\frac{P(X, y)}{P(X)}=\frac{P(X, y)}{\sum_yP(X|y)P(y)}

Some commonly used generative classifiers are Naive Bayes and the Hidden Markov Model. In Naive Bayes, we model the joint probability distribution of the words (or features) and the class labels, whereas in HMM, we model the joint probability distribution of consecutive hidden states P(s_t, s_{t-1}) and the the joint probability distribution of hidden state and observed value P(o_t, s_t).

Whereas in discriminative classifiers such as Logistic Regression or Neural Networks, we directly model the conditional probability of a class given the input X,

P(y|X)=Q(y|X; \Theta)

for some probability distribution function Q and unknown parameters \Theta, or directly output a class label given the data point and the parameters such as in SVM.

During training we infer the assignment of \Theta that either maximizes the likelihood of the input data or minimize some loss function. During testing we re-use the same values of \Theta to compute the likelihood of a class label y given the input X' as Q(y|X'; \Theta) or just obtain the predicted class label.

For example in binary logistic regression, the class conditional probabilities are given as the sigmoid function:

Q(y=1|X; \Theta)=h_{\Theta}(X)=\frac{1}{1+e^{-{\Theta}X}} and Q(y=0|X; \Theta)=1-h_{\Theta}(X)

The parameters \Theta are computed by minimizing the cross entropy loss function :


The same conditional probabilities are also true for a 2-class neural network. For a multi-class neural network, the conditional probabilities are given by the softmax function :

Q(y=y_j|X; \Theta)=\frac{e^{{\Theta_j}X}}{\sum_ke^{-{\Theta_k}X}}

In this post, we will demonstrate this difference between generative vs. discriminative classifier with the help of spelling corrector. In earlier posts on solving the misspelling problem, we modeled the joint probability distribution of words. In the first approach, we compute the probability of the next and previous words given the middle word in a sequence from the given data, i.e. :

P(w_{t-1}=W_j|w_t=W_i) and P(w_{t+1}=W_k|w_t=W_i)

and also the frequency distribution of the words, i.e. P(w_t), where w_t implies the word at position 't' in a document.

which is same as computing the joint probability distribution, because

P(w_{t-1}=W_j, w_t=W_i)=P(w_{t-1}=W_j|w_t=W_i)P(w_t=W_i)

and used them to infer the probable correct words w_t in between w_{t-1} and w_{t+1}, i.e. :

P(w_t|w_{t-1})=\frac{P(w_t, w_{t-1})}{\sum_{w_t}P(w_{t-1}|w_t)P(w_t)}, and P(w_t|w_{t+1})=\frac{P(w_t, w_{t+1})}{\sum_{w_t}P(w_{t+1}|w_t)P(w_t)}

This is similar to our definition of the generative classifiers, only that instead of class labels, we use the words themselves to be the labels. After computing the probabilities P(w_t|w_{t-1}) and P(w_t|w_{t+1}), we score the possible words as follows :


i.e. consider the most probable word to be the one which is predicted by one or both the distribution with very high probabilities.

Note that instead of computing the score as the product of the probabilities (as in naive bayes independence assumption) :


we are summing the probabilities, because if either the previous word or the next word in the sequence is also a misspelling or was not present in the training corpus, then the score with product of probabilities will be very small even if the other probability in the sequence is very high. This is just another way of saying that the middle word should be predicted with high probability by either the previous or the next word or both.

For example, consider the phrase "machine learning algorithms".

During training on this data, we would compute the following joint probabilities :

P("machine", "learning") and P("algorithms", "learning")

since the phrase "machine learning algorithms" or "machine learning" or "learning algorithms" would occur across many documents compared to "machine leaming" or "leaming algorithms", thus one or both the probabilities would be quite high. Thus during inference, we would see that, the probability of the middle word being "learning", i.e. :

P("learning" | "machine") + P("learning" | "algorithms")

computed using the Bayes rule, is comparably higher than any other possible word from the corpus. So for a testing instance, if we encounter the phrase "machine leaming algorithms", we would know that the misspelling "leaming" is actually "learning" by using the Levenshtein distance method.

In the second approach, instead of limiting ourselves only to previous and next word in the sequence, we compute the joint probabilities of words within a context window. Consider the below phrase :

"understanding machine learning algorithms requires knowledge of maths"

Thus for the target word "learning" in the above sentence and with a context size of 2, we would be computing the following joint probabilities during training :

P("understanding", "learning"), P("machine", "learning"), P("algorithms", "learning") and P("requires", "learning")

In the inference step, we similarly use the Bayes rule to infer the conditional probabilities of a target word given the surrounding words in context and similar as above compute the score as the sum of the conditional probabilities of a target word.

S(w_t)=\sum_{k=-C, k{\ne}0}^C[1+P(w_t|w_{t+k})]

We added a constant of 1 to each conditional probability estimate, to give more weightage to the fact that a target word predicted by a higher number of context words should score higher compared to a target word predicted by lower number of context words. The final corrected word is computed by ordering the target words by their scores (highest to lowest) and considering only the top K target words. Then we use Levenshtein distance to compute the distance between the misspelled word and each possible target word. Based on the minimum distance we select the correct target word.

We have not yet explored building a discriminative classifier for correcting spellings in text documents. We will do that in this post.

In order to build a discriminative classifier, we need to define the inputs and the outputs (true labels) for the classifier and also what kind of objective function we want to optimize. Regarding choosing a classifier, we will go with a Neural Network classifier. By default, most neural networks libraries use the logistic loss for binary classification or the cross entropy loss function for multi-class classification.

There could be potentially many different ways to define the inputs and outputs to a neural network for enabling spelling correction capabilities. The most straightforward is to extend the same logic that we used in the case of the generative solution, i.e. the inputs are the context words and the output is the target word in each training instance. Thus for the sentence :

"understanding machine learning algorithms requires knowledge of maths"

with a context window size of 2, the training instances would look like :

  • Input : ["machine", "learning"], Output : ["understanding"]
  • Input : ["understanding", "learning", "algorithms"], Output : ["machine"]
  • Input : ["understanding", "machine", "algorithms", "requires"], Output : ["learning"]
  • Input : ["machine", "learning", "knowledge", "requires"], Output : ["algorithms"]

and so on. This looks familiar to the training of word vectors with the Continuous Bag Of Words model. But the objective is different in this case. We do not want to learn vector representations of words but want to predict the target word given the context words. Also we may not want to preserve any linear relations between words thus we can introduce back the non-linear activation function in the hidden layers.

Neural Network for Learning Target Words from Context Words

In order to implement this neural network based spelling correction model, we will re-use the Word2Vec code from the python gensim library. We will train the word2vec model on context size of 10, and after training is completed we will use the function 'predict_output_word' to predict the target word probability distribution given the context words w.r.t. a misspelled word. With the top 30 most probable target words, we will use the Edit distance method to compute which target word is closest match to the misspelled word.

We create two files, first one is that contains all functions related to file processing and tokenizing data and the other file contains all functions related to training and testing the spelling corrector.

In, we add the following functions :

import nltk, os, pickle
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

def getContents(type='train'):
    mydata = fetch_20newsgroups(subset=type, shuffle=True, random_state=42)

    return [" ".join(data.split("\n")) for data in]

def myTokenizer(text):
    return nltk.regexp_tokenize(text, "\b[a-zA-Z]{3,}\b")

def tokenizeContents(contents):
    return [myTokenizer(content) for content in contents]

def getTokens(contents):
    tokens = tokenizeContents(contents)
    tokens = [[word.lower() for word in token] for token in tokens]

    return tokens

The first function fetches the 20 Newsgroup data and the remaining functions are used to tokenize this data into words.

We add the following functions to to train the word2vec model :

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier
import Utilities, pickle
import gensim, logging, os, editdistance

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

def trainWord2VecModel(contents):

    print "Getting tokens..."
    tokens = Utilities.getTokens(contents)

    print "Training word2vec..."
    model = gensim.models.Word2Vec(tokens, min_count=10, window=10, size=300, sample=1e-5, workers=4, hs=1, iter=10)'word2vec')

    return model

trainContents = Utilities.getContents('train')
testContents = Utilities.getContents('test')

contents = trainContents + testContents

w2vModel = trainWord2VecModel(contents)

The above code will train the word2vec model on the entire (training + testing) 20 Newsgroup data. Next we define the function to do the spelling correction :

def spellCorrectW2V(w2vModel, word, contextWords, max_num_corrections=2):
    possibleWords = w2vModel.predict_output_word(contextWords, topn=30)
    print possibleWords

    minDist = max_num_corrections+1
    correctedWord = word

    for possibleWord, prob in possibleWords:
        dist = editdistance.eval(possibleWord, word)
        if dist < minDist and dist >= 1 and dist <= max_num_corrections and dist/len(word) <= 0.4:
            minDist = dist
            correctedWord = possibleWord

    return correctedWord

In this function first we call the 'predict_output_word' method with the context words and returns the top 30 most probable target words. Then we compute the edit distance of each target word with the misspelled word to determine which target word is the closest match to the incorrect word.

For example, in the original text

"purdue university engineering computer network distribution",

the university name "purdue" has been incorrectly misspelled as "pundue", i.e. the text is now :

"pundue university engineering computer network distribution"

To obtain the correct word, we call the above function with the context words

["university", "engineering", "computer", "network", "distribution"] and the word to be corrected ["pundue"]

print spellCorrectW2V(w2vModel, "pundue", "university engineering computer network distribution".split())
[(u'engineering', 0.012115832), (u'computer', 0.0080204243), (u'computing', 0.0053198021), (u'electrical', 0.0049158949), (u'mellon', 0.0045037922), (u'network', 0.0044503575), (u'purdue', 0.002529941), (u'carnegie', 0.0023504316), (u'science', 0.0019142606), (u'school', 0.001906618), (u'lin', 0.0017148891), (u'graduate', 0.0017083976), (u'avenue', 0.0016733981), (u'arpa', 0.0015475156), (u'pennsylvania', 0.0014345722), (u'department', 0.0012977718), (u'dept', 0.0012234363), (u'pricing', 0.0011592754), (u'robotics', 0.0011309865), (u'usa', 0.0011275145), (u'inet', 0.0010072426), (u'automation', 0.00097113283), (u'gurus', 0.00092486449), (u'forsale', 0.00089894817), (u'sciences', 0.00088622473), (u'keywords', 0.00088370551), (u'wang', 0.00088335789), (u'maryland', 0.00086987048), (u'ecn', 0.00083536567), (u'brookline', 0.00080532813)]

Returns the above target word probability distribution. Notice that the correct word "purdue" is the 7th most probable target word. Thus our algorithm returns "purdue" as the corrected version of "pundue".

Compared to the generative version of this algorithm, the discriminative algorithm ideally should perform better (although not verified for this particular application) because we are training the objective function to maximize the prediction probabilities of the true target word which is the output we desire. In the generative algorithm, we are training a context word with a target word independent of other context words (similar to naive bayes assumption), whereas in the neural network model, the weights for all context words are simultaneously updated for a target word.

The training of the generative classifier is much faster compared to the discriminative neural network classifier (because we only need to count probabilities) but prediction with neural network approach is much faster compared to the generative approach. But the problem with the word2vec approach is that it generally requires a lot of data to correctly build the model since the output nodes corresponds to the target words and the number of training instances corresponding to each target word would be too less for the algorithm to converge.

In another experiment, I tried out a variation of the discriminative approach, where instead of learning target words from surrounding context, I try to learn that given some characters in a word and a relative ordering of characters, which character(s) are missing. In this example, to generate one training instance, we take out one character from the word and based on the remaining word, generate input features to be the unigram characters and ordered bigrams.

For example, given the word "machine", there would be 7 training instances corresponding to each character position in the word. If we assume that the first character 'm' is missing, then the remaining word would be "achine" and the input features would be :

["a", "c", "h", "i", "n", "e", "ac", "ah", "ai", "an", "ae", "ch", "ci", "cn", "ce", "hi", "hn", "he", "in", "ie", "ne"]

and the corresponding output would be ["m"].

Similarly, assuming "h" is missing, then :

Input : ["m", "a", "c", "i", "n", "e", "ma", "mc", "mi", "mn", "me", "ac", "ai", "an", "ae", "ci", "cn", "ce", "in", "ie", "ne"]

Output : ["h"]

Thus for each word of length N, there would be N training instances assuming only a single character is missing, we can extend the same logic with two or more characters missing, in that case the number of instances would increase exponential in N. But we will be working only with 1 missing character. Hence the maximum number of input units for the neural network would be 26 + 26*26 = 702 and the number of output units would be 26.

The idea behind using the above strategy is as follows : For example if the word "machine" is misspelled as "machire", then for each character position in "machire", we will remove that character and with the remaining word, generate the features (unigrams and ordered bigram characters) as above and then use the neural network model to predict the most likely missing character for that position. So if we remove "r" in "machire" and then compute the most likely replacement for "r", we would obtain "n" and then check for the new word in the vocabulary.

Note that training with 1 missing character scenario will not work straightforward when the misspelling is at an edit distance of 2 or more from the correct word. We have developed a workaround for that but the method is not scalable. Also the above method does not work well if it is possible that for multiple replacements at the same position, we obtain words from vocabulary. For example, if the misspelled word is "lable", then the possible replacement for the first character "l" could be "t" (table), "c" (cable), "f" (fable) as all of them are words from the vocabulary.

The same could be said for the scenario, where if we replace characters at different positions in the word, we again obtain multiple words from the vocabulary. For example, consider the incorrect word "vind", if we replace "v" with "w" we get "wind" which is part of vocabulary, similarly replacing "d" with "e", we get "vine" which is also part of the vocabulary, so we do not know which one is the correct word here.

Note that for edit distance 1, we cannot only assume that a character in the word has changed, it can be that a character has been deleted or a character has been inserted wrongly. We need to take care of these scenarios during testing. To handle insertion of a character, we replace a character with an empty string.

Likewise to handle deletion of characters, not only we replace a character position with another character but also replace the in-between character positions with another character (this is simply adding another character to the word). Thus for a word of length N, we need to check the model for 2N+1 positions, N for each character position and N+1 for the in-between positions, i.e.


To generate the training features we add the following functions in file :

def breakWordIntoNgrams(word, ngram_size=2):
    if (ngram_size == 1):
        return list(word)
        ngrams = []
        for i in range(len(word) - ngram_size + 1):
            out = breakWordIntoNgrams(word[i + 1:], ngram_size - 1)
            out = [word[i] + x for x in out]
            ngrams = ngrams + out

    return ngrams

def generateInstanceFromWord(word, min_ngram_size, max_ngram_size):
    tokens = []
    for ngram_size in range(min_ngram_size, max_ngram_size + 1):
        tokens = tokens + breakWordIntoNgrams(word, ngram_size)

    return " ".join(tokens)

def getInstances(words, wordCounts, min_ngram_size=1, max_ngram_size=2, min_word_count=10, max_word_count=50):
    inputs = []
    outputs = []

    currWordCounts = dict()

    for word in words:
        if wordCounts[word] >= min_word_count and (word not in currWordCounts or currWordCounts[word] < max_word_count):
            if word not in currWordCounts:
                currWordCounts[word] = 1
                currWordCounts[word] = currWordCounts[word] + 1
            for i in range(len(word)):
                subword = word[0:i] + word[i + 1:len(word)]
                inp = generateInstanceFromWord(subword, min_ngram_size, max_ngram_size)

    return dict({'Inputs': inputs, 'Outputs': outputs})

def getWordCounts(words):
    wordCounts = dict()

    for word in words:
        if word in wordCounts:
            wordCounts[word] = wordCounts[word] + 1
            wordCounts[word] = 1

    return wordCounts

def getWords(contents):
    vectorizer = TfidfVectorizer(stop_words='english', tokenizer=myTokenizer)

    vocabulary = set(vectorizer.vocabulary_.keys())

    docWords = tokenizeContents(contents)

    outWords = [word for words in docWords for word in [word.lower() for word in words] if word in vocabulary]

    return outWords

The first function 'breakWordIntoNgrams' generates the unigram characters and the ordered bigrams for a word. The 'getInstances' function generates the inputs and outputs for the neural network to train. Only words with frequency count of at-least 10 are considered (assuming that words with less than 10 occurrences are generally incorrect words or noise). Also words with frequency counts greater than 50 are limited to only 50 occurrences.

The functions for training and testing with this input-output structure are defined in the file as follows :

def trainNN(trainData, trainLabels):
    clf = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(100), random_state=1), trainLabels)

    return clf

def train(instances, vectorizer):
    trainData = vectorizer.fit_transform(instances['Inputs'])
    trainLabels = instances['Outputs']

    return trainNN(trainData, trainLabels)

def test(clf, vectorizer, words):
    instances = [Utilities.generateInstanceFromWord(word, 1, 2) for word in words]
    testData = vectorizer.transform(instances)

    return [[x for (y, x) in sorted(zip(pred, clf.classes_), reverse=True)] for pred in clf.predict_proba(testData)]

def getSuggestions(clf, vectorizer, word, wordCounts, max_num_corrections=2, incorrectWordCountThreshold=10):
    possibleWords = []

    for i in range(2 * len(word)+1):
        if i % 2 == 0:
            leftSubWord, rightSubWord = word[0:(i / 2)], word[(i / 2):len(word)]
            leftSubWord, rightSubWord = word[0:(i - 1) / 2], word[((i - 1) / 2) + 1:len(word)]

        subword = leftSubWord + rightSubWord
        res = test(clf, vectorizer, [subword])

        res[0] = [''] + res[0]

        if max_num_corrections == 1:
            corrRange = 3
            corrRange = len(res[0])

        for j in range(corrRange):
            char = res[0][j]
            possibleWord = leftSubWord + char + rightSubWord

            if (max_num_corrections == 1):
                x = possibleWord in wordCounts and wordCounts[possibleWord] > incorrectWordCountThreshold
                a = x and word in wordCounts and wordCounts[possibleWord] > wordCounts[word]
                b = x and word not in wordCounts

                if a or b:
                possibleWords = possibleWords + getSuggestions(clf, vectorizer, possibleWord, wordCounts,
                                                               max_num_corrections - 1)

    return possibleWords

def spellCorrect(clf, vectorizer, word, wordCounts, max_num_corrections=2):
    suggestions = getSuggestions(clf, vectorizer, word, wordCounts, max_num_corrections)

    suggestions = set(suggestions)

    maxSuggestedWordCount = 0
    bestSuggestedWord = word

    for suggestion in suggestions:
        if wordCounts[suggestion] > maxSuggestedWordCount:
            maxSuggestedWordCount = wordCounts[suggestion]
            bestSuggestedWord = suggestion

    return bestSuggestedWord

def trainNNWordModel(contents):

    print "Getting words..."
    words = Utilities.getWords(contents)

    print "Getting word counts..."
    wordCounts = Utilities.getWordCounts(words)

    print "Getting instances..."
    instances = Utilities.getInstances(words, wordCounts, min_word_count=10, max_word_count=30)

    vectorizer = CountVectorizer(binary=True, token_pattern='\b[a-zA-Z]+\b')

    print "Training Neural Network..."
    clf = train(instances, vectorizer)

    model = dict({'classifier': clf, 'counts': wordCounts, 'vectorizer': vectorizer})

    print "Saving model..."
    pickle.dump(model, open('model.sav', 'wb'))

    return model

The 'trainNN' function trains a neural network classifier with one hidden layer of 100 units. The 'getSuggestions' function returns a set of possible suggestions for an incorrectly spelled word. For finding the correct word at an edit distance of 1, it is straightforward, where we check the top 3 predictions from the neural network for each of the 2N+1 positions described as above. For finding the correct word at an edit distance of 2, first we generate all possible words at a distance of 1, and then recursively obtain the suggestions with these words.

Right now, in the code, if there are multiple possible suggestions for the correct word, we take the one with the highest frequency of occurrence. As you understand that this approach is not correct since we saw earlier that if the correct word is "table" but its frequency is less than "cable" then "cable" would be assumed to be the correct word. A better solution would to combine with the approach of predicting target word given context words. We can use the generative classifier described earlier to determine the correct word from the possible suggestions instead of using word frequency counts.

Another possible solution is to add the context words as inputs to the neural network in addition to the character n-grams described earlier. In this way if the characters in the word and their ordering are not sufficient to identify the correct word, then the context words would help to distinguish between the different possible correct words. With this approach, the number of input units in the worst case would be the possible number of words in the vocabulary.

For example, given the phrase :

"dinner is being served at the dining table"

Then with a context size of 3 (ignoring stop-words), the inputs to the NN for the word "table", assuming that "t" is missing, would be :

["dinner", "served", "dining", "a", "b", "l", "e", "ab", "al", "ae", "bl", "be", "le"]

Now if "table" is misspelled as "lable", then in the place of the first character "l", we will predict "t" with very high confidence.

The entire code is shared on my Github page.


Tags: , , , ,