
Building a POS Tagger with Python NLTK and Scikit-Learn

In this post we are going to learn about Part-Of-Speech taggers for the English language and look at multiple methods of building a POS tagger with the help of the Python NLTK and scikit-learn libraries. The available methods range from simple regular-expression-based taggers to classifier-based taggers (Naive Bayes, Neural Networks and Decision Trees) and then sequence-model-based taggers (Hidden Markov Model, Maximum Entropy Markov Model and Conditional Random Fields).

Commonly used parts of speech in the English language are nouns, pronouns, verbs, adverbs, adjectives, prepositions, conjunctions, articles and interjections. For example, in the sentence "Paris is the capital of France", "Paris" is a noun, "is" is a verb, "the" is an article or determiner, "capital" is a noun, "of" is a preposition and "France" is a noun. I am sure most of us have forgotten how to determine which word corresponds to which POS given a sentence. I had to review some of these concepts.

In POS tagging, tags are represented by abbreviations. For example, proper nouns are represented by NNP, adjectives by JJ, adverbs by RB, past-tense verbs by VBD and so on. The Penn Treebank has a comprehensive list of these abbreviations. Although it helps to have an understanding of the language in context and its parts of speech in order to come up with better features for training a POS tagger, in this post we will be reusing features used elsewhere.

How does POS tagging help in machine learning and natural language processing tasks?

  • POS tags of the words in text documents can be used to select only certain words or N-grams as features for text classification, e.g. only proper nouns or noun phrases (after chunking), thus enabling feature reduction. Verbs, adverbs, adjectives etc. are often not very useful features for classification purposes (although this assumption only works for classification within specific domains such as financial documents or news articles; verbs, adverbs and adjectives play an important role in spam/ham classification or sentiment analysis).
  • POS tags can be used to extract named entities from a text (next post). For example, given a corpus of movie reviews, one can extract movie names, actor names etc. from the reviews, which can then be used to formulate relationships between different movies or actors. Later, one can answer queries such as
    • "Which actors have acted in the movie XYZ?" or
    • "How many times has actor A acted alongside actor B?" and so on.
  • POS tags can be used to chunk texts into phrases. In text classification, we mostly use N-grams, i.e. a fixed number of words (usually 1, 2 or 3) as a feature. Each N-gram may not convey any meaning at all, but nonetheless N-grams can be useful features for the classifier. The drawback is that if certain phrases of arbitrary lengths are useful features (assuming words are not removed based on their POS tags), then although bi-grams act as a substitute for phrases, the number of bi-grams required to encapsulate a phrase of length N is linear in N, whereas with phrase chunking one can obtain an entire phrase as a feature and thus reduce the number of features.
  • Spell Correction : identifying the possible POS tags for an incorrectly spelled word, based on the correct POS tags of the neighboring words, helps us reduce the search space for the correct word (the correct word is now constrained to the possible POS tags; for this we need to map POS tags to the set of valid words in the corpus).
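As a minimal sketch of the last idea, one can build an inverted index from POS tags to the words observed with those tags, so that a spell corrector only searches candidates within the tags predicted from context (the tiny corpus below is purely illustrative):

```python
from collections import defaultdict

# Toy tagged corpus; in practice this would come from e.g. the Brown corpus.
tagged_words = [('the', 'DT'), ('cat', 'NN'), ('sat', 'VBD'),
                ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]

# Map each POS tag to the set of words observed with that tag.
tag_to_words = defaultdict(set)
for word, tag in tagged_words:
    tag_to_words[tag].add(word)

# If the neighboring tags suggest the misspelled word is a noun,
# we only need to search for candidates among the NN words.
candidates = tag_to_words['NN']
print(sorted(candidates))  # ['cat', 'mat']
```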

The Python NLTK library comes with tagged corpora for the English language and also contains methods for predicting the tags given a sentence as input. For example, in order to generate the POS tags for the sentence "Paris is the capital of France", one can use the following code :

from nltk import word_tokenize, pos_tag

print pos_tag(word_tokenize("Paris is the capital of France"))

The output is a list of (word, tag) tuples as follows :

[('Paris', 'NNP'), ('is', 'VBZ'), ('the', 'DT'), ('capital', 'NN'), ('of', 'IN'), ('France', 'NNP')]

We are going to use the Brown corpus for training, testing and evaluation of our POS tagger.

from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents()

print brown_tagged_sents[0]

Each sentence is an ordered list of words as they appear in a sentence, along with their POS tags :

[(u'The', u'AT'), (u'Fulton', u'NP-TL'), (u'County', u'NN-TL'), (u'Grand', u'JJ-TL'), (u'Jury', u'NN-TL'), (u'said', u'VBD'), (u'Friday', u'NR'), (u'an', u'AT'), (u'investigation', u'NN'), (u'of', u'IN'), (u"Atlanta's", u'NP$'), (u'recent', u'JJ'), (u'primary', u'NN'), (u'election', u'NN'), (u'produced', u'VBD'), (u'``', u'``'), (u'no', u'AT'), (u'evidence', u'NN'), (u"''", u"''"), (u'that', u'CS'), (u'any', u'DTI'), (u'irregularities', u'NNS'), (u'took', u'VBD'), (u'place', u'NN'), (u'.', u'.')]

Instead of starting with a rule-based or regex-based tagger, we are straightaway going to use machine learning algorithms to model the POS tags in the Brown corpus. But before diving into machine learning algorithms, a few other methods deserve a mention, such as the UnigramTagger and the general N-gram tagger.

In the UnigramTagger, we create a map from each word in the corpus to the most likely tag for that word in the corpus. A word can have multiple possible tags depending on the context in which it is mentioned, but we ignore the context in the unigram case and consider the tag for that word which is most common across the corpus. For example, the word "frequent" is mapped to the tag JJ (adjective) even if there are instances where "frequent" is used as a verb (as in "I frequent this cafe").
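The unigram map can be sketched in a few lines with standard-library counters (the toy corpus below is made up to reproduce the "frequent" example):

```python
from collections import Counter, defaultdict

# Toy tagged corpus; "frequent" appears twice as JJ and once as VB.
tagged_words = [('frequent', 'JJ'), ('frequent', 'JJ'), ('frequent', 'VB'),
                ('cafe', 'NN')]

# Count (word, tag) occurrences, then keep the most common tag per word.
tag_counts = defaultdict(Counter)
for word, tag in tagged_words:
    tag_counts[word][tag] += 1

unigram_map = {word: counts.most_common(1)[0][0]
               for word, counts in tag_counts.items()}

print(unigram_map['frequent'])  # 'JJ' -- the verb usage is ignored
```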

In the general N-gram tagger, instead of creating a map from each word to its most likely tag in the training instances, we create a map from the tuple

(w_i, t_{i-1}, t_{i-2}, ..., t_{i-(N-1)})

to the most likely tag for this tuple, where t_{i-1} refers to the POS tag of the previous word w_{i-1} (during training this is the actual tag, whereas during testing this is the predicted tag), t_{i-2} refers to the POS tag of the word before that, w_{i-2}, and so on. For example, for a bi-gram tagger, we would first find the set of all possible tags T_i for the word w_i given that the predicted tag of the previous word is t_{i-1}, and then extract the tag from T_i which occurs the maximum number of times.

In a sense, N-gram taggers take the previous context into account and should perform better than the unigram tagger.
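A bigram tagger can be sketched as a lookup on (word, previous tag) pairs that backs off to a unigram map when the pair was never seen in training (toy data, chosen so "walk" is a verb after "to" but a noun after a determiner):

```python
from collections import Counter, defaultdict

# Toy training observations: (word, previous tag) -> observed tag.
train = [(('walk', 'TO'), 'VB'),   # "to walk"
         (('walk', 'DT'), 'NN'),   # "a walk"
         (('walk', 'DT'), 'NN')]

bigram_counts = defaultdict(Counter)
for context, tag in train:
    bigram_counts[context][tag] += 1

unigram_map = {'walk': 'NN'}  # backoff: most common tag for the word overall

def bigram_tag(word, prev_tag):
    context = (word, prev_tag)
    if context in bigram_counts:
        return bigram_counts[context].most_common(1)[0][0]
    # Back off to the unigram map, then to a default tag.
    return unigram_map.get(word, 'NN')

print(bigram_tag('walk', 'TO'))  # 'VB'
print(bigram_tag('walk', 'JJ'))  # unseen context -> backoff -> 'NN'
```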

Tri-Gram Tagger

Below is the Python code for training and testing with N-gram taggers. We split the Brown corpus 70:30 into training and testing sets, generate the N-gram model on the 70% training data and evaluate it on the remaining 30% testing data.

import nltk
from nltk.corpus import brown
from pickle import dump

def ngramTagger(train_sents, n=2, defaultTag='NN'):
    t0 = nltk.DefaultTagger(defaultTag)

    if (n <= 0):
        return t0
    elif (n == 1):
        t1 = nltk.UnigramTagger(train_sents, backoff=t0)
        return t1
    elif (n == 2):
        t1 = nltk.UnigramTagger(train_sents, backoff=t0)
        t2 = nltk.BigramTagger(train_sents, backoff=t1)
        return t2
    else:
        t1 = nltk.UnigramTagger(train_sents, backoff=t0)
        t2 = nltk.BigramTagger(train_sents, backoff=t1)
        t3 = nltk.TrigramTagger(train_sents, backoff=t2)
        return t3

brown_tagged_sents = brown.tagged_sents()

size = int(len(brown_tagged_sents) * 0.7)

tags = [tag for (word, tag) in brown.tagged_words()]
defaultTag = nltk.FreqDist(tags).max()
print defaultTag

train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]

tagger = ngramTagger(train_sents, 2, defaultTag)
print tagger.evaluate(test_sents)

Observe the "ngramTagger" function. For values of n less than 1, we are using a default tagger, i.e. a tagger that assigns each word the tag which is most common across the corpus (not for that word in particular, as in the unigram tagger). For example, if the POS tag 'NN' occurs for 60% of the words in the corpus, then by assigning the tag 'NN' to all un-tagged words, we are at least 60% accurate. This is the logic behind the DefaultTagger.

For each of the unigram, bigram and trigram taggers, we are using a backoff parameter, which says that if we are unable to tag a word using a tagger, then we fall back to the backoff tagger. For example, in the unigram tagger, if a word occurs in the testing data but was never encountered in the training data, then the unigram tagger would assign 'None' to this word; but if we instead assign the most common tag (from the DefaultTagger) as a fallback option, then we are more likely to be correct.

Similarly for a bigram tagger, the joint probability of a particular word 'w' and a particular predicted POS tag 't' for the previous word occurring in the testing data is much smaller than the probability of the word alone occurring,

P(w_i = w, t_{i-1} = t) \leq P(w_i = w)

and as a result many more words in the testing data would be assigned 'None'. Thus a fallback option is to use the unigram tagger in cases where we are not able to find the (w_i = w, t_{i-1} = t) pair.

Following are the accuracy numbers obtained with various values of n :

  • n=0 : 11.12% (DefaultTagger)
  • n=1  : 88% (UnigramTagger) with backoff, 87.2% without backoff
  • n=2 : 90.19% (BigramTagger) with backoff, 31.1% without backoff
  • n=3 : 90.21% (TrigramTagger) with backoff, 17.2% without backoff

We do not gain much in accuracy while going from a bigram to a trigram tagger. Note that without backoff, the accuracy decreases as we increase the value of n.

Next we look at how we can train a POS tagger using some common machine learning algorithms. But first we have to decide on what features might be important for learning POS tags. The N-gram tagger approach prevents us from using more complex features such as the prefix and suffix of the word, whether the word is in all uppercase/lowercase, and other such properties. For example, the 2-letter suffix "-ed" is a great indicator of past-tense verbs. The 3-letter suffix "-ing" helps recognize the present participle. Adverbs mostly end with the suffix "-ly".
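The suffix cues above can be checked with a crude standalone heuristic; this is not the classifier itself, just an illustration of the kind of regularity suffix features let a classifier learn:

```python
def guess_tag_from_suffix(word):
    """Crude suffix heuristics of the kind a classifier can learn."""
    if word.endswith('ed'):
        return 'VBD'   # past-tense verb
    if word.endswith('ing'):
        return 'VBG'   # present participle
    if word.endswith('ly'):
        return 'RB'    # adverb
    return 'NN'        # fall back to the most common tag

print(guess_tag_from_suffix('walked'))   # VBD
print(guess_tag_from_suffix('walking'))  # VBG
print(guess_tag_from_suffix('quickly'))  # RB
```

Of course such rules misfire on words like "red" or "sing", which is why the classifier combines many features instead of relying on one.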

Below is the Python function for generating the features for each word (at position "index") in a "sentence" :

def features(sentence, index):

    currWord = sentence[index][0]

    if (index > 0):
        prevWord = sentence[index - 1][0]
    else:
        prevWord = '<START>'

    if (index < len(sentence)-1):
        nextWord = sentence[index + 1][0]
    else:
        nextWord = '<END>'

    return {
        'word': currWord,
        'is_first': index == 0,
        'is_last': index == len(sentence) - 1,
        'curr_is_title': currWord.istitle(),
        'prev_is_title': prevWord.istitle(),
        'next_is_title': nextWord.istitle(),
        'curr_is_lower': currWord.islower(),
        'prev_is_lower': prevWord.islower(),
        'next_is_lower': nextWord.islower(),
        'curr_is_upper': currWord.isupper(),
        'prev_is_upper': prevWord.isupper(),
        'next_is_upper': nextWord.isupper(),
        'curr_is_digit': currWord.isdigit(),
        'prev_is_digit': prevWord.isdigit(),
        'next_is_digit': nextWord.isdigit(),
        'curr_prefix-1': currWord[0],
        'curr_prefix-2': currWord[:2],
        'curr_prefix-3': currWord[:3],
        'curr_suffix-1': currWord[-1],
        'curr_suffix-2': currWord[-2:],
        'curr_suffix-3': currWord[-3:],
        'prev_prefix-1': prevWord[0],
        'prev_prefix-2': prevWord[:2],
        'prev_prefix-3': prevWord[:3],
        'prev_suffix-1': prevWord[-1],
        'prev_suffix-2': prevWord[-2:],
        'prev_suffix-3': prevWord[-3:],
        'next_prefix-1': nextWord[0],
        'next_prefix-2': nextWord[:2],
        'next_prefix-3': nextWord[:3],
        'next_suffix-1': nextWord[-1],
        'next_suffix-2': nextWord[-2:],
        'next_suffix-3': nextWord[-3:],
        'prev_word': prevWord,
        'next_word': nextWord,
    }
Note that apart from the current word (at position "index" in the above function) properties, we are also using the same properties of the previous and the next words in the sequence in the same sentence.

print features(brown_tagged_sents[0], 3)

will print the features for the 4th word ("Grand") in the first sentence of the Brown corpus :

{'prev_word': u'County', 'is_first': False, 'prev_suffix-3': u'nty', 'curr_prefix-3': u'Gra', 'prev_suffix-1': u'y', 'next_prefix-3': u'Jur', 'next_is_digit': False, 'prev_prefix-3': u'Cou', 'prev_prefix-1': u'C', 'curr_suffix-2': u'nd', 'curr_suffix-3': u'and', 'curr_suffix-1': u'd', 'next_prefix-1': u'J', 'next_is_title': True, 'next_is_lower': False, 'prev_suffix-2': u'ty', 'prev_is_title': True, 'prev_is_lower': False, 'curr_is_upper': False, 'word': u'Grand', 'prev_is_digit': False, 'next_suffix-3': u'ury', 'next_suffix-2': u'ry', 'next_suffix-1': u'y', 'curr_is_digit': False, 'prev_is_upper': False, 'prev_prefix-2': u'Co', 'next_word': u'Jury', 'is_last': False, 'curr_prefix-1': u'G', 'curr_is_lower': False, 'next_is_upper': False, 'curr_prefix-2': u'Gr', 'next_prefix-2': u'Ju', 'curr_is_title': True}

In order to train a classifier from the scikit-learn package, we need to transform the features and the labels for each word in each sentence into the proper format, as done by the following function :

def transformDataset(sentences):
    wordFeatures = []
    wordLabels = []

    for sent in sentences:
        for index in range(len(sent)):
            wordFeatures.append(features(sent, index))
            wordLabels.append(sent[index][1])

    return wordFeatures, wordLabels

Then we can define the function for fitting a decision tree classifier to the training sentences. We use the Pipeline class from scikit-learn for this.

def trainDecisionTree(trainFeatures, trainLabels):
    clf = Pipeline([
        ('vectorizer', DictVectorizer(sparse=False)),
        ('classifier', DecisionTreeClassifier(criterion='entropy'))
    ])
    clf.fit(trainFeatures, trainLabels)
    return clf

Then we can use the returned model to compute the accuracy score on the test sentences

clf.score(testFeatures, testLabels)

Since the number of training words in the Brown corpus is around 0.9 million, the size of the feature matrix would be 0.9 million times 36 (the number of features per word), which does not fit into the memory of my MacBook Air. So instead of using the entire corpus, we use only the "news" category to train and test the POS tagging models. The number of training words for the "news" category alone is around 69K.

Similar to the decision tree classifier, we can also define Naive Bayes and Neural Network classifiers to model the training instances. Below is the entire Python code for comparing the numbers with the different classifiers. The dataset is split 70:30 (training:testing).

The accuracy numbers on the training data are computed using 5-fold cross validation and then averaged.

def transformDataset(sentences):
    wordFeatures = []
    wordLabels = []

    for sent in sentences:
        for index in range(len(sent)):
            wordFeatures.append(features(sent, index))
            wordLabels.append(sent[index][1])

    return wordFeatures, wordLabels

def trainDecisionTree(trainFeatures, trainLabels):
    clf = make_pipeline(DictVectorizer(sparse=False), DecisionTreeClassifier(criterion='entropy'))
    scores = cross_val_score(clf, trainFeatures, trainLabels, cv=5)
    clf.fit(trainFeatures, trainLabels)
    return clf, scores.mean()

def trainNaiveBayes(trainFeatures, trainLabels):
    clf = make_pipeline(DictVectorizer(sparse=False), MultinomialNB())
    scores = cross_val_score(clf, trainFeatures, trainLabels, cv=5)
    clf.fit(trainFeatures, trainLabels)
    return clf, scores.mean()

def trainNN(trainFeatures, trainLabels):
    clf = make_pipeline(DictVectorizer(sparse=False),
                        MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(100,), random_state=1))
    scores = cross_val_score(clf, trainFeatures, trainLabels, cv=5)
    clf.fit(trainFeatures, trainLabels)
    return clf, scores.mean()

brown_tagged_sents = brown.tagged_sents(categories='news')

size = int(len(brown_tagged_sents) * 0.7)

tags = [tag for (word, tag) in brown.tagged_words()]
defaultTag = nltk.FreqDist(tags).max()

train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]

tagger = ngramTagger(train_sents, 2, defaultTag)
print tagger.evaluate(test_sents)

trainFeatures, trainLabels = transformDataset(train_sents)
testFeatures, testLabels = transformDataset(test_sents)

tree_model, tree_model_cv_score = trainDecisionTree(trainFeatures[:30000], trainLabels[:30000])
print tree_model_cv_score
print tree_model.score(testFeatures, testLabels)

nb_model, nb_model_cv_score = trainNaiveBayes(trainFeatures[:30000], trainLabels[:30000])
print nb_model_cv_score
print nb_model.score(testFeatures, testLabels) 

nn_model, nn_model_cv_score = trainNN(trainFeatures[:30000], trainLabels[:30000])
print nn_model_cv_score
print nn_model.score(testFeatures, testLabels)

On running the above codes, we obtain the following scores :

  • Bigram Tagger with backoff : 83.4% on testing data
  • Decision Tree with the first 30K words : 86.7% on training data and 85.3% on testing data
  • Naive Bayes with the first 30K words : 76.3% on training data and 74.9% on testing data
  • Neural Network with the first 30K words : 90.2% on training data and 89% on testing data

Thus we see that, compared to a bi-gram tagger, decision trees and neural networks perform better in terms of accuracy on the test sentences; the Neural Network performs best among them. The numbers with the bi-gram tagger are not directly comparable, as we are considering only the first 30K words from the training data for both the Neural Network and the Decision Tree. But with more words and more features, the accuracy numbers with the classifiers should increase further.

The improved performance of the classifier-based taggers compared to the bi-gram tagger can be understood from the fact that, when the (word, previous word tag) pair was never encountered in the training data, or the word itself was never encountered, the bi-gram tagger's likely behavior is to assign the default 'NN' tag (most common in the training data). A classifier, on the other hand, assigns a tag to such a word based not only on the identity of the word (which was never learnt from the training data), but also on other properties/features of the word, such as the previous and the next words, whether it is a digit, a title or upper case, its prefixes and suffixes, and so on.

Note the difference in performance between the Naive Bayes classifier and the Neural Network. The NB classifier is a generative algorithm whereas the Neural Network is a discriminative algorithm: NB models the joint probability distribution of the POS tag and the features. Similar to the N-gram tagger, if one or more features for a word in the testing data were never encountered with the actual POS tag of that word in the training data, then the joint probability of the features with the actual tag would be very low (but non-zero due to the smoothing factor), and the classifier is likely to make an error in prediction. Discriminative classifiers like Neural Networks do not rely on joint probabilities but learn weights instead, and thus perform considerably better than the NB classifier.

Classification algorithms assume that the POS tags in a sentence are independent of one another and depend only on the features of each word. But in reality, if we know that the current tag is DT (an article or determiner), then it is more likely to be followed by a noun (NN): for example, "The (DT) box (NN) is empty", "A (DT) bird (NN) is flying" and so on. To incorporate this kind of information into POS tagging, we are going to use sequence modeling algorithms: HMM and CRF.
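The DT-followed-by-NN regularity can be verified directly by counting tag bigrams; here is a sketch on two toy sentences built from the examples above:

```python
from collections import Counter

# Toy tagged sentences: count which tag follows a determiner (DT).
sents = [[('The', 'DT'), ('box', 'NN'), ('is', 'VBZ'), ('empty', 'JJ')],
         [('A', 'DT'), ('bird', 'NN'), ('is', 'VBZ'), ('flying', 'VBG')]]

after_dt = Counter()
for sent in sents:
    for (w1, t1), (w2, t2) in zip(sent, sent[1:]):
        if t1 == 'DT':
            after_dt[t2] += 1

print(after_dt.most_common(1))  # [('NN', 2)] -- DT is always followed by NN here
```

On a real corpus, these counts become the transition probabilities used by the sequence models below.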

The Hidden Markov Model is a generative sequence modeling algorithm (the sequence counterpart of Naive Bayes), whereas Conditional Random Fields is a discriminative sequence modeling algorithm. We are going to follow the tutorial on these slides to implement an HMM for POS tagging. The HMM requires us to learn the parameters of the model, the transition probabilities from one POS tag to another and the emission probabilities of each word feature given a POS tag, from only the observed sentences. The POS tags for the words are assumed to be hidden (i.e. not given). The learning follows an EM algorithm, where we alternate between computing the expected values of the hidden POS tags for each word in the training sentences (E step) and re-estimating the transition and emission probability matrices from those expectations (M step).

Since in our problem, we have access to the POS tags of the training sentences, we will not use an EM algorithm to estimate the parameters, because we can directly compute the transition and the emission probabilities from the features and the annotated tags for each word. Using these probabilities we will predict the most likely sequence of POS tags for a test sentence using the Viterbi algorithm.

Below are the functions for computing the probabilities of each POS tag in the training sentences, the probabilities of each POS tag being the start tag of a sentence, and the transition probabilities from one POS tag to another across all training sentences. Note that for the first word in a sentence, instead of the transition probabilities, we use the start probabilities.

import numpy as np
from sklearn.preprocessing import normalize

def computeTagProbs(trainLabels, tagsDict):
    numTags = len(tagsDict)
    tagProbs = np.zeros(numTags)

    for sentenceLabels in trainLabels:
        for tag in sentenceLabels:
            tagProbs[tagsDict[tag]] += 1

    tagProbs += 1

    return tagProbs / np.sum(tagProbs)

def computeStartProbs(trainLabels, tagsDict):
    numTags = len(tagsDict)
    startProbs = np.zeros(numTags)

    for sentenceLabels in trainLabels:
        startTag = sentenceLabels[0]
        startProbs[tagsDict[startTag]] += 1

    startProbs += 1

    return startProbs/np.sum(startProbs)

def computeTransitionProbabilities(trainLabels, tagsDict):
    numTags = len(tagsDict)
    transMat = np.zeros(shape=(numTags, numTags))

    for sentenceLabels in trainLabels:
        for i in range(len(sentenceLabels)-1):
            tag1 = tagsDict[sentenceLabels[i]]
            tag2 = tagsDict[sentenceLabels[i+1]]
            transMat[tag1, tag2] += 1

    normalized_transmat = normalize(transMat+1, axis=1, norm='l1')

    return normalized_transmat

Also note that for the tag, start and transition probabilities, instead of giving a probability of 0 to a POS tag which never occurred at the beginning of a sentence, or to a pair of tags which never follow one another in a sentence, we add 1 to every count before normalizing, so that every event receives a small non-zero probability. This is known as (add-one or Laplace) smoothing.
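A quick numeric sketch of this add-one smoothing: with raw counts, a tag that never starts a sentence would get start probability 0; adding 1 to every count before normalizing gives it a small non-zero probability instead (the counts below are made up):

```python
# Toy start-tag counts over 3 possible tags; 'VB' never starts a sentence.
counts = {'DT': 6, 'NN': 2, 'VB': 0}

# Add 1 to each count, then normalize so the probabilities sum to 1.
total = sum(counts.values()) + len(counts)
start_probs = {tag: (c + 1) / total for tag, c in counts.items()}

print(start_probs['VB'])  # 1/11 instead of 0
```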

Next we define the function for computing the emission probabilities, i.e. given a particular POS tag, the conditional probability of observing a word feature for that tag.

def computeEmissionProbabilities(trainFeatures, trainLabels, tagsDict):
    numTags = len(tagsDict)

    emissionDict = defaultdict(lambda: defaultdict(int))
    uniqueKeys = set()

    for i in range(len(trainLabels)):
        sentenceFeatures = trainFeatures[i]
        sentenceLabels = trainLabels[i]

        for j in range(len(sentenceLabels)):
            tag = sentenceLabels[j]

            for key, val in sentenceFeatures[j].iteritems():
                transformedKey = str(key) + "__" + str(val)
                uniqueKeys.add(transformedKey)
                emissionDict[tag][transformedKey] += 1

    emissionMat = np.zeros(shape=(numTags, len(uniqueKeys)))

    featuresDict = {}
    for index, key in enumerate(uniqueKeys):
        featuresDict[key] = index

    for tag in tagsDict.keys():
        for key in featuresDict.keys():
            i = tagsDict[tag]
            j = featuresDict[key]

            emissionMat[i, j] = emissionDict[tag][key]

    normalized_emissionMat = normalize(emissionMat+1, axis=1, norm='l1')

    return normalized_emissionMat, featuresDict

Similar to the transition probabilities, we assign a conditional probability of 1/(total number of features) to a word feature which was never encountered for a particular POS tag in training. Note how we are computing the features: since the features for each word form a Python dict, we construct each transformed feature by concatenating the key and value pair from the dict. This way we generate all possible features for the training words.
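The key-value concatenation can be seen on a small hand-picked subset of a word's feature dict (abridged from the "Grand" example above):

```python
# A few entries of a word's feature dict, as produced by features() above.
feat = {'word': 'Grand', 'is_first': False, 'curr_suffix-2': 'nd'}

# Each (key, value) pair becomes one string feature "key__value".
transformed = [str(key) + "__" + str(val) for key, val in feat.items()]

print(sorted(transformed))
# ['curr_suffix-2__nd', 'is_first__False', 'word__Grand']
```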

'tagsDict' is an inverted index for POS tags, whereas 'featuresDict' is an inverted index for word features.

Lastly we define the function to do prediction on the test sentences. The 'predictTags' function computes the most likely POS tag assignment for a sequence of words in a sentence using the Viterbi algorithm.

import math
import sys
import numpy as np

def predictTags(testFeatures, tagProbs, startProbs, transMat, emissionMat, tagsDict, featuresDict):
    numTags = len(tagsDict)

    bestTags = []

    for sentenceFeatures in testFeatures:
        bestTagsSentence = []
        lenSentence = len(sentenceFeatures)

        probMatrix, tagMatrix = np.zeros(shape=(lenSentence, numTags)), np.zeros(shape=(lenSentence, numTags))

        for index in range(lenSentence):
            feat = sentenceFeatures[index]

            for curr in range(numTags):

                emissionProb = 0
                for key, val in feat.iteritems():
                    transformedKey = str(key) + "__" + str(val)

                    if transformedKey in featuresDict:
                        emissionProb += math.log(emissionMat[curr, featuresDict[transformedKey]])
                    else:
                        # Unseen feature: assign probability 1/(number of features)
                        emissionProb -= math.log(len(featuresDict))

                emissionProb += math.log(tagProbs[curr])

                if index == 0:
                    probMatrix[index][curr] = math.log(startProbs[curr]) + emissionProb
                    tagMatrix[index][curr] = -1
                else:
                    maxProb = -sys.float_info.max
                    maxProbTag = -1

                    for prev in range(numTags):
                        tagProb = math.log(transMat[prev, curr]) + math.log(probMatrix[index - 1][prev])

                        if (tagProb > maxProb):
                            maxProb = tagProb
                            maxProbTag = prev

                    maxProb += emissionProb

                    probMatrix[index][curr] = maxProb
                    tagMatrix[index][curr] = maxProbTag

            # Re-scale the log probabilities to avoid numerical underflow,
            # then normalize the current row to probabilities.
            const = -np.mean(probMatrix[index])

            func = np.vectorize(lambda t: math.exp(t + const))
            probMatrix[index] = func(probMatrix[index])

            probMatrix[index] = probMatrix[index] / np.sum(probMatrix[index])

        # Backtrack from the best final tag to recover the best tag sequence.
        prevBestTag = None

        for index in reversed(range(lenSentence + 1)):
            if index == lenSentence:
                bestTag = probMatrix[index - 1].argmax()
            else:
                bestTag = tagMatrix[index][prevBestTag]

            if index > 0:
                bestTagsSentence.insert(0, int(bestTag))

            prevBestTag = int(bestTag)

        bestTags.append(bestTagsSentence)

    return bestTags

Note that the algorithm defined so far is very similar to how Naive Bayes works: first we compute the conditional probabilities of the features given the tags, then use these probabilities to predict the most probable class label for an un-tagged word. The only difference is that now we are also using transition probabilities, which define the likelihood of observing a POS tag conditioned on the tag of the previous word in the same sentence.
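To see the Viterbi recursion in isolation, here is a minimal stdlib-only version on a toy two-tag HMM; all probabilities are made up for illustration and the emission is over words rather than feature dicts:

```python
import math

tags = ['DT', 'NN']
start = {'DT': 0.7, 'NN': 0.3}
trans = {('DT', 'DT'): 0.1, ('DT', 'NN'): 0.9,
         ('NN', 'DT'): 0.4, ('NN', 'NN'): 0.6}
emit = {('DT', 'the'): 0.9, ('DT', 'dog'): 0.1,
        ('NN', 'the'): 0.1, ('NN', 'dog'): 0.9}

def viterbi(words):
    # best[i][t]: log-prob of the best tag sequence ending with tag t at position i
    best = [{t: math.log(start[t]) + math.log(emit[(t, words[0])]) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: best[i-1][p] + math.log(trans[(p, t)]))
            best[i][t] = best[i-1][prev] + math.log(trans[(prev, t)]) + math.log(emit[(t, words[i])])
            back[i][t] = prev
    # Backtrack from the best final tag.
    seq = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        seq.insert(0, back[i][seq[0]])
    return seq

print(viterbi(['the', 'dog']))  # ['DT', 'NN']
```

The predictTags function above does the same thing, with the emission probability replaced by a product over the word's features and with per-row re-normalization for numerical stability.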

For the sequence models, we group the features and labels per sentence instead of flattening them as in transformDataset :

def transformDatasetSequence(sentences):
    wordFeatures, wordLabels = [], []

    for sent in sentences:
        feats, labels = [], []

        for index in range(len(sent)):
            feats.append(features(sent, index))
            labels.append(sent[index][1])

        wordFeatures.append(feats)
        wordLabels.append(labels)

    return wordFeatures, wordLabels

def trainHMM(trainFeatures, trainLabels, tagsDict):

    tagProbs = computeTagProbs(trainLabels, tagsDict)
    startProbs = computeStartProbs(trainLabels, tagsDict)
    transMat = computeTransitionProbabilities(trainLabels, tagsDict)
    emissionMat, featuresDict = computeEmissionProbabilities(trainFeatures, trainLabels, tagsDict)

    return tagProbs, startProbs, transMat, emissionMat, featuresDict

tagsDict = {}
for index, tag in enumerate(set(tags)):
    tagsDict[tag] = index

trainSeqFeatures, trainSeqLabels = transformDatasetSequence(train_sents)
testSeqFeatures, testSeqLabels = transformDatasetSequence(test_sents)

tagProbs, startProbs, transMat, emissionMat, featuresDict = trainHMM(trainSeqFeatures[:30000], trainSeqLabels[:30000], tagsDict)

predictedTags = predictTags(testSeqFeatures[:100], tagProbs, startProbs, transMat, emissionMat, tagsDict, featuresDict)

Due to the multiple nested for-loops in the predictTags code, prediction is slow. With 100 test sentences, we achieved an accuracy of 74%. We could improve the speed by writing the code in C or C++, but that would require some extra work.

Lastly we look at how Conditional Random Fields can be used to train a POS tagger. The relation between Neural Networks/Logistic Regression and Naive Bayes is the same as the relation between CRF and HMM. So we would expect CRF to improve over HMM in much the same way that Neural Networks improve over Naive Bayes.

We use the 'sklearn_crfsuite' package in Python to train and test our POS tagger using a linear-chain CRF. We use the same 'transformDatasetSequence' function to generate the features and labels for each word and each sentence as in the HMM.

def trainCRF(trainFeatures, trainLabels):
    crf = sklearn_crfsuite.CRF(
        algorithm='lbfgs',
        c1=0.1,            # the c1/c2/max_iterations values here are illustrative
        c2=0.1,
        max_iterations=100,
        all_possible_transitions=False
    )
    crf.fit(trainFeatures, trainLabels)

    return crf

crf_model = trainCRF(trainSeqFeatures[:30000], trainSeqLabels[:30000])
pred_labels = crf_model.predict(testSeqFeatures)

The parameters of the algorithm are as follows :

  • 'algorithm' refers to the optimization technique used to minimize the log-linear loss function and compute the feature weights. 'lbfgs' is almost always the preferred choice for log-linear models with a relatively small number of training examples.
  • 'c1' refers to the constant for the L1 regularization term.
  • 'c2' refers to the constant for the L2 regularization term.
  • 'max_iterations' refers to the maximum number of iterations of the optimization algorithm. If the algorithm does not converge before 'max_iterations', we stop optimizing further.
  • 'all_possible_transitions' set to False implies that we only learn weights for transitions which are present in the training sentences, whereas setting it to True will compute weights for all possible transitions. If set to True, then for all invalid transitions (transitions not observed in the training data) the weights would be negative, and the weights of the valid transitions would adjust accordingly.

More parameter definitions can be found in the docs for 'crfsuite'.

With 'all_possible_transitions' set to False, we obtain an accuracy of 90.2% on the test sentences, whereas when it is set to True, we obtain an accuracy of 92.6% (both considerably higher than the HMM).

In general, CRF has the following advantages over HMM :

  • CRF is a discriminative algorithm, whereas HMM is a generative algorithm.
  • CRF training computes a globally optimal solution (its loss function is convex), whereas EM training of an HMM may converge to a local optimum of the loss function.
  • The transition probabilities in an HMM are computed based only on consecutive POS tag transitions in the training data (the Markov assumption), whereas in a CRF the feature functions can take arbitrary transitions into account. The notes here are quite helpful for understanding features in log-linear models.

Instead of assuming that the current tag is dependent only on the previous tag in HMM, we can also consider the previous two tags as in the tri-gram HMM.

All the code displayed here can be found in my GitHub profile.

