Stokastik

Machine Learning, AI and Programming

Building an Incremental Named Entity Recognizer System

In the last post, we saw how to train a system to identify Part-Of-Speech tags for words in sentences. In essence, we found that discriminative models such as Neural Networks and Conditional Random Fields outperform other methods by 5-6% in prediction accuracy. In this post, we will look at another common problem in Natural Language Processing, known as Named Entity Recognition (NER for short). The problem is to identify words or phrases in documents, reviews, tweets etc. that point to some entity. An entity could be a person, a location, an organization, an event and so on.

For example, in the sentence: "Christian Bale acted as the Batman and Heath Ledger as the Joker in the movie The Dark Knight"

The named entities would be "Christian Bale" (Person), "Batman" (Character), "Heath Ledger" (Person), "Joker" (Character), and "Dark Knight" (Movie).

Specific problems will have their own set of entities (e.g. Amazon reviews would have Buyer Name and Product Name as entities, while travel reviews on TripAdvisor would have Country Name, City, Sightseeing, Activity etc. as entities).

Extracting entities helps us formulate relations between different entities and answer questions such as "Who portrayed the character of Joker in the movie Dark Knight?" or "What are the best activities for a solo traveller in Thailand?" and so on. These are just toy examples, but the same methodology of extracting relations helps drive traffic in user-content-based websites such as TripAdvisor, Quora, IMDB etc.

We will use the Python NLTK library to train and test our own NER system. We will also look into how we can train and test our NER system in an incremental manner, i.e. real-time incremental training with only a few examples at a time, using the out-of-core learning method in the scikit-learn package.

Let's look at the inbuilt 'ne_chunk' in NLTK to see what NER output it generates for the above sentence.

from nltk import word_tokenize, pos_tag, ne_chunk

sentence = "Christian Bale acted as the Batman and Heath Ledger as the Joker in the movie The Dark Knight"

print ne_chunk(pos_tag(word_tokenize(sentence)))
(S
  (GPE Christian/JJ)
  (PERSON Bale/NNP)
  acted/VBD
  as/IN
  the/DT
  (ORGANIZATION Batman/NNP)
  and/CC
  (PERSON Heath/NNP Ledger/NNP)
  as/IN
  the/DT
  Joker/NNP
  in/IN
  the/DT
  movie/NN
  The/DT
  (ORGANIZATION Dark/NNP Knight/NNP))

Note that the word 'Christian' is assigned GPE (Geo-Political Entity). Probably it has been mistaken for "Christianity", the religion. Similarly, 'Batman' and 'Dark Knight' are assigned ORGANIZATION, probably because the training corpus did not have tags for movie names or movie characters (an example of problem-specific tags).

The obvious place to start is to use the Conditional Random Fields approach from the previous post on an already NER-labelled corpus. The CoNLL 2002 corpus is an NER-labelled corpus for the Spanish and Dutch languages (already split into train and test sets). The only difference from the Brown corpus is that each annotated word is a 3-tuple (word, POS tag, NER tag).
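
To get a feel for the data layout, here is a small sketch (not from the original post) that peeks at the first few tokens of one training sentence; the exact tokens shown in the comment are only indicative:

import nltk

# Each token in the CoNLL 2002 corpus is a (word, POS tag, IOB NER tag) triple.
train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train'))
print(train_sents[0][:3])
# e.g. [(u'Melbourne', u'NP', u'B-LOC'), (u'(', u'Fpa', u'O'), (u'Australia', u'NP', u'B-LOC')]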

The following functions define the features we extract in order to train the NER recognizer.

import re
import nltk
import sklearn_crfsuite
from nltk.stem.snowball import SnowballStemmer

def wordShapeFeatures(word, wordId):
    return {
        wordId + '_is_title': word.istitle(),
        wordId + '_is_lower': word.islower(),
        wordId + '_is_upper': word.isupper(),
        wordId + '_is_digit': word.isdigit(),
        wordId + '_is_camelcase': re.match(r'[A-Z][a-z]+[A-Z][a-z]+[A-Za-z]*$', word) is not None,
        wordId + '_is_abbv': re.match(r'[A-Za-z0-9]+\.[A-Za-z0-9\.]+\.$', word) is not None,
        wordId + '_has_hyphen': re.match(r'[A-Za-z0-9]+\-[A-Za-z0-9\-]+.*$', word) is not None,
    }

def wordProperties(word, wordId):

    stemmer = SnowballStemmer('english')

    return {
        wordId + '_stemmed': stemmer.stem(word),
        wordId + '_prefix-1': word[0],
        wordId + '_prefix-2': word[:2],
        wordId + '_prefix-3': word[:3],
        wordId + '_suffix-1': word[-1],
        wordId + '_suffix-2': word[-2:],
        wordId + '_suffix-3': word[-3:],
        wordId + '_lower': word.lower(),
    }

def features(sentence, index):

    currWord = sentence[index][0]

    if (index > 0):
        prevWord = sentence[index - 1][0]
        prevTag = sentence[index - 1][1]
    else:
        prevWord = '<START>'
        prevTag = '<START_TAG>'

    if (index < len(sentence)-1):
        nextWord = sentence[index + 1][0]
        nextTag = sentence[index + 1][1]
    else:
        nextWord = '<END>'
        nextTag = '<END_TAG>'

    currWordShape = wordShapeFeatures(currWord, 'currWd')
    prevWordShape = wordShapeFeatures(prevWord, 'prevWd')
    nextWordShape = wordShapeFeatures(nextWord, 'nextWd')

    currWordProp = wordProperties(currWord, 'currWd')
    prevWordProp = wordProperties(prevWord, 'prevWd')
    nextWordProp = wordProperties(nextWord, 'nextWd')

    outFeatures = {
        'word': currWord,
        'is_first': index == 0,
        'is_last': index == len(sentence) - 1,
        'prev_word': prevWord,
        'next_word': nextWord,
        'prev_tag': prevTag,
        'next_tag': nextTag,
    }

    outFeatures.update(currWordShape)
    outFeatures.update(prevWordShape)
    outFeatures.update(nextWordShape)

    outFeatures.update(currWordProp)
    outFeatures.update(prevWordProp)
    outFeatures.update(nextWordProp)

    return outFeatures
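
As a quick, hedged sanity check (not part of the original post), here is what a few of the feature values look like for the first word of a small hand-made (word, POS, NER) sentence:

sample_sent = [('Christian', 'NNP', 'B-PER'), ('Bale', 'NNP', 'I-PER'), ('acted', 'VBD', 'O')]
feats = features(sample_sent, 0)

print(feats['word'])               # Christian
print(feats['is_first'])           # True
print(feats['next_word'])          # Bale
print(feats['currWd_is_title'])    # True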

def transformDatasetSequence(sentences):
    wordFeatures, wordLabels = [], []

    for sent in sentences:
        feats, labels = [], []

        for index in range(len(sent)):
            feats.append(features(sent, index))
            labels.append(sent[index][2])

        wordFeatures.append(feats)
        wordLabels.append(labels)

    return wordFeatures, wordLabels

def trainCRF(trainFeatures, trainLabels):
    crf = sklearn_crfsuite.CRF(
        algorithm='lbfgs',
        c1=0.1,
        c2=0.1,
        max_iterations=100,
        all_possible_transitions=True,
    )
    crf.fit(trainFeatures, trainLabels)

    return crf

train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train'))
test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb'))

trainSeqFeatures, trainSeqLabels = transformDatasetSequence(train_sents)
testSeqFeatures, testSeqLabels = transformDatasetSequence(test_sents)

crf_model = trainCRF(trainSeqFeatures, trainSeqLabels)
pred_labels = crf_model.predict(testSeqFeatures)

We obtained an accuracy of 97.15% on the test set.
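
A hedged sketch of how this accuracy can be computed with sklearn_crfsuite's flat (token-level) metrics, assuming that is how the number above was measured:

from sklearn_crfsuite import metrics

print(metrics.flat_accuracy_score(testSeqLabels, pred_labels))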

If we look at the probability distribution of the NER tags in the training and testing data of the above corpus:

  Tag        Train     Test
  O          0.8761    0.8801
  B-LOC      0.0186    0.0210
  I-LOC      0.0071    0.0063
  B-ORG      0.0279    0.0272
  I-ORG      0.0189    0.0214
  B-PER      0.0163    0.0143
  I-PER      0.0147    0.0123
  B-MISC     0.0082    0.0066
  I-MISC     0.0121    0.0108
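
For reference, a minimal sketch (not from the original post) of how such a distribution can be computed from the IOB-tagged sentences:

from collections import Counter

def tagDistribution(sentences):
    # Count the NER tag of every (word, POS, NER) triple and normalize to fractions.
    counts = Counter(ner for sent in sentences for (word, pos, ner) in sent)
    total = float(sum(counts.values()))
    return {tag: count / total for tag, count in counts.items()}

print(tagDistribution(train_sents))
print(tagDistribution(test_sents))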

We see that the 'others' tag 'O' accounts for about 88% of the words in both the training and testing data, i.e. if we predicted 'O' for every word in the testing data, we would still obtain about 88% accuracy. Hence we need to use different metrics to evaluate the performance.

Let's remove the 'O' tag and evaluate the performance on the remaining tags using the weighted F1-score.

from sklearn_crfsuite import metrics

crf_model = trainCRF(trainSeqFeatures, trainSeqLabels)

labels = list(crf_model.classes_)
labels.remove('O')

pred_labels = crf_model.predict(testSeqFeatures)

print metrics.flat_f1_score(testSeqLabels, pred_labels, average='weighted', labels=labels)

The weighted average F1-score is 80% (considerably less than the accuracy score).

Let's look at the precision, recall and F1-score values for each tag (excluding 'O'):

sorted_labels = sorted(labels, key=lambda name: (name[1:], name[0]))

print metrics.flat_classification_report(testSeqLabels, pred_labels, labels=sorted_labels, digits=3)
             precision    recall  f1-score   support

      B-LOC      0.805     0.784     0.794      1084
      I-LOC      0.669     0.671     0.670       325
     B-MISC      0.670     0.546     0.602       339
     I-MISC      0.662     0.627     0.644       557
      B-ORG      0.827     0.844     0.835      1400
      I-ORG      0.854     0.793     0.822      1104
      B-PER      0.855     0.893     0.874       735
      I-PER      0.890     0.946     0.917       634

avg / total      0.806     0.795     0.800      6178

Although we have trained an NER model, it was trained on a Spanish corpus. There is no NER-annotated English corpus available in the NLTK library, so we will use the freely downloadable GMB dataset to train and test our NER system (in English) in all our future work.

The following function reads the annotated sentences from the GMB data:

import os

def read_gmb(corpus_root):
    ner_tagged_sentences = []
    for root, dirs, files in os.walk(corpus_root):
        for filename in files:
            if filename.endswith(".tags"):
                with open(os.path.join(root, filename), 'rb') as file_handle:
                    file_content = file_handle.read().decode('utf-8').strip()
                    annotated_sentences = file_content.split('\n\n')
                    for annotated_sentence in annotated_sentences:
                        annotated_tokens = [seq for seq in annotated_sentence.split('\n') if seq]

                        standard_form_tokens = []

                        for idx, annotated_token in enumerate(annotated_tokens):
                            annotations = annotated_token.split('\t')
                            word, tag, ner = annotations[0], annotations[1], annotations[3]

                            if ner != 'O':
                                ner = ner.split('-')[0]

                            if tag in ('LQU', 'RQU'):
                                tag = "``"

                            standard_form_tokens.append((word, tag, ner))

                        conll_tokens = to_conll_iob(standard_form_tokens)
                        ner_tagged_sentences.append(conll_tokens)

    return ner_tagged_sentences

The NER tags for each word in the corpus are very 'deep'; for example, if 'per' is the high-level NER tag for a person, then in the corpus 'per-ini' refers to the initial of a person's name, 'per-mid' refers to the middle name of a person and so on. With so many unique tags, it is difficult to train a tagger to learn features for each tag well. Thus we eliminate the sub-tags and only use the high-level tags, i.e. 'per-ini' and 'per-mid' will both be mapped to the 'per' tag.

The NER tags in the corpus are represented as 'per' (person), 'geo' (location), 'gpe' (geo-political entity), 'eve' (event) and so on. The standard way to represent NER tags in a corpus is either an ne-chunk tree or IOB tags. Since the code written above for training and testing NER tags on the CoNLL 2002 corpus uses the IOB tagging scheme, we will convert the GMB dataset tags into the same IOB format.

In IOB format, each NER tag is prefixed with either 'B-' or 'I-', where 'B-per' marks the first word of a person entity and 'I-per' marks the remaining words of the same entity. For example, in the sentence "The author of this book is John Smith Wayne", "John" would be assigned the 'B-per' tag and "Smith" and "Wayne" the 'I-per' tags. All non-NER words are assigned the 'O' (others) tag. The function below converts the GMB tags to the IOB format.

def to_conll_iob(annotated_sentence):
    proper_iob_tokens = []
    for idx, annotated_token in enumerate(annotated_sentence):
        word, tag, ner = annotated_token

        if ner != 'O':
            if idx == 0:
                ner = "B-" + ner
            elif annotated_sentence[idx - 1][2] == ner:
                ner = "I-" + ner
            else:
                ner = "B-" + ner
        proper_iob_tokens.append((word, tag, ner))
    return proper_iob_tokens
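
As a quick, hedged sanity check (not part of the original post), applying 'to_conll_iob' to the example sentence above, with made-up POS tags, gives the expected B-/I- prefixes:

example = [('The', 'DT', 'O'), ('author', 'NN', 'O'), ('of', 'IN', 'O'), ('this', 'DT', 'O'),
           ('book', 'NN', 'O'), ('is', 'VBZ', 'O'), ('John', 'NNP', 'per'),
           ('Smith', 'NNP', 'per'), ('Wayne', 'NNP', 'per')]

print(to_conll_iob(example))
# [..., ('John', 'NNP', 'B-per'), ('Smith', 'NNP', 'I-per'), ('Wayne', 'NNP', 'I-per')]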

Next we divide the entire corpus into train-test (by a 70-30 split) and then train and evaluate the performance of the tagger on the test data.

corpus_root = "gmb-2.2.0"

sentences = read_gmb(corpus_root)

size = int(len(sentences) * 0.7)

train_sents = sentences[:size]
test_sents = sentences[size:]

trainSeqFeatures, trainSeqLabels = transformDatasetSequence(train_sents)
testSeqFeatures, testSeqLabels = transformDatasetSequence(test_sents)

crf_model = trainCRF(trainSeqFeatures, trainSeqLabels)

labels = list(crf_model.classes_)
labels.remove('O')

pred_labels = crf_model.predict(testSeqFeatures)

print metrics.flat_f1_score(testSeqLabels, pred_labels, average='weighted', labels=labels)

sorted_labels = sorted(labels, key=lambda name: (name[1:], name[0]))
print metrics.flat_classification_report(testSeqLabels, pred_labels, labels=sorted_labels, digits=3)

We obtain a weighted F1-score of 95% with all tags and 84.3% on the non-'O' tags. The following are the precision, recall and F1-score values for each NER tag (excluding 'O').

             precision    recall  f1-score   support

      B-art      0.247     0.147     0.184       143
      I-art      0.167     0.148     0.157        88
      B-eve      0.554     0.376     0.448       109
      I-eve      0.417     0.278     0.333        90
      B-geo      0.857     0.895     0.876     14783
      I-geo      0.792     0.804     0.798      2748
      B-gpe      0.959     0.952     0.955      6157
      I-gpe      0.803     0.679     0.736        78
      B-nat      0.760     0.365     0.494        52
      I-nat      0.875     0.467     0.609        15
      B-org      0.785     0.719     0.751      8073
      I-org      0.808     0.776     0.792      6746
      B-per      0.834     0.819     0.827      6564
      I-per      0.840     0.889     0.864      6609
      B-tim      0.921     0.875     0.897      7919
      I-tim      0.837     0.723     0.776      2587

avg / total      0.849     0.838     0.843     62761

Next we will look at how we can learn an NER model incrementally. Incremental learning is useful when either the data to be used for training cannot be loaded into memory all at once, or the data arrives as an endless stream, as in a production pipeline. In both cases, neither all the training data nor all the important features are available at once, and we need special algorithms that can build the model incrementally.

Another advantage of incremental learning is that we can start to learn a model with very few examples (either one example at a time or a batch of examples).

Not all classification or sequence-classification algorithms are capable of incremental learning. Some of the common algorithms used for incremental or online learning are the SGD classifier, the online Perceptron, the Passive Aggressive classifier, Naive Bayes, neural networks trained with SGD, and so on. From this tutorial on out-of-core learning with incremental classifiers, we see that the Passive Aggressive classifier performs comparatively better than the rest.
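
To make the incremental API concrete, here is a toy sketch (not from the original post, and unrelated to the NER data) of how scikit-learn's 'partial_fit' is typically used:

import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

clf = PassiveAggressiveClassifier()
all_classes = np.array([0, 1])

for _ in range(3):
    X_batch = np.random.rand(10, 4)           # pretend these are the features of a fresh mini-batch
    y_batch = np.random.randint(0, 2, 10)     # and these are its labels
    clf.partial_fit(X_batch, y_batch, classes=all_classes)   # the full class list is required on the first call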

First we modify the 'read_gmb' function to return a generator instead of returning all the annotated sentences at once. The generator fetches annotated sentences one at a time, as and when required (lazy loading), either during some operation or while iterating over the generator:

def read_gmb(corpus_root):
    for root, dirs, files in os.walk(corpus_root):
        for filename in files:
            if filename.endswith(".tags"):
                with open(os.path.join(root, filename), 'rb') as file_handle:
                    file_content = file_handle.read().decode('utf-8').strip()
                    annotated_sentences = file_content.split('\n\n')
                    for annotated_sentence in annotated_sentences:
                        annotated_tokens = [seq for seq in annotated_sentence.split('\n') if seq]

                        standard_form_tokens = []

                        for idx, annotated_token in enumerate(annotated_tokens):
                            annotations = annotated_token.split('\t')
                            word, tag, ner = annotations[0], annotations[1], annotations[3]

                            if ner != 'O':
                                ner = ner.split('-')[0]

                            if tag in ('LQU', 'RQU'):
                                tag = "``"

                            standard_form_tokens.append((word, tag, ner))

                        conll_tokens = to_conll_iob(standard_form_tokens)
                        yield conll_tokens

Next we define the functions to extract training sentences from the 'read_gmb' generator, one batch at a time; the 'batch_size' parameter controls this. Smaller batch sizes result in faster incremental training but small or no visible improvement in model performance between two consecutive batch updates, whereas larger batch sizes result in slower training but comparatively greater improvement with each batch.

import itertools

def get_minibatch(sentences, batch_size):
    batch = list(itertools.islice(sentences, batch_size))
    wordFeatures, wordLabels = transformDataset(batch)

    return wordFeatures, wordLabels

def iter_minibatches(sentences, batch_size):
    wordFeatures, wordLabels = get_minibatch(sentences, batch_size)

    while len(wordFeatures):
        yield wordFeatures, wordLabels
        wordFeatures, wordLabels = get_minibatch(sentences, batch_size)

Note that the 'iter_minibatches' function also returns a generator, i.e. we fetch the features and the labels for the words in the training sentences as and when they are required by some operation.

The 'transformDataset' function generates the features and labels for each word.

def transformDataset(sentences):
    wordFeatures, wordLabels = [], []

    for sent in sentences:
        for index in range(len(sent)):
            wordFeatures.append(features(sent, index))
            wordLabels.append(sent[index][2])

    return wordFeatures, wordLabels
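
A hedged usage sketch (not from the original post): pull the first mini-batch of 500 sentences from the generator and see how many word-level training examples it yields.

sentences = read_gmb(corpus_root)
batches = iter_minibatches(sentences, batch_size=500)

wordFeatures, wordLabels = next(batches)
print(len(wordFeatures))   # number of words in the first 500 sentences
print(len(wordLabels))     # one label per word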

Finally we come to the main function for incremental training with the Passive Aggressive classifier. Instead of the DictVectorizer used earlier for POS-tag training, we use FeatureHasher, which is much faster than DictVectorizer since it hashes the features into a fixed, small number of dimensions (5000 in our case), with only a small hit on model performance.
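
As a minimal illustration (with made-up feature dictionaries, not from the original post) of what FeatureHasher does:

from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=5000)

# Dictionaries of mixed string/boolean features are hashed into a fixed-width sparse matrix,
# so no vocabulary has to be kept (or updated) in memory.
X = hasher.transform([{'word': 'Christian', 'is_first': True},
                      {'word': 'Bale', 'is_first': False}])
print(X.shape)   # (2, 5000)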

from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import Pipeline

def trainOnline(train_sents, tags, batch_size=500):
    minibatch_iterators = iter_minibatches(train_sents, batch_size)

    hasher = FeatureHasher(n_features=5000)

    clf = PassiveAggressiveClassifier()

    for i, (trainFeatures, trainLabels) in enumerate(minibatch_iterators):

        trainFeatures = hasher.transform(trainFeatures)
        clf.partial_fit(trainFeatures, trainLabels, tags)

        yield Pipeline([('hasher', hasher), ('classifier', clf)])

The 'partial_fit' method is used to fit a classifier incrementally. For each batch of training sentences we yield a pipeline of the 'hasher' and the 'classifier', so 'trainOnline' itself returns a generator that will subsequently be iterated over in the testing function. We compute the F1-score of each incremental model returned by this generator on a hold-out set of 5000 test sentences.

from sklearn.metrics import f1_score

def testOnline(onlineModel, test_sents):
    testFeatures, testLabels = transformDataset(test_sents)

    for model in onlineModel:
        predLabels = model.predict(testFeatures)

        print f1_score(testLabels, predLabels, average='weighted')

    return 1

sentences = read_gmb(corpus_root)

train_sents = itertools.islice(sentences, 50000)
test_sents = itertools.islice(sentences, 5000)

# 'partial_fit' needs the complete list of possible classes up front; the high-level GMB
# tags seen in the report above (art, eve, geo, gpe, nat, org, per, tim), in IOB form plus 'O', cover them.
base_tags = ['art', 'eve', 'geo', 'gpe', 'nat', 'org', 'per', 'tim']
tags = ['O'] + ['B-' + t for t in base_tags] + ['I-' + t for t in base_tags]

clf = trainOnline(train_sents, tags, batch_size=500)

testOnline(clf, test_sents)

The first 50,000 sentences from the GMB corpus are used for training and the next 5000 for testing. Instead of loading all 50,000 sentences into memory and training a classifier on them at once, we load 100 batches of 500 sentences each and train an incremental classifier.

After the first batch, we have a model trained on 500 sentences; after the second batch, we have a model "as if it were trained" on 500x2=1000 sentences, although we never re-train the model on all 1000 sentences at once. Similarly, after the third batch we have a model "as if it were trained" on 500x3=1500 sentences, and so on.

Let's look at how the incremental model performs on the test set of 5000 sentences (the weighted F1-scores from the first 15 models, i.e. the last number in the list below comes from a model that has effectively seen 500*15=7500 sentences):

0.935911237291
0.933770006064
0.940618897653
0.939038857781
0.942921124201
0.943103841306
0.942573173924
0.940521028049
0.942295129288
0.94602997944
0.944351132335
0.946048178805
0.945906296541
0.944790869535
0.946004710563

Notice that initially the F1-score increases monotonically, but after a certain point it starts to oscillate around a particular value (saturation). In our example, it oscillates around an F1-score of 94.5%. This also shows that even if all the training data could fit into memory, only a fraction of it is enough to achieve the desired level of performance.

Training the CRF with the first 50,000 sentences and testing on the next 5000 gives an F1-score of 96.2% with the 'O' tags and 91% without them. While the F1-score of the incremental classifier including the 'O' tags is around 94.5%, its F1-score without the 'O' tags is around 73-74%, much less than the 91% obtained with the CRF.

This implies there is scope for improving the performance of the incremental NER recognizer, although note that an incremental classifier should intuitively perform somewhat worse than a classifier that "sees" all the data at once.

The entire code is hosted on my Github profile.

Some possible applications of NER in our day-to-day internet applications are:

  • Natural Language Search: When you type "Photos of my Friends who work at Microsoft" into the Facebook search bar, you see the photos of your friends who work at Microsoft. How does the system know what to show in the search results? Behind the scenes, the system extracts the named entities from the search query and then, using the relations mapped between those entities (fetched from relational databases like MySQL), shows you the results. There is a very nice post on how natural language search works under the hood at Facebook.
  • Automatic Hyperlinking: For example, one might write the following review for Vietnam on TripAdvisor: "The Halong Bay tour was especially memorable. We took photos at the Cu-Chi tunnels". An NER system could identify the entities "Halong Bay", "Cu-Chi tunnels" or "photos at the Cu-Chi tunnels" and link these texts back to the overview pages of Halong Bay or the Cu-Chi tunnels, or to the photos page of the Cu-Chi tunnels. This could help users reading the reviews to navigate better within a particular "place of interest".
  • Identifying named entities as features for classification could help reduce the feature space, since it removes unwanted features such as stop-words and POS categories like adverbs and adjectives. But before extracting the named entities as features, we need an NER model trained on a similar domain of text documents. For example, an NER model trained on Shakespeare's works will not work well on financial documents.

Categories: MACHINE LEARNING, PROGRAMMING
