
Understanding Conditional Random Fields

Sequence modeling and prediction is a very common data mining task. The most common use case for sequence modeling is in NLP (POS-Tagging and Named-Entity Recognition). NER has many possible variants depending on the named entities, such as identifying city names, country names, pin codes etc. in addresses, or identifying movie names, actors, genres etc. in movie reviews, and so on. Apart from NER, some other use cases of sequence modeling could be predicting whether a customer will become a premium customer, predicting whether a stock price will go up or down, suggesting spelling corrections, facial recognition and so on.

Sequence modeling differs from normal classification in that, in classification, each observation or instance is predicted independently of the other observations or instances, whereas in sequence prediction, each instance is predicted based on the predictions for surrounding instances. For example, consider the task of classifying an email as spam or not spam. Whether an email in your mailbox is spam or not is most likely independent of whether the previous email was spam or not, so there is not much sequential relation between the emails in your mailbox.

But to predict which words in a movie review are part of a movie name, the surrounding words become important, especially when the movie name has not been encountered during training (i.e. the word was not seen by the model) or when there is ambiguity between multiple possible entities (for example, the word "Rings" could refer to the horror series "Rings" or to "Lord of the Rings"). In such scenarios the surrounding words help to disambiguate between multiple entities. Similarly, predicting whether a customer will become a premium customer depends on his/her previous transactions, predicting whether a stock price will go up or down depends on the moving averages of the stock price, and so on.

Most articles on CRF start with a comparison of CRF with Naive Bayes, the Maximum Entropy classifier (Logistic Regression) and HMM. The comparison happens along two different dimensions : discriminative vs. generative, and classifier vs. sequence predictor. Naive Bayes is a "generative" "classifier", Max-Ent is a "discriminative" "classifier", HMM is a "generative" "sequence predictor" while CRF is a "discriminative" "sequence predictor". So why are only these 4 algorithms chosen ?

Naive Bayes is a generative classifier, i.e. it models the joint probability distribution of the inputs X with the output class labels Y :

p(Y, X)=p(Y)\prod_{i=1}^mp(x_i|Y)

where x_i are the features (words, N-grams etc.) present in the input instance X.

To compute the likelihood of each class label given an unlabelled instance X', we compute :

p(Y=y|X')=\frac{P(Y=y, X')}{P(X')}

P(X') is a normalization constant and can be neglected while comparing the class labels for the same instance X'.

So what are the drawbacks of Naive Bayes ?

Naive Bayes works on the assumption that, each observation (feature) x_i is independent of other features given the class label Y. In the equation for the joint probability, with 4 features x_1, x_2, x_3, x_4, without any independence assumption, one could write the joint probability as :

p(Y, X)=p(Y, x_1, x_2, x_3, x_4)=p(x_1|Y,x_2, x_3, x_4)*p(x_2|Y,x_3, x_4)*p(x_3|Y,x_4)*p(x_4|Y)*p(Y), using the chain rule of probability

Assuming that each feature is independent of other features given class label Y, we have :

p(Y, X)=p(Y, x_1, x_2, x_3, x_4)=p(x_1|Y)*p(x_2|Y)*p(x_3|Y)*p(x_4|Y)*p(Y)

The assumption is made in order to make the computation of p(Y, X) easier. To compute p(x_1|Y) we only need to count the number of training instances belonging to class Y that contain the feature x_1 and divide that by the number of instances belonging to class Y. We can cache these counts once from the training data and re-use them every time, or update them in real time.
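As a minimal sketch of this counting (the toy documents, feature names and smoothing constant below are assumed purely for illustration) :

```python
from collections import defaultdict

# Toy training data: each instance is (set of features, class label).
# The documents and feature names here are made up purely for illustration.
train = [({"cheap", "drugs", "offer"}, "spam"),
         ({"drugs", "doctor"}, "not_spam"),
         ({"cheap", "win"}, "spam"),
         ({"meeting", "doctor"}, "not_spam")]

class_counts = defaultdict(int)    # number of instances per class Y
feature_counts = defaultdict(int)  # number of instances of class Y containing feature x

for features, label in train:
    class_counts[label] += 1
    for f in features:
        feature_counts[(f, label)] += 1

def p_feature_given_class(f, y, alpha=1.0, vocab_size=6):
    # p(x|Y) estimated by counting, with simple Laplace smoothing so that
    # unseen features do not zero out the product
    return (feature_counts[(f, y)] + alpha) / (class_counts[y] + alpha * vocab_size)

print(p_feature_given_class("drugs", "spam"))  # counts cached once, reused at prediction time
```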

But if we work without the assumption, then to compute p(x_1|Y, x_2, x_3, x_4) we would need to count the training instances belonging to class Y that contain all the features x_2, x_3, x_4 along with x_1. This computation would not be a one-time task but would need to be re-computed for every test instance, depending on which features are present in the instance.

This "naive" assumption may not work always as the probability of the word "drugs" in a spam e-mail is increased by the presence of other features like "cheap", all capitalized words in a paragraph etc. A non-spam email can have the word "drugs" in a completely valid context, like the mail from your family doctor about your medication.

In discriminative algorithms, we do not need to make any such independence assumptions about the individual features or observations, since we model the conditional probability of the labels directly instead of the distribution of the features.

When the Naive Bayes algorithm is extended to a sequence of observations, we get the HMM algorithm. HMM is also a generative algorithm and makes the same independence assumption, that given the label for the current observation, the current observation is independent of all other observations. But whereas in Naive Bayes, we had one output label for multiple features (or observations), in HMM, there is one label corresponding to one observation or feature.

(Note that in classification, each training instance is a collection of features i.e. set of all words in a document and the class label for that document e.g. an email is a set of words but the entire email is classified as spam or not spam, not the individual words, whereas in sequence prediction, each training instance is a word and the corresponding label for that word, e.g. each word in a movie review can be a movie name or actor name or other entities).

In HMM, the sequence information is encoded in the transition probabilities, i.e. probability of transitioning from one label to another label, and by the Markov assumption, each label is independent of all earlier labels in the sequence given only the previous label.

Thus, given 4 observations x_1, x_2, x_3, x_4, each observation now has its own label y_1, y_2, y_3, y_4, and the joint probability is given as :

p(Y, X)=p(y_1, y_2, y_3, y_4, x_1, x_2, x_3, x_4)=[p(y_1)*p(x_1|y_1)]*[p(y_2|y_1)*p(x_2|y_2)]*[p(y_3|y_2)*p(x_3|y_3)]*[p(y_4|y_3)*p(x_4|y_4)]

where the probabilities p(y_i|y_{i-1}) are called the transition probabilities and p(x_i|y_i) are called the emission probabilities.
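A minimal sketch of evaluating this factorization, assuming the transition and emission probabilities have already been estimated (the probability values below are placeholders, not learned estimates) :

```python
# Joint probability p(Y, X) of a tagged sequence under an HMM, following the
# factorization p(y_1)*p(x_1|y_1) * prod_j p(y_j|y_{j-1})*p(x_j|y_j).
# start_p, trans_p and emit_p are assumed pre-estimated; the values are illustrative only.
start_p = {"NOUN": 0.6, "VERB": 0.4}
trans_p = {("NOUN", "NOUN"): 0.3, ("NOUN", "VERB"): 0.7,
           ("VERB", "NOUN"): 0.6, ("VERB", "VERB"): 0.4}
emit_p = {("NOUN", "dogs"): 0.2, ("VERB", "bark"): 0.3, ("NOUN", "bark"): 0.05}

def hmm_joint_probability(words, tags):
    # start with p(y_1) * p(x_1|y_1), then multiply transition * emission for each position
    prob = start_p[tags[0]] * emit_p.get((tags[0], words[0]), 1e-6)
    for j in range(1, len(words)):
        prob *= trans_p[(tags[j - 1], tags[j])] * emit_p.get((tags[j], words[j]), 1e-6)
    return prob

print(hmm_joint_probability(["dogs", "bark"], ["NOUN", "VERB"]))
```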

For example, if [x_1, x_2, x_3, x_4, x_5] are the words of a five-word sentence, then we could have the following POS tags :

[y_1, y_2, y_3, y_4, y_5]=[\text{NOUN, NOUN, VERB, ADVERB, ADJECTIVE}]

HMM also suffers from reduced performance due to the independence assumption similar to Naive Bayes.

The Maximum Entropy classifier is the discriminative version of the Naive Bayes classifier. In the maximum entropy classifier, we do not model the joint probability distribution of the inputs and the outputs; instead we directly model the conditional probability of the output class labels Y given the inputs X, i.e. p(Y|X) :

p(Y|X)=\frac{exp(\sum_{i=1}^m{\lambda}_if_i(X, Y))}{\sum_{Y'}exp(\sum_{i=1}^m{\lambda}_if_i(X, Y'))}

The functions f_i(X, Y) are the feature functions and {\lambda}_i are the feature weights. This is very similar to logistic regression. In fact for a binary classification (0-1) problem, the above conditional probability reduces to the familiar representation for logistic regression :

p(Y=1|X)=\frac{1}{1+exp(-\sum_{i=1}^m{\lambda}_if_i(X, Y))}

In text classification, the feature functions f_i(X, Y) are just the term frequency (TF) or the TF-IDF values for each feature x_j (i.e. words, N-grams) in the input instance X. In NLP tasks, however, the feature functions can represent arbitrary feature values, for example, in a movie review :

f_i(X, Y)=1 if X="Matrix Revolutions" and Y=MOVIE_NAME else f_i(X, Y)=0

f_i(X, Y)=1 if X="Tom Cruise" and Y=ACTOR_NAME else f_i(X, Y)=0

CRFs use a similar notation for feature functions as above. The relation between the Max-Ent classifier and CRF is the same as the relation between Naive Bayes and HMM, except that Max-Ent and CRF are both discriminative algorithms. The difference between CRF and Maximum Entropy is that Max-Ent is a classifier whereas CRF is a sequence predictor, and hence the feature functions for CRF take into account previous values of Y in the observed sequence, i.e. for a linear chain CRF,

f_i(X, y_j, y_{j-1}, j) is a feature function representation where 'j' is the current position of the observation.

f_i(X, y_j, y_{j-1}, j)=1 if x_j="Bangalore" and y_j=CITY_NAME and y_{j-1}=STREET_NAME else f_i(X, y_j, y_{j-1}, j)=0

The above could be a feature for identifying city names and street names in addresses, because the city name usually comes after the street name in an address.
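As an illustration, such an indicator feature function can be written as a plain function; the token and tag names follow the feature above :

```python
def f_city_after_street(x, y_j, y_prev, j):
    # 1 if the j-th token is "Bangalore", labelled CITY_NAME, and the previous label is STREET_NAME
    return 1.0 if x[j] == "Bangalore" and y_j == "CITY_NAME" and y_prev == "STREET_NAME" else 0.0
```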

My current address is "#108, RAJ PARADISE, BILEKAHALLI, BANGALORE, 560076", for which the NER labeling would be :

[x_1, x_2, x_3, x_4, x_5, x_6]=[\text{'#108', 'RAJ', 'PARADISE', 'BILEKAHALLI', 'BANGALORE', '560076'}]

[y_1, y_2, y_3, y_4, y_5, y_6]=[\text{HS_NUM, HS_NAME, HS_NAME, STREET, CITY, CODE}]

The objective of the Max-Ent classifier is to maximize the conditional entropy of the class labels given the inputs. For an m-class, n-feature classification problem, the conditional entropy is given as :

H(y_1, y_2,..., y_m|x_1, x_2, ..., x_n)=-\sum_{i=1}^m\sum_{j=1}^np(y_i, x_j)*log(p(y_i|x_j))

Why does maximizing the conditional entropy make sense for a classifier ?

Suppose that for a 4-class (A, B, C and D) classification problem, all we know about the feature x_1 is that 40% of the training instances containing it belong to class A. During prediction, if a test instance contains the feature x_1, then the probability that the instance belongs to class A should be 40%, while the remaining 60% should be split equally among the other 3 classes, i.e. 20% each. This is the maximum entropy probability distribution consistent with what we know.

According to the Principle of Maximum Entropy, in the presence of incomplete information, the most likely class probability distribution is the one which maximizes the entropy, i.e. if a feature is not present in any of the 4 classes, then the class probability distribution w.r.t the feature should be

[0.25, 0.25, 0.25, 0.25]

Apart from maximizing the conditional entropy, we have to add the constraint that the expected value of each feature function f_i(X, Y) in the training data should be equal to the expected value of the same feature function as predicted by the model.

The expected value of a feature function f_i(X, Y), for a specific X and Y pair in the training data, is the number of training documents with label Y for which f_i(X, Y)=1, divided by the total number of documents N in the training data.

Then the expected value of the overall feature function E_1[f_i] is :

E_1[f_i]=\frac{1}{N}\sum_{X'}f_i(X', Y)

Note that we are summing over only the documents X, since each document is associated with only one class label Y in the training data.

The expected value of the feature function f_i as predicted by the model is :

E_2[f_i]=\sum_{X'}\sum_{Y'}p(X', Y')*f_i(X', Y')

where p(X', Y') is the joint probability of the input instance X' with the class label Y'. The model would associate each instance X' to all class labels Y' with certain probabilities. Thus for instance X', there would be M different values of p(X', Y') predicted by the model.

Since Max-Ent does not model p(X', Y') but only models p(Y'|X'), using the laws of probability we can write :

p(X', Y')=p(Y'|X')*p(X')

where p(X') is the probability distribution of the input samples. We can assume that p(X')=\frac{1}{N}, where N is the number of training documents.

E_2[f_i]=\frac{1}{N}\sum_{X'}\sum_{Y'}p(Y'|X')*f_i(X', Y')
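A sketch of computing the two expectations, where p_y_given_x is a stand-in for the model's current conditional distribution p(Y'|X') and p(X') is taken to be 1/N :

```python
def empirical_expectation(f, train):
    # E_1[f] = (1/N) * sum over training documents of f(X, Y) with the observed label Y
    return sum(f(x, y) for x, y in train) / len(train)

def model_expectation(f, train, labels, p_y_given_x):
    # E_2[f] = (1/N) * sum over documents X', sum over all labels Y', p(Y'|X') * f(X', Y')
    n = len(train)
    total = 0.0
    for x, _ in train:
        probs = p_y_given_x(x)  # model's conditional distribution over all class labels
        total += sum(probs[y] * f(x, y) for y in labels)
    return total / n
```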

Formulating the Lagrangian of the conditional entropy plus the constraints gives a concave function, hence the solution obtained for p(Y|X), shown below, is a globally optimal one :

p(Y|X)=\frac{exp(\sum_{i=1}^m{\lambda}_if_i(X, Y))}{\sum_{Y'}exp(\sum_{i=1}^m{\lambda}_if_i(X, Y'))}

To compute the values of the feature weights {\lambda}_i, we can use iterative optimization algorithms like gradient descent, Newton's method etc. Most articles on Max-Ent refer to the Iterative Scaling algorithm for computing the values of {\lambda}_i.

So far, we have seen how Naive Bayes, HMM and Maximum Entropy are related and what the pros and cons of each are. Most articles on CRF explain it from the perspective of probabilistic graphical models. In this post we will not go through probabilistic graphical models to explain the differences between Naive Bayes, HMM, Max-Ent and CRF, but as a general pointer on their differences : Naive Bayes and HMM (both generative) are represented using directed graphical models, whereas Max-Ent and CRF (both discriminative) are represented using undirected graphical models.

Relation between NB, HMM, MaxEnt and CRF

So Conditional Random Fields or CRF is the sequence version of the Maximum Entropy algorithm and the discriminative version of the HMM algorithm. Whereas in the Max-Ent algorithm we used the following expression for the conditional probability of the class labels given the inputs :

p(Y|X)=\frac{exp(\sum_{i=1}^m{\lambda}_if_i(X, Y))}{\sum_{Y'}exp(\sum_{i=1}^m{\lambda}_if_i(X, Y'))}

In CRF, we have one output label corresponding to each observation and the feature functions are modified to include dependency on labels from earlier observations too :

p(y_1, y_2, ..., y_n|X)=\frac{exp(\sum_{i=1}^m\sum_{j=1}^n{\lambda}_if_i(X, y_j, y_1, y_2,..., y_{j-1}, j))}{\sum_{y'}exp(\sum_{i=1}^m\sum_{j=1}^n{\lambda}_if_i(X, y'_j, y'_1, y'_2,..., y'_{j-1}, j))}

The denominator in the above expression is the normalization factor, where the outermost summation is over all possible sequences of output labels y'_1, y'_2, ..., y'_n. Note that in the Max-Ent classifier, the outermost summation was over all possible labels Y'. In classification, if there are M classes, then there are at most M possible values of Y. But in CRF, where we have a sequence of n observations, the maximum number of possible label sequences is M^n.

The feature function f_i(X, y_j, y_1, y_2,..., y_{j-1}, j) indicates that the value of the feature function for the j-th observation in the sequence can depend on all the labels from position 1 to position j. Generally the label at the j-th position depends only on the label at the (j-1)-th position (the Markov assumption), or in some cases on a subset of the labels between position 1 and position j.

In a linear chain CRF, the feature functions look like f_i(X, y_j, y_{j-1}, j), i.e. they only depend on the previous output label. This is similar to HMM, where the transition probabilities follow the Markov rule, i.e. given the previous label, the current output label is independent of all other output labels. But unlike HMM, where the emission probabilities p(x_j|y_j) assume that given the label for the current observation, the current observation is independent of the other observations, CRF does not make any such assumptions on the observations x_j.

Two different feature function configurations for CRF

For a linear chain CRF :

p(y_1, y_2, ..., y_n|X)=\frac{exp(\sum_{i=1}^m\sum_{j=1}^n{\lambda}_if_i(X, y_j, y_{j-1}, j))}{\sum_{y'}exp(\sum_{i=1}^m\sum_{j=1}^n{\lambda}_if_i(X, y'_j, y'_{j-1}, j))}
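A brute-force sketch of this conditional probability : the numerator scores one label sequence and the denominator enumerates all M^n label sequences, which is feasible only for tiny toy inputs (the feature functions, weights and the START boundary label are assumptions of this sketch) :

```python
import itertools
import math

def sequence_score(x, y_seq, feature_funcs, weights):
    # sum over positions j and features i of lambda_i * f_i(X, y_j, y_{j-1}, j);
    # a synthetic "START" label is assumed before the first position
    s = 0.0
    for j in range(len(x)):
        y_prev = y_seq[j - 1] if j > 0 else "START"
        s += sum(w * f(x, y_seq[j], y_prev, j) for f, w in zip(feature_funcs, weights))
    return s

def crf_conditional_probability(x, y_seq, labels, feature_funcs, weights):
    numerator = math.exp(sequence_score(x, y_seq, feature_funcs, weights))
    # normalizer: sum over all M^n possible label sequences (exponential, toy use only)
    z = sum(math.exp(sequence_score(x, list(y_prime), feature_funcs, weights))
            for y_prime in itertools.product(labels, repeat=len(x)))
    return numerator / z
```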

Typically most CRF packages use the L-BFGS iterative algorithm to compute the optimum values of the feature weights {\lambda}_i. The loss function for CRF is the negative log likelihood of the training sequences.

L=-\sum_{(X, Y)}log(p(Y|X))+\frac{1}{2}\sum_{i=1}^m\frac{{\lambda}^2_i}{{\sigma}^2}

L=-\sum_{(X, Y)}log(\frac{exp(\sum_{i=1}^m\sum_{j=1}^n{\lambda}_if_i(X, y_j, y_{j-1}, j))}{\sum_{y'}exp(\sum_{i=1}^m\sum_{j=1}^n{\lambda}_if_i(X, y'_j, y'_{j-1}, j))})+\frac{1}{2}\sum_{i=1}^m\frac{{\lambda}^2_i}{{\sigma}^2}

where the last term is the regularization factor for the weights {\lambda}_i. The constant {\sigma} helps to control the weights. Smaller values of {\sigma} imply that the weights should stay small and thus prevent overfitting on the training sequences, while larger values allow the weights to fit the training sequences more closely.

Note that the objective function L to optimize in case of CRF has a globally optimum solution, whereas the training algorithm used for HMM, i.e. the Baum-Welch algorithm, is based on the EM algorithm, which converges to locally optimal solutions.

The gradient of the loss w.r.t. the weights {\lambda}_k comes out to be :

\frac{{\partial}L}{{\partial}{\lambda}_k}=-[E_1(f_k)-E_2(f_k)]+\frac{{\lambda}_k}{{\sigma}^2}

If we can compute the values of E_1(f_k) and E_2(f_k) then we can use gradient descent to update the value of the weights {\lambda}_k

{\lambda}^{(t+1)}_k={\lambda}^{(t)}_k-{\eta}*\frac{{\partial}L}{{\partial}{\lambda}_k}={\lambda}^{(t)}_k+{\eta}*[E_1(f_k)-E_2(f_k)-\frac{{\lambda}_k}{{\sigma}^2}]

where \eta is the learning rate.
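A one-line sketch of this update for a single weight, assuming E_1(f_k) and E_2(f_k) have been computed as described below :

```python
def update_weight(lam, e1, e2, eta=0.1, sigma=1.0):
    # gradient step on the regularized negative log likelihood:
    # lambda <- lambda + eta * (E_1(f_k) - E_2(f_k) - lambda / sigma^2)
    return lam + eta * (e1 - e2 - lam / sigma ** 2)
```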

E_1(f_k) is the expected value of the feature function in the training data. For example in a training dataset for addresses, if the feature f_k is given as follows :

f_k=1 if x_j="Bangalore" and y_j=CITY_NAME and y_{j-1}=STREET_NAME else f_k=0

Then the expected value of the feature f_k, E_1(f_k) is the total number of times the pattern

[x_j="Bangalore" and y_j=CITY_NAME and y_{j-1}=STREET_NAME]

occurs in all the training addresses divided by the total number of training instances, i.e. (X, Y) pairs.

E_1(f_k)=\frac{1}{N}\sum_{X'}\sum_{j=1}^nf_k(X', y_j, y_{j-1}, j)

The summation is only over X, because each observed sequence X is associated with only one particular label sequence Y in the training data.

E_2(f_k) is the expected value of the feature f_k as predicted by the model, which is given to be :

E_2(f_k)=\sum_{X'}\sum_{Y'}p(Y', X')\sum_{j=1}^nf_k(X', y_j, y_{j-1}, j)

Note that the summation is over both X and Y, since the model would associate each observed sequence X' with all possible label sequences Y' with certain probabilities. Writing this in terms of the conditional probability :

E_2(f_k)=\sum_{X'}\sum_{Y'}p(X')*p(Y'|X')\sum_{j=1}^nf_k(X', y_j, y_{j-1}, j)

E_2(f_k)=\frac{1}{N}\sum_{X'}\sum_{Y'}p(Y'|X')\sum_{j=1}^nf_k(X', y_j, y_{j-1}, j), considering p(X')=\frac{1}{N}

The second summation from the left implies that the term

\sum_{Y'}p(Y'|X')\sum_{j=1}^nf_k(X', y_j, y_{j-1}, j)

is over all possible label sequences for a fixed input sequence X'. As mentioned earlier, for M possible label types there can be at most M^n possible labelings of an n-length sequence, hence direct summation requires an exponential number of computations.

Note that in a linear chain CRF, each current label depends only on the previous label. Instead of enumerating all possible label sequences Y, one can use a dynamic programming approach for computing E_2(f_k) efficiently (Time complexity of O(M^2n)).

Let \alpha_j(s|X) represent the total score of the sequences from 1 to j ending with the output label y_j=s. Although there are M^j different label sequences for positions 1 to j, each possible label s at position j only needs the M values \alpha_{j-1}(s'|X), which already summarize all the sequences ending at position (j-1), to obtain the scores of all sequences ending at position j.

\alpha_j(s|X)=\sum_{s'{\epsilon}Y}\alpha_{j-1}(s'|X)*{\Psi}(X, s', s)

{\Psi}(X, s', s)=exp(\sum_{i=1}^m{\lambda}_if_i(X, y_j=s, y_{j-1}=s', j)), is the score of transitioning from label s' to label s.

\alpha_j(s|X) computes the scores using dynamic programming from the front of the sequence. Similarly one can compute the scores from the back of the sequence :

\beta_j(s|X)=\sum_{s'{\epsilon}Y}\beta_{j+1}(s'|X)*{\Psi}(X, s, s')

Note that the above recursion computes \beta_j based on the values of \beta_{j+1}, i.e. computing in reverse. The method of computing the sequence scores using \alpha_j(s|X) and \beta_j(s|X) is known as the Forward-Backward algorithm.

Forward Backward algorithm
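A compact sketch of the two recursions, assuming a score function psi(x, s_prev, s, j) that returns {\Psi} for position j, and a START boundary label for the first position :

```python
def forward(x, labels, psi):
    # alpha[j][s]: total score of all label sequences for positions 0..j ending in label s
    # (a synthetic "START" label is assumed before the first position)
    n = len(x)
    alpha = [{s: psi(x, "START", s, 0) for s in labels}]
    for j in range(1, n):
        alpha.append({s: sum(alpha[j - 1][sp] * psi(x, sp, s, j) for sp in labels)
                      for s in labels})
    return alpha

def backward(x, labels, psi):
    # beta[j][s]: total score of all label sequences for positions j+1..n-1, given label s at j
    n = len(x)
    beta = [dict() for _ in range(n)]
    beta[n - 1] = {s: 1.0 for s in labels}
    for j in range(n - 2, -1, -1):
        beta[j] = {s: sum(psi(x, s, sp, j + 1) * beta[j + 1][sp] for sp in labels)
                   for s in labels}
    return beta
```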

Using the Forward-Backward algorithm, we can efficiently compute E_2(f_k),

E_2(f_k)=\frac{1}{N}\sum_{X'}\frac{1}{Z(X')}\sum_{j=1}^n\sum_{s'}\sum_sf_k(X', y_j=s, y_{j-1}=s', j)*\alpha_{j-1}(s'|X')*{\Psi}(X', s', s)*\beta_j(s|X')

where Z(X')=\sum_s\alpha_n(s|X') is the normalization constant, i.e. the denominator of p(Y'|X').
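Reusing the forward and backward sketches above, the contribution of a single sequence to E_2(f_k) can be accumulated as follows (positions are 0-indexed here, and only the pairwise terms from the second position onwards are shown) :

```python
def expected_feature_count(x, labels, psi, f_k):
    # Model expectation of feature f_k for a single sequence x, using the
    # forward and backward tables and the normalization constant Z(x).
    alpha, beta = forward(x, labels, psi), backward(x, labels, psi)
    n = len(x)
    z = sum(alpha[n - 1][s] for s in labels)  # Z(x): total score of all label sequences
    total = 0.0
    for j in range(1, n):                     # pairwise terms over adjacent positions
        for sp in labels:
            for s in labels:
                total += (f_k(x, s, sp, j)
                          * alpha[j - 1][sp] * psi(x, sp, s, j) * beta[j][s])
    return total / z
```

Averaging this quantity over all training sequences (the 1/N factor) gives E_2(f_k).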

Once we have computed the optimum values of feature weights {\lambda}_k, we can use the trained model to make predictions on unlabeled test sequences. To compute the most likely label sequence for a test input sequence of length n, we use the Viterbi algorithm. The procedure is similar to the Forward-Backward algorithm.

In Viterbi, instead of computing the sum of the scores ending at position j with label s as in the F-B algorithm, we compute the maximum of the scores ending at position j with label s, and also remember which previous label s' achieved that maximum.

\gamma_j(s|X)=\text{max}_{s'{\epsilon}Y}\;\gamma_{j-1}(s'|X)*{\Psi}(X, s', s)

b_j(s)=\text{arg max}_{s'{\epsilon}Y}\;\gamma_{j-1}(s'|X)*{\Psi}(X, s', s)

where b_j(s) is a backpointer storing the best previous label s' when position j takes the label s. The predicted label for the last position is y^*_n=\text{arg max}_{s{\epsilon}Y}\;\gamma_n(s|X), and the predicted labels y^*_j for the earlier positions are recovered by backtracking through the stored backpointers.

The above recurrence relation assumes that the best possible sequence till position j with label s can be obtained by extending the best possible sequences from position (j-1) with label s', i.e. not all sequences ending at position (j-1) with label s' can be extended to position j to give the best sequence, but only the best sequence ending with each label. The above can be efficiently computed using dynamic programming.
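A sketch of Viterbi decoding with explicit backpointers, reusing the same psi score function and START convention as the forward-backward sketch above :

```python
def viterbi(x, labels, psi):
    # gamma[j][s]: score of the best label sequence for positions 0..j ending in label s
    n = len(x)
    gamma = [{s: psi(x, "START", s, 0) for s in labels}]
    back = [dict()]
    for j in range(1, n):
        gamma.append({})
        back.append({})
        for s in labels:
            best_prev = max(labels, key=lambda sp: gamma[j - 1][sp] * psi(x, sp, s, j))
            back[j][s] = best_prev                      # backpointer b_j(s)
            gamma[j][s] = gamma[j - 1][best_prev] * psi(x, best_prev, s, j)
    # best label at the last position, then follow the backpointers in reverse
    y = [max(labels, key=lambda s: gamma[n - 1][s])]
    for j in range(n - 1, 0, -1):
        y.append(back[j][y[-1]])
    return list(reversed(y))
```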
