Understanding Conditional Random Fields

Given a sequence of observations, many machine learning tasks require us to label each observation in the sequence with a corresponding class (or named entity) such that the overall likelihood of the labelling is maximized. For example, given an English sentence, i.e. a sequence of words, label each word with a Part-Of-Speech (POS) tag such that the overall tag sequence for the sentence is the most likely one.

"Machine Learning is a field of Artificial Intelligence"

The Hidden Markov Model (HMM) is a generative machine learning algorithm that can compute two different quantities for us.

  • The probability of a sequence of labels, e.g. what is the probability of the following sequence of POS tags for the above sentence : [NNP, NNP, VBZ, DT, NN, IN, NNP, NNP] ?
  • The best possible sequence of labels, i.e. the one that maximizes the probability of the overall prediction.

The joint probability of a sequence of labels y_1, y_2, y_3 with a sequence of observations x_1, x_2, x_3 can be expanded by applying the chain rule of probability :

P(x_3, y_3, x_2, y_2, x_1, y_1)=P(x_3|y_3, x_2, y_2, x_1, y_1)*P(y_3|x_2, y_2, x_1, y_1)*P(x_2|y_2, x_1, y_1)*P(y_2|x_1, y_1)*P(x_1|y_1)*P(y_1)

Assuming that each observation is independent of other observations and only dependent on its label and each label is only dependent on the previous label, we can write the above equation as :

P(x_3, y_3, x_2, y_2, x_1, y_1)=P(x_3|y_3)*P(y_3|y_2)*P(x_2|y_2)*P(y_2|y_1)*P(x_1|y_1)*P(y_1)

The conditional probability of the labeling P(y_1, y_2, y_3|x_1, x_2, x_3), is given as :

P(y_1, y_2, y_3|x_1, x_2, x_3)=\frac{P(x_3, y_3, x_2, y_2, x_1, y_1)}{P(x_1, x_2, x_3)}

The denominator does not depend on the labels, so for the purpose of comparing labellings it acts only as a normalization constant. If we further assume that the observations are independent of each other, it factorizes as

P(x_1, x_2, x_3)=P(x_1)*P(x_2)*P(x_3), where each P(x_i) is just the relative frequency of that observation in the training data.

The quantities P(x_i|y_i) are called the emission probabilities and the P(y_i|y_{i-1}) are called the transition probabilities; both are estimated from the training data.
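
As a minimal sketch of how this factorization is used, the snippet below computes the joint log-probability of a tag sequence under hand-set toy emission and transition tables (the numbers are made up purely for illustration, not estimated from any corpus) :

import math

# Toy HMM parameters (hypothetical values, for illustration only).
start_prob = {"NNP": 0.4, "VBZ": 0.1, "DT": 0.3, "NN": 0.1, "IN": 0.1}
transition = {("NNP", "NNP"): 0.4, ("NNP", "VBZ"): 0.3, ("VBZ", "DT"): 0.5,
              ("DT", "NN"): 0.6, ("NN", "IN"): 0.4, ("IN", "NNP"): 0.5}
emission = {("NNP", "Machine"): 0.01, ("NNP", "Learning"): 0.01, ("VBZ", "is"): 0.2,
            ("DT", "a"): 0.3, ("NN", "field"): 0.05, ("IN", "of"): 0.3,
            ("NNP", "Artificial"): 0.01, ("NNP", "Intelligence"): 0.01}

def hmm_joint_log_prob(words, tags):
    # log P(x, y) = log P(y_1) + log P(x_1|y_1) + sum_i [log P(y_i|y_{i-1}) + log P(x_i|y_i)]
    logp = math.log(start_prob[tags[0]]) + math.log(emission[(tags[0], words[0])])
    for i in range(1, len(words)):
        logp += math.log(transition[(tags[i - 1], tags[i])])
        logp += math.log(emission[(tags[i], words[i])])
    return logp

words = ["Machine", "Learning", "is", "a", "field", "of", "Artificial", "Intelligence"]
tags = ["NNP", "NNP", "VBZ", "DT", "NN", "IN", "NNP", "NNP"]
print(hmm_joint_log_prob(words, tags))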

Coming to the second part, determining the best sequence of labels, i.e. the one that maximizes the joint probability of the labels and the observations, one could enumerate all possible label sequences and choose the one with the highest joint probability.

Assuming there are M different possible labels and the length of the observation sequence is n, we would need to evaluate M^n possible label sequences in order to choose the best among them. But since the label for a particular observation depends only on the label of the previous observation, we can use a dynamic programming approach to solve this in polynomial time O(M^2n). This approach is known as the Viterbi algorithm.
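
A minimal Viterbi sketch under the same assumptions (it reuses the toy start_prob, transition and emission tables and the words list from the previous snippet; unseen entries get a tiny back-off value so that the logarithms stay finite) :

import math

def viterbi(words, labels, start_prob, transition, emission):
    # delta[i][s]: best log-score of any label sequence for words[:i+1] that ends in label s
    # back[i][s]: the label at position i-1 on that best sequence (used to recover the path)
    delta = [{s: math.log(start_prob.get(s, 1e-12)) +
                 math.log(emission.get((s, words[0]), 1e-12)) for s in labels}]
    back = [{}]
    for i in range(1, len(words)):
        delta.append({})
        back.append({})
        for s in labels:
            best_prev, best_score = None, float("-inf")
            for s_prev in labels:
                score = (delta[i - 1][s_prev] +
                         math.log(transition.get((s_prev, s), 1e-12)) +
                         math.log(emission.get((s, words[i]), 1e-12)))
                if score > best_score:
                    best_prev, best_score = s_prev, score
            delta[i][s] = best_score
            back[i][s] = best_prev
    # Follow the back-pointers from the best final label to recover the full path.
    last = max(delta[-1], key=delta[-1].get)
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

labels = ["NNP", "VBZ", "DT", "NN", "IN"]
print(viterbi(words, labels, start_prob, transition, emission))

Each of the n positions considers all M candidate labels and, for each, all M previous labels, which gives the O(M^2n) running time mentioned above.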

Whereas HMM is a generative algorithm, Conditional Random Fields (CRF) are discriminative algorithms for the same kind of tasks. Some advantages of CRF over HMM are that, in CRF :

  • We do not assume that each observation is independent of other observations. This is very important as some words such as "is" might be very frequent after the word "this", or "are" might frequently occur after "we" or "these", and so on. HMM cannot capture this because, given the labels, it treats each word as independent of its neighbouring words.
  • There could be arbitrary dependencies between labels unlike HMM where each label is dependent only on the previous label.
  • The features for a particular word can depend on any other subset of words in the sequence. HMM cannot do this due to the independence assumption.
  • In the presence of hidden states, the training objective in HMM uses the EM algorithm, which only finds a local maximum, whereas the training objective in CRF is concave and has a globally optimal solution. But unlike HMM, CRF cannot work with hidden states, i.e. without labels; it is purely a supervised algorithm.

Similar to HMM, CRF can also be used to compute the above two quantities for a given sequence labelling problem. The only difference lies in how it estimates the probabilities of label sequences.

CRF first estimates a score for a sequence (which is not a probability). For a linear chain CRF, i.e. a CRF where the current label depends only on the previous label :

\text{score}(y_1, y_2, y_3|x_1, x_2, x_3)=\sum_{i=1}^3\sum_{j}\lambda_jf_j(x, y_i, y_{i-1}, i)

where f_j(x, y_i, y_{i-1}, i) is a feature function, normally binary-valued, that depends on the observation sequence, the current label, the previous label (linear chain) and the position 'i' in the sequence.

\lambda_j are the weights for these feature functions. A large positive weight indicates that, when the feature fires, the labelling becomes more likely, whereas a large negative weight penalizes labellings for which the feature fires.

A higher score implies that the assigned label sequence is a better fit.

Note that these feature functions can encode complex dependencies between labels, but the linear chain is the simplest and most commonly used variant, as it mirrors the Markov assumption of HMM.

Examples of feature functions, w.r.t. the above sentence, might be :

f_j(x, y_i, y_{i-1}, i)=1 if \text{isTitle}(x_i)=TRUE and y_i=NNP else f_j(x, y_i, y_{i-1}, i)=0

f_j(x, y_i, y_{i-1}, i)=1 if x_i[-3:]="ing" and y_i=VBG else f_j(x, y_i, y_{i-1}, i)=0

f_j(x, y_i, y_{i-1}, i)=1 if x_i="Learning" and x_{i-1}="Machine" and y_i=NNP else f_j(x, y_i, y_{i-1}, i)=0

The transition feature functions will be defined as :

f_j(x, y_i, y_{i-1}, i)=1 if y_i=VBZ and y_{i-1}=NNP else f_j(x, y_i, y_{i-1}, i)=0

and so on...
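
As a minimal sketch, the example feature functions above might look like this in Python (the function names are made up for illustration, and x is the full list of words in the sentence) :

def f_title_nnp(x, y_i, y_prev, i):
    # 1 if the current word is title-cased and is tagged NNP
    return 1 if x[i].istitle() and y_i == "NNP" else 0

def f_ing_vbg(x, y_i, y_prev, i):
    # 1 if the current word ends with "ing" and is tagged VBG
    return 1 if x[i].endswith("ing") and y_i == "VBG" else 0

def f_machine_learning(x, y_i, y_prev, i):
    # 1 if the current word is "Learning", the previous word is "Machine" and the tag is NNP
    return 1 if i > 0 and x[i] == "Learning" and x[i - 1] == "Machine" and y_i == "NNP" else 0

def f_nnp_vbz(x, y_i, y_prev, i):
    # transition feature: 1 if the previous label is NNP and the current label is VBZ
    return 1 if y_prev == "NNP" and y_i == "VBZ" else 0

feature_functions = [f_title_nnp, f_ing_vbg, f_machine_learning, f_nnp_vbz]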

In text classification, analogous features would simply be the TF-IDF scores of the individual terms (words, n-grams etc.), but in sequence labelling these feature functions enable us to represent very sparse features.

Once we obtain the above scores, we can easily transform them into probabilities using the softmax function :

P(y_1, y_2, y_3|x_1, x_2, x_3)=\frac{\text{exp}(\sum_{i=1}^3\sum_{j}\lambda_jf_j(x, y_i, y_{i-1}, i))}{\sum_{y_1, y_2, y_3}\text{exp}(\sum_{i=1}^3\sum_{j}\lambda_jf_j(x, y_i, y_{i-1}, i))}

The denominator is just the normalization constant, where the summation is over all possible labellings of a length-3 sequence, i.e. M^3 possible values.
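
A brute-force sketch of the score and its softmax normalization, reusing the feature_functions list above; the weights are arbitrary illustrative values, and enumerating all M^n labellings is feasible only for a tiny sequence :

import math
from itertools import product

labels = ["NNP", "VBZ", "DT", "NN", "IN", "VBG"]
weights = [1.5, 0.8, 2.0, 1.2]   # one lambda_j per feature function, arbitrary values

def score(x, y, feature_functions, weights):
    # score(y|x) = sum over positions i and features j of lambda_j * f_j(x, y_i, y_{i-1}, i)
    total = 0.0
    for i in range(len(x)):
        y_prev = y[i - 1] if i > 0 else None
        for lam, f in zip(weights, feature_functions):
            total += lam * f(x, y[i], y_prev, i)
    return total

def conditional_prob(x, y, feature_functions, weights, labels):
    # P(y|x) = exp(score(y|x)) / sum over all possible labellings y' of exp(score(y'|x))
    z = sum(math.exp(score(x, list(y_prime), feature_functions, weights))
            for y_prime in product(labels, repeat=len(x)))
    return math.exp(score(x, y, feature_functions, weights)) / z

x = ["Machine", "Learning", "is"]
y = ["NNP", "NNP", "VBZ"]
print(conditional_prob(x, y, feature_functions, weights, labels))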

In order to compute the conditional probabilities in the above equation, we need to know the values of the unknowns \lambda_j.

The values of the unknowns \lambda_j are computed by maximizing the conditional log likelihood of the labelled training sequences, solving for the unknowns using iterative gradient ascent techniques. The conditional log likelihood is given by the formula :

L=\frac{1}{N}\sum_{(x, y)}log(P(y|x))-\frac{1}{2}\sum_{j}{\lambda}^2_j

The outermost summation is over all observation and label sequence pairs (x, y) in the training data, and N is the number of training sequences. The last term is the L2 regularization factor for the weights {\lambda}_j. Substituting back the values of P(y|x), we get :

L=\frac{1}{N}\sum_{(x, y)}log(\frac{\text{exp}(\sum_{i}\sum_{j}\lambda_jf_j(x, y_i, y_{i-1}, i))}{\sum_{y}\text{exp}(\sum_{i}\sum_{j}\lambda_jf_j(x, y_i, y_{i-1}, i))})-\frac{1}{2}\sum_{j}{\lambda}^2_j

L=\frac{1}{N}\sum_{(x, y)}log(\text{exp}(\sum_{i}\sum_{j}\lambda_jf_j(x, y_i, y_{i-1}, i)))-\frac{1}{N}\sum_{(x, y)}log(\sum_{y}\text{exp}(\sum_{i}\sum_{j}\lambda_jf_j(x, y_i, y_{i-1}, i)))-\frac{1}{2}\sum_{j}{\lambda}^2_j

L=Q(\lambda)-R(\lambda)-\frac{1}{2}\sum_{j}{\lambda}^2_j
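
A sketch of this regularized log likelihood over a small labelled training set, reusing the conditional_prob helper above (its brute-force normalization keeps the code readable but is only practical for very short sequences) :

import math

def log_likelihood(training_data, feature_functions, weights, labels):
    # L = (1/N) * sum over (x, y) of log P(y|x)  -  (1/2) * sum_j lambda_j^2
    N = len(training_data)
    ll = sum(math.log(conditional_prob(x, y, feature_functions, weights, labels))
             for x, y in training_data) / N
    return ll - 0.5 * sum(lam * lam for lam in weights)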

The gradient of the likelihood w.r.t. the unknowns \lambda_j is computed to be :

\frac{{\partial}Q}{{\partial}{\lambda}_j}=\frac{1}{N}\sum_{(x, y)}\sum_{i}f_j(x, y_i, y_{i-1}, i)=\frac{1}{N}E_1[f_j]

\frac{{\partial}R}{{\partial}{\lambda}_j}=\frac{1}{N}\sum_{x}\sum_{y'}P(y'|x)\sum_{i}f_j(x, y'_i, y'_{i-1}, i)=\frac{1}{N}E_2[f_j]

Thus the gradient ascent update for \lambda_j would be given as :

\frac{{\partial}L}{{\partial}{\lambda}_j}=\frac{1}{N}*[E_1(f_j)-E_2(f_j)]-{\lambda}_j

{\lambda}^{(t+1)}_j={\lambda}^{(t)}_j+{\eta}*[\frac{1}{N}*[E_1(f_j)-E_2(f_j)]-{\lambda}^{(t)}_j]

where \eta is the learning rate.
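
As a sketch, the corresponding update loop is only a few lines; here E1 and E2 are assumed to be lists of per-feature counts computed elsewhere, e.g. by the counting and forward-backward routines discussed below :

def gradient_ascent_step(weights, E1, E2, N, eta=0.1):
    # lambda_j <- lambda_j + eta * [ (1/N) * (E1[j] - E2[j]) - lambda_j ]
    return [lam + eta * ((E1[j] - E2[j]) / N - lam)
            for j, lam in enumerate(weights)]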

The likelihood function defined above is a concave function, and hence the solution is a global maximum. Thus any approach taken to solve for \lambda_j would eventually converge to the globally optimal solution. The simplest approach would be stochastic gradient ascent (or batch gradient ascent) with momentum; advanced techniques such as L-BFGS would speed up convergence to the solution.

The first quantity E_1[f_j] is the total observed count of the feature function f_j over all the training examples. For example, for the feature function :

f_j(x, y_i, y_{i-1}, i)=1 if \text{isTitle}(x_i)=TRUE and y_i=NNP else f_j(x, y_i, y_{i-1}, i)=0

The value of E_1 for the above feature function would be the total number of positions in the training data where the word is title-cased (first letter capitalized) and its POS tag is NNP.
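
A counting sketch for E_1 over a labelled training set (training_data is assumed to be a list of (words, tags) pairs, and feature_functions is the list defined earlier) :

def empirical_counts(training_data, feature_functions):
    # E1[j] = total count of feature j over all positions of all training sequences
    E1 = [0.0] * len(feature_functions)
    for words, tags in training_data:
        for i in range(len(words)):
            y_prev = tags[i - 1] if i > 0 else None
            for j, f in enumerate(feature_functions):
                E1[j] += f(words, tags[i], y_prev, i)
    return E1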

The second quantity E_2[f_j] is the expected value of the feature function f_j as predicted by the model itself. In the term :

\sum_{x}\sum_{y'}P(y'|x)\sum_{i}f_j(x, y'_i, y'_{i-1}, i)

The first summation \sum_{x} is over all observation sequences 'x' in the training data, but the second summation is over all possible label sequences of the same length as the observation (not only those present in the training data), since the model assigns a probability to every possible label sequence.

As mentioned earlier for HMM, with M possible labels there can be up to M^n possible labellings of an n-length sequence, hence direct summation would require an exponential number of steps to compute.

Note that in a linear chain CRF, each current label depends only on the previous label. Instead of enumerating all possible label sequences 'y', one can use a dynamic programming approach for computing E_2(f_j) efficiently.

For a given observation sequence 'x', let \alpha_k(s|x) denote the total unnormalized score of all partial label sequences over positions 1 to k that end with the label y_k=s, i.e.

\alpha_k(s|x)=\sum_{y'_1, y'_2, ..., y'_{k-1}, y'_k=s}\text{exp}(\sum_{i=1}^k\sum_{j}\lambda_jf_j(x, y'_i, y'_{i-1}, i))

Observe that, although there are M^{k-1} different label prefixes over positions 1 to k-1, each possible label s at position k only needs the M aggregated scores of the sequences ending at position (k-1) in order to obtain the scores of all sequences ending at position k.

Since the exponent decomposes over positions, the score of a label sequence ending with y_k=s factorizes into the score of its prefix ending with y_{k-1}=s' and a factor that involves only positions k-1 and k. Define this factor, the unnormalized transition weight from the label s' at position k-1 to the label s at position k, as :

\psi_k(s', s|x)=\text{exp}(\sum_j{\lambda}_jf_j(x, y'_k=s, y'_{k-1}=s', k))

Note that \psi_k(s', s|x) is not itself a probability; the normalization constant Z(x) is applied only once over the complete sequence, so it does not appear inside the recursion.

Thus we obtain the forward recursion :

\alpha_k(s|x)=\sum_{s'}\alpha_{k-1}(s'|x)*\psi_k(s', s|x)

\alpha_k(s|x) accumulates the scores from the front of the sequence using dynamic programming. Similarly, one can accumulate the scores from the back of the sequence :

\beta_k(s|x)=\sum_{s'}\beta_{k+1}(s'|x)*\psi_{k+1}(s, s'|x)

Note that the above recursion computes \beta_k from the values of \beta_{k+1}, i.e. it runs in reverse. The method of computing the sequence scores using \alpha_k(s|x) and \beta_k(s|x) is known as the Forward-Backward algorithm.

Forward Backward algorithm
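
A minimal sketch of the forward-backward recursions under the notation above (it reuses feature_functions and weights from the earlier sketches; positions are 0-indexed in the code) :

import math

def transition_weight(x, s_prev, s, i, feature_functions, weights):
    # psi_i(s', s|x) = exp(sum_j lambda_j * f_j(x, y_i=s, y_{i-1}=s', i))
    return math.exp(sum(lam * f(x, s, s_prev, i)
                        for lam, f in zip(weights, feature_functions)))

def forward_backward(x, labels, feature_functions, weights):
    n = len(x)
    # alpha[k][s]: total score of all label prefixes over positions 0..k ending with label s
    alpha = [{s: transition_weight(x, None, s, 0, feature_functions, weights) for s in labels}]
    for k in range(1, n):
        alpha.append({s: sum(alpha[k - 1][s_prev] *
                             transition_weight(x, s_prev, s, k, feature_functions, weights)
                             for s_prev in labels)
                      for s in labels})
    # beta[k][s]: total score of all label suffixes over positions k+1..n-1, given label s at k
    beta = [dict() for _ in range(n)]
    beta[n - 1] = {s: 1.0 for s in labels}
    for k in range(n - 2, -1, -1):
        beta[k] = {s: sum(transition_weight(x, s, s_next, k + 1, feature_functions, weights) *
                          beta[k + 1][s_next] for s_next in labels)
                   for s in labels}
    Z = sum(alpha[n - 1][s] for s in labels)   # normalization constant Z(x)
    return alpha, beta, Z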

Using the Forward-Backward algorithm, the marginal probability of the label pair (s', s) at positions (i-1, i) is \alpha_{i-1}(s'|x)*\psi_i(s', s|x)*\beta_i(s|x)/Z(x), where Z(x)=\sum_s\alpha_n(s|x) is the normalization constant. With these marginals we can efficiently compute E_2(f_j) :

E_2(f_j)=\sum_{x}\sum_{i}\sum_{s'}\sum_sf_j(x, y_i=s, y_{i-1}=s', i)*\frac{\alpha_{i-1}(s'|x)*\psi_i(s', s|x)*\beta_i(s|x)}{Z(x)}
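
Putting the pieces together, a sketch of E_2 using the forward_backward routine above (observations is assumed to be the list of training observation sequences) :

def model_expected_counts(observations, labels, feature_functions, weights):
    # E2[j] = sum over sequences x, positions i and label pairs (s', s) of
    #         f_j(x, s, s', i) * alpha_{i-1}(s') * psi_i(s', s) * beta_i(s) / Z(x)
    E2 = [0.0] * len(feature_functions)
    for x in observations:
        alpha, beta, Z = forward_backward(x, labels, feature_functions, weights)
        for i in range(len(x)):
            prev_labels = labels if i > 0 else [None]
            for s_prev in prev_labels:
                for s in labels:
                    a = alpha[i - 1][s_prev] if i > 0 else 1.0
                    psi = transition_weight(x, s_prev, s, i, feature_functions, weights)
                    marginal = a * psi * beta[i][s] / Z   # P(y_{i-1}=s', y_i=s | x)
                    for j, f in enumerate(feature_functions):
                        E2[j] += f(x, s, s_prev, i) * marginal
    return E2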

Thus we can compute the value of the unknowns \lambda_j iteratively,

{\lambda}^{(t+1)}_j={\lambda}^{(t)}_j+{\eta}*[\frac{1}{N}*[E_1(f_j)-E_2(f_j)]-{\lambda}^{(t)}_j]

Once we have computed the optimal values of the feature weights {\lambda}_j, we can use the trained model to make predictions on unlabelled test sequences. Similar to HMM, in order to compute the most likely label sequence for a test input sequence of length n, we use the Viterbi algorithm.
