Machine Learning, AI and Programming

BiLSTM-CRF Sequence Tagging for E-Commerce Attribute Extraction

In the last post we had used Conditional Random Fields (CRF) to extract attributes from e-commerce product titles and description. CRFs are linear models just like Logistic Regression. The drawback with linear models is that they do not take feature-feature interaction or higher order feature terms into account while building model. Linear models can under-fit on the data while too much non-linearity can lead to over-fitting. Non-linear models such as Neural Networks with proper regularization can be useful.

Instead of Feedforward Neural Networks we can use Recurrent Neural Networks or LSTM networks because they emit an output at each time step i.e. if we are working with text data, then RNN or LSTM outputs a label for each word. So this is how, we can train a sequence tagging model with LSTM.

  • Compute word vector representations for each word in the vocabulary. Vector representations could be word embeddings, one hot encoding, character embeddings etc.
  • Keep the number of LSTM units to some constant value e.g. 50 or the maximum number of words in any sentence.
    • Pad all sentences with zero vectors for which number of words is less than the number of LSTM units.
    • Truncate sentences with more words than the number of LSTM units.
  • Each LSTM unit emits a hidden layer representation which can be further passed through one or multiple dense layers to an output layer with N nodes, where N is the number of possible tags for a word.
    • Thus output layer for each LSTM unit emits probabilities for each output tag.
  • Each LSTM units takes as input the hidden layer representation from the previous LSTM unit apart from the current word vector representation.
    • Thus the output label prediction for the i-th word can take into account which words were present in 0 to (i-1)-th positions.
    • For e.g. in NER tagging, the tag 'B-Location' is more likely if the previous two words are 'located' and 'at'.
  • In the above LSTM network, we had assumed that the flow of hidden representation is from left to right i.e. i-th word can depend only on words at 0 to (i-1)-th position but not on words after it i.e. i+1 onwards. For e.g. the tag 'B-Person' for the current word is more likely if the next word is 'said' or 'did'.
    • To overcome this, we can use Bidirectional LSTM network. In Bi-LSTM, the tag for the current word can depend on words at 0 to (i-1)-th position as well as (i+1)-th position onwards.
    • The resultant hidden representation for each LSTM unit is thus concatenation of the forward and the backward hidden representations for that unit.

BiLSTM Network for Sequence Modelling

Now once we have the tag probabilities at each LSTM unit or for each word, we can then construct the tag sequence for the sentence in following two ways:

  1. Use the tag with the highest probability as the predicted tag for the current word. Do these independently for all words and obtain the tag sequence.
  2. Using the tag probabilities for each word and the transition probabilities of the tags, compute the best tag sequence using dynamic programming technique such as the Viterbi algorithm.

Let's consider the following sentence for NER tagging:

"Stan Lee lives in America"

The tags "I-Person" or "E-Person" are more likely for the word "Lee" given that the tag for the previous word "Stan" is "B-Person" and the first letter of "Lee" is capitalized. Similarly the tag for "America" is more likely to be "B-Location" because the previous two words are "lives" and "in" and the word previous to that has a tag of "E-Person".

Thus although LSTM uses neighboring word information while predicting tag for current word, we would be missing out on the neighboring tag information which is critical is many cases as seen by the above example. That is why instead of independently predicting tag for each word as in approach 1, we would generally consider approach 2 to be more suitable.

To keep things simple, we would be learning tag dependencies only for consecutive words, i.e. the tag for word at position i will only depend on the tag for the word at position i-1.

Given a sequence of L words and the tags:

[(x_0, y_0), (x_1, y_1), ..., (x_{L-1}, y_{L-1})]

the score of this sequence can be computed in multiple different ways. All we need are two matrices:

A[t_k, t_j]: Transition probability matrix, where each entry represents the transition probability from the tag t_k at position i-1 to tag t_j at position i.

We can compute the transition matrix quite easily from the training data. Simply put, if C[tk, tj] denotes the number of times tag tk is followed by tag tj in the training data, then:

A[t_k, t_j]=\frac{C[t_k, t_j]}{\sum_{r=1}^NC[t_k, t_r]}

B[i, t_j]: LSTM probability matrix, where each entry represents the LSTM network output probability of the tag t_j for the word at position i.

One possible way to assign a score to the sequence is:

s(X, Y)=\prod_{i=0}^{N-1}B[i, y_i]*A[y_{i-1}, y_i]

which is very similar to the probability assigned to a sequence in HMM.

In order to determine the best possible sequence Y_{\text{best}} for the input X:

Y_{\text{best}}=\text{argmax}_{Y'}s(X, Y')

Using the fact that the possible tags for the current word can only depend on the possible tags of the previous word, we can use a dynamic programming approach to determine Y_{\text{best}}.

Let Q(i-1, t_r) denote the score of the best tag sequence for words at positions 0 to i-1, with the tag for the word at position i-1 being t_r. Then to obtain Q(i, t_j), i.e. the score of the best tag sequence from 0 to i-th position with the tag for the i-th position being t_j, is:

Q(i, t_j)=\text{max}_{r=1}^N\left\{Q(i-1, t_r)*B[i, t_j]*A[t_r, t_j]\right\}

The above approach is known as the Viterbi algorithm.

Once we obtain the values Q for all positions i and all tags t_r, we can then backtrack to construct the best possible sequence in the following manner:

  • Compute u_{L-1}=\text{argmax}_{r=1}^NQ(L-1, t_r) to be the best tag for the last word.
  • Then for L-2 th word, u_{L-2}=\text{argmax}_{r=1}^N\left\{Q(L-2, t_r)*B[L-1, u_{L-1}]*A[t_r, u_{L-1}]\right\}
  • Repeat the above calculation for all words L-3, L-4, ... upto 0 in backward fashion
  • The final set of best possible tags obtained is Y_{\text{best}}=\left\{u_0, u_1, ..., u_{L-1}\right\}

Observe that in the approach suggested above, the LSTM model is trained independent of the Viterbi algorithm, i.e. during training of the LSTM model, it does not take the tag dependencies into account as it considers each tag to be independent and learn the parameters of each LSTM unit solely based on the mis-prediction of the tag for that unit only. Viterbi is only used at the time of prediction of the final tag sequence.

We can overcome this by using a CRF layer on top of the LSTM layer. The softmax layer after the LSTM output layer is replaced with the CRF layer.

BiLSTM-CRF Network for Sequence Modelling

CRF does not output tag probabilities or scores for individual words unlike a LSTM network. CRF gives a score for the entire sequence of tags at once.

The score function for a LSTM + linear chain CRF is given as:

s_{\text{CRF}}(X, Y)=\sum_{i=0}^{N-1}w_i.v_i+w'[y_{i-1}, y_i]

where v_i is the LSTM last dense layer output for the i-th word and w_i is the CRF weight vector corresponding to the i-th word, of the same dimension as v_i.

w_i.v_i is their dot product.

w'[y_{i-1}, y_i] is the weight of the transition from the tag y_{i-1} to y_i.

The unknown parameters of the network are the weights and biases of the LSTM network, and the weights w_i and w'[y_{i-1}, y_i] of the CRF layer.

The softmax layer is applied after the CRF layer. Thus the probability of the tag sequence:

P(Y|X)=\frac{e^{s(X, Y)}}{\sum_{Y'}e^{s(X, Y')}}

where in the denominator the summation \sum_{Y'} is over all possible tag sequences.

The learning objective of the network (BiLSTM+CRF) is to maximize the log-likelihood of the probability of the true tag sequence and thus:

L=\text{log}\;P(Y|X)=s(X,Y)-\text{log}(\sum_{Y'}e^{s(X, Y')})

For a sequence of length L and N possible tags, the summation would contain NL terms.

But we will assume that we are working with linear chain CRF in which the tag of the current word is only dependent on the tag of the previous word in the sequence, which makes computation of the second term:

Z=\sum_{Y'}e^{s(X, Y')}

easier using a dynamic programming approach known as the Forward-Backward algorithm. Refer to my earlier post of CRFs to understand how this works.

In short, let

F(X, Y)=e^{s(X, Y)} and f(X_i, Y_i)=F(x_0, x_1, ..., x_i, y_0, y_1, ..., y_i)

i.e. f(X_i, Y_i) is F but only with the first i words in the sentence, and let a(i-1, t_r) denote the sum of all tag sequences till position i-1, ending with the tag t_r, i.e.

a(i-1, t_r)=\sum_{y'_0, y'_1, ..., y'_{i-2}}f(X_i, y'_0, y'_1, ..., y'_{i-2}, y'_{i-1}=t_r)

Then we can write:

a(i, t_j)=\sum_{r=1}^N[a(i-1, t_r)*e^{w_i.v_i+w'[t_r, t_j]}]

Thus we can write the final summation as:

Z=\sum_{Y'}e^{s(X, Y')}=\sum_{r=1}^Na(L-1, t_r)

The unknown parameters can be computed using stochastic gradient descent technique during learning of the model.

Let's get our hands dirty. Recall that in the last post, we had started with a problem of extracting attributes from titles of laptops from an e-commerce website. For e.g. given the following product title for Mac-book Air.

"Apple MacBook Air MQD32HN/A 13.3-inch Laptop 2017 Core i5, 8GB, 128GB, MacOS Sierra, Integrated Graphics"

An attribute extraction algorithm should be able to return the following attributes:

{'brand':'Apple', 'screen-size':'13.3-inch', 'memory':'8GB', 'hard-disk capacity':'128GB', 'operating-system':'MacOS Sierra', 'manufacture-year':'2017', 'processor':'Core i5'}

The problem is as follows:

We have a huge catalog of products. For e.g. lets take laptops. There are about 20-30 million items listed under laptop product type. Each item in the catalog is uploaded by either our own seller or third-party sellers. Also the seller needs to specify certain attributes of the item but all attributes are not mandatory. As a result there are many items under laptop for which most or all the attributes such as brand name, screen size, ram size etc. are specified and there are also many items for which attributes are missing. Although the attributes are missing but these items have title and description specified that are used to display the item on the website. The title/desc. contains the attributes values embedded in the text.

Our task is to infer as many as possible attribute value pairs for these items, for which there are missing attribute details. The attributes we are mostly interested for this example are:

  1. Brand name
  2. Screen size
  3. Hard Disk Drive capacity
  4. RAM size
  5. Screen resolution
  6. Processor speed
  7. Processor type

The approach we would take is to learn a model with whatever attribute value pairs are available in the catalog data and then use this model to infer the missing attribute value pairs. For this we also need to evaluate how well our model is learning.

  • Using the available attribute value pairs present in the catalog, tag the tokens in the title with the appropriate tags.
    • Tagging follows the SBIOE encoding format. For multiword attribute values, we use B-I-E prefixes but for single words we use the prefix S, i.e. 'Hewlett Packard' would be tagged as (B-Brand, E-Brand) whereas 'Apple' would be tagged as (S-Brand).
    • Mapping the attribute values from the catalog entries to the corresponding token(s) in the title requires some work - lemmatizations, substring matching, subsequence matching, regex matching etc. Will skip this part for the post, you can check the codes out here.
  • We have taken out 140K items under laptop. 80% of the data is used for training the model and remaining 20% for validation.
  • Only the first 50 words from each sentence has been used for training and testing our model.
    • Sentences having more than 50 words are truncated.
    • Sentences having less than 50 words are padded with the text 'PAD_TXT'.
    • Corresponding tag labels are padded with 'O' tags.
  • The performance of the model is evaluated based on precision, recall and F1-score for each attribute and also overall. The scores are computed for whole chunk (B-Brand, I-Brand, I-Brand, ..., E-Brand) and not for individual B-Brand or I-Brand or E-Brand and so on.
  • We also compute the tagging accuracy for the entire sentences as well.
  • We have used the following sequence models and compared them against the performance metrics:
    • Only CRF with hand-crafted features.
    • Bi-directional LSTM that independently predicts tag for each word.
    • Bi-directional LSTM + CRF

For CRF training, the data consists of 'sentences' and 'labels'. 'sentences' is a list of list data structure where each element of 'sentences' is a list of tokens. Similarly 'labels' is also a list of list where each element of 'labels' is a list of tags for the corresponding tokens in 'sentences'.

For our text:

"Apple MacBook Air MQD32HN/A 13.3-inch Laptop 2017 Core i5, 8GB, 128GB, MacOS Sierra, Integrated Graphics"

The corresponding element in 'sentences' would be:

['Apple', 'MacBook', 'Air', 'MQD32HN/A', '13.3-inch', 'Laptop', '2017', 'Core', 'i5', '8GB', '128GB', 'MacOS', 'Sierra', 'Integrated', 'Graphics']

and the corresponding labels would be:

['S-Brand', 'O', 'O', 'O', 'S-size', 'O', 'O', 'B-proc', 'E-proc', 'S-ram', 'S-hdd', 'O', 'O', 'O', 'O']

The functions for training CRF:

'train_crf_model' is the method that is called with the training sentences and training labels. First we generate the hand crafted features for each word in training sentences and then using the 'pycrfsuite' library train the CRF model.

Note the 'get_word_features' function, which has been hand-crafted keeping the "structure and morphology" of the attributes in mind.

The methods being used to compute the precision, recall, f1-score and accuracy are as follows:

'get_classification_score' is the method being called to return precision, recall and f1-score for individual attributes as well as the overall numbers (which is the weighted sum). Observe that the 'get_classification_score' method computes the attribute chunks first by calling the 'get_sequences' method and then computes the true positive, false positive and false negative values.

'get_accuracy' method just takes two sequences at time (the predicted vs. the actual) and compares them word to word to see if they are the same or different.

For our CRF model, the overall precision, recall, f1-score and accuracy values are as follows:

  • Precision = 89.5%
  • Recall = 93.2%
  • F1-Score = 91.3%
  • Sequence Accuracy = 58.5%

Next, we come to implementing BiLSTM networks for sequence modelling. For sequence modelling with LSTM, we need to change the input and output (labels) format to suit the network structure.

Instead of list of tokens for each sentence as input, the LSTM network takes the indices of the token. Thus we create an inverted index of the tokens in the vocabulary and substitute the token with its corresponding index.

Similarly the output tags are transformed into categorical labels. Following is the BiLSTM model class definition:

Instead of directly passing word embeddings as inputs to the LSTM network, we introduce an Embedding layer after the input layer. 'TimeDistributed' is used to replicate the dense layer and the output layer at each of the LSTM units. The training is run for 15 epochs with a batch size of 32.

For inference, the model predicts tag probability distribution at each word, so we use the 'pred2label' function to only get the tag with the highest probability. Thus the final prediction is a sequence of tags for a sentence. 'tag_inverse_transformer' is an index that maps the index of a tag to its corresponding value.

The performance of the BiLSTM model is as follows:

  • Precision = 94.2%
  • Recall = 94.8%
  • F1-Score = 94.5%
  • Sequence Accuracy = 74%

There is a considerable improvement in the overall performance numbers by using BiLSTM network over the CRF model and that too without using any hand-crafted features. Note that I have not used Viterbi decoding on top of the predicted tag distribution, but it can be easily implemented using the outputs.

Finally we come to implementing the BiLSTM-CRF network, where we replace the last TimeDistributed softmax output layer with a CRF layer from the "keras_contrib" package. We will not re-implement a class for BiLSTM-CRF architecture but just inherit the BiLSTM class and overwrite the 'build_model()' function only.

The inputs and outputs are the same for this network. The performance of the BiLSTM-CRF model is as follows:

  • Precision = 94.4%
  • Recall = 95.1%
  • F1-Score = 94.7%
  • Sequence Accuracy = 75%

There is not much performance improvement in BiLSTM-CRF as compared to only BiLSTM implying that for our problem, the neighboring tag information is not as much helpful as the word features themselves and the neighboring words information which is already captured by the BiLSTM network.

The full code is hosted on Github.


Edit 1:

I tried to apply Viterbi algorithm to the BiLSTM network output tag distribution. The code for the Viterbi decoding is as follows:

'get_transition_probs' function is used to get the transition probabilities from tags t_j to t_k. Also the functions computes the start probabilities for each tag, i.e. the probability of the tag being the first tag in a sentence. This is important because in the 'viterbi_decoding' function, transition probabilities are only applicable from the 2nd word onwards in a sentence. For the first word in a sentence we use the 'start_probs' instead of 'transition_probs'.

But the numbers (apart from precision) are not that good after applying the Viterbi step:

  • Precision = 96.1%
  • Recall = 89.5%
  • F1-Score = 92.2%
  • Sequence Accuracy = 67%


Tags: , , , , , ,