Stokastik

Machine Learning, AI and Programming

Attribute Extraction from E-Commerce Product Description

In this post we are going to look into how one can use product title and description on e-commerce websites to extract different attributes of the product. This is a very fundamental problem in e-commerce which has widespread implications for Product Search (search filters), Product Matching (matching same items from different sellers), Product Grouping (grouping items by variants such as size and color), Product Graph (relationship between products based on attributes) and so on.

Attribute Values as Search Filters

For example, consider the following product title for a MacBook Air.

"Apple MacBook Air MQD32HN/A 13.3-inch Laptop 2017 Core i5, 8GB, 128GB, MacOS Sierra, Integrated Graphics"

An attribute extraction algorithm should be able to return the following attributes:

{'brand':'Apple', 'type':'MacBook Air', 'model':'MQD32HN/A', 'screen-size':'13.3-inch', 'memory':'8GB', 'hard-disk capacity':'128GB', 'operating-system':'MacOS Sierra', 'manufacture-year':'2017', 'processor':'Core i5'}

There are multiple different approaches for solving the above problem.

The simplest approach is to use regular expressions to extract the desired phrases and values. For example, a simple regex to extract the phrase containing the screen size from the above text is:

'(\d+\.\d+|\d+)-inch'

Similarly a regex to extract the RAM size would be '(\d+\.\d+|\d+)GB' and so on. But very soon we realize that these simple regexes are not sufficient to capture all variations.

  • What if the screen size is expressed with the inch symbol (") or just 'in' instead of 'inch', or there is a space between the number and the unit?
  • Both the RAM and HDD sizes can be expressed in GB, so if the text contains only the HDD size and not the RAM size, we would wrongly report the HDD size as the RAM size.
  • What if, instead of 'GB', the unit is expressed as 'Gigabytes', or in MB or TB?

The main advantage regular expressions have over machine learning methods is that they are unsupervised: with a large enough number of regexes they can work pretty well on any amount of data without any manual tagging.

Later we will see that we can use regexes for distant supervision of other machine learning models.

The next possible approach is to use supervised classification models. I will deliberately take a different example to illustrate this.

'Perfect Creations Men's Cotton And Leather Full Sleeve Blue Color Hooded T-Shirt'

The attribute that I am interested in is the color of the product.

Although it looks pretty simple to identify words in the text that are color names, there are challenges:

  • The text might also contain the names of available colors for variants of the product, in which case it is easy to confuse the color of the actual product with the available colors.
    • 'also available in red, green and yellow...'
  • The color name could be a nearby variant of a base color; for example, 'indigo' or 'navy' correspond to blue, 'tan' corresponds to 'beige', and so on.
  • Identifying the product as 'Multicolor' (more than 3 colors).

We can use a supervised algorithm such as Logistic Regression, SVM or Gradient Boosted Trees to create a classification model.

  • The input would be a TF-IDF matrix of features such as words and n-grams.
  • The output would be a vector of class labels corresponding to the actual color of the product. Note that this would be a multi-label classification problem, since a product might contain multiple colors (see the sketch below).
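
A minimal sketch of this approach, assuming scikit-learn (the titles, labels and variable names below are purely illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# toy training data (hypothetical)
titles = ["Perfect Creations Men's Cotton And Leather Full Sleeve Blue Color Hooded T-Shirt",
          "Classic Red And Yellow Striped Polo T-Shirt"]
colors = [['blue'], ['red', 'yellow']]                     # multi-label targets

vectorizer = TfidfVectorizer(ngram_range=(1, 2), lowercase=True)
X = vectorizer.fit_transform(titles)                       # TF-IDF matrix of words and bigrams

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(colors)                              # one binary column per color label

# one-vs-rest logistic regression handles the multi-label setting
model = OneVsRestClassifier(LogisticRegression())
model.fit(X, Y)

test = vectorizer.transform(["Women's Green Cotton Hooded Sweatshirt"])
predicted_colors = mlb.inverse_transform(model.predict(test))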

Now coming back to our earlier example of the MacBook, observe that the classification-based approach would not work well for the attributes of the laptop product type.

The classification approach only works in cases where the possible values of an attribute are bounded. In the case of 'color', the number of possible colors is fixed, whereas for the brand name of a laptop, some new company might come up tomorrow that manufactures low-cost gaming laptops. Similarly, for RAM size, Apple might decide to ship its MacBook with 32 GB RAM, and model numbers are set dynamically by the manufacturer, and so on.

Classification models for such attributes might work well on existing data but would fail when newer attribute values are encountered. Another drawback is that a classification model can only handle one attribute at a time, i.e. one model per attribute.

Instead of predicting the screen size based on the entire text, it makes much more sense to identify a chunk in the text that contains the value of the screen size and then extract the value using a regex from the chunk. E.g.

'MQD32HN/A 13.3-inch Laptop'

This is more robust since it is independent of the actual value of the screen size. It is similar to the regex approach, but in this case we would be 'learning regexes'.

This brings us to our next approach, i.e. sequence prediction models. Sequence-based models are quite common in NLP, for tasks such as POS tagging, Named Entity Recognition, Speech Recognition, Machine Translation, etc. We have already explored POS Tagging and NER in earlier posts.

Unlike classification, where the entire text is given a single class label, in sequence prediction each word in a sentence can have a class label. Thus, apart from the current word, sequence models can also use information about the previous, next or any other neighboring words when predicting the class label for that word.

So if the current word contains a number and the next word is 'inch' then most likely the current word is the screen size. Similarly, if the next word is 'GB' followed by the word 'RAM' then the current word is most likely the RAM size and so on.

Two of the most popular sequence prediction algorithms are HMM and CRF, both of which we have explored in earlier posts. Since CRF almost always performs better than HMM, we will use CRF to illustrate the concepts further in this post.

To train a sequence model, first of all we need to tag each word with the appropriate attribute.

For example, for the above title, we can have the following tags:

('Apple', B-brand), ('MacBook', B-name), ('Air', I-name), ('MQD32HN/A', B-model), ('13.3-inch', B-screen-size), ('Laptop', O), ('2017', B-manufacture-year), ('Core', B-processor), ('i5', I-processor), ...and so on.

The tagging is consistent with the BIO encoding scheme, where if an attribute 'attr' comprises 'm' tokens in the title, then the tags [B-attr, I-attr, I-attr, ..., I-attr] indicate that the first token is the beginning ('B' prefix) and the following m-1 tokens are intermediate ('I' prefix).

'O' denotes all tokens that are not part of any attribute. We can also have a separate 'E' tag to denote the end token for an attribute.

This is similar to the problem of Named Entity Recognition (NER).

Once we have the attribute tags for each word in the product title + description:

  • Train CRF model on this data
  • Use the tagger to predict the attribute tag for each word for an un-tagged product title.
  • Once we extract the phrase from the text using the tags [B-attr, I-attr, I-attr,... E-attr], we can use regex to extract the actual value of the attribute or normalize the phrase into a form that can be displayed on the website.
    • E.g. a screen-size filter might be displayed as 11-14"; the predicted phrase is normalized into this form if the extracted value lies within that range (see the sketch below).
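
As a rough illustration of this normalization step, the following sketch maps a predicted screen-size phrase onto a filter bucket (the buckets themselves are assumed here purely for the sake of example):

import re

# hypothetical screen-size filter buckets, for illustration only
buckets = [(0, 11, 'Below 11"'), (11, 14, '11-14"'), (14, 17, '14-17"'), (17, 100, '17" & above')]

def normalize_screen_size(phrase):
    # pull the numeric value out of the predicted chunk, e.g. '13.3-inch Laptop' -> 13.3
    match = re.search(r'(\d+\.\d+|\d+)\s*-?\s*(?:inch|in|")', phrase)
    if match is None:
        return None
    value = float(match.group(1))
    # map the value onto the filter bucket it falls into
    for low, high, label in buckets:
        if low <= value < high:
            return label
    return None

normalize_screen_size('MQD32HN/A 13.3-inch Laptop')        # -> '11-14"'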

Without getting into the theoretical details of CRF, we will jump directly into the implementation of sequence prediction models on proprietary datasets. Instead of a multi-attribute example, we will start with a single-attribute example: predicting the suitable age range for a product.

Age Range for Toys

Let's take the following example of a toy:

'Melissa & Doug Deluxe Picnic Basket Fill and Spill Soft Baby Toy - Weight - 7 oz. Assembly required - no assembly required. Dimensions - 6.7" l x 11" w x 11.5" h. Age recommendation - 6 months to 5 years.'

The above product description contains the suitable age range for the product, i.e. 6 months to 5 years. As mentioned earlier, we can use regexes to extract attribute values; for age range one can come up with the following regexes:

  • (\d+)-(\d+) years
  • (\d+)-(\d+) months
  • (\d+)months-(\d+) years
  • (\d+) to (\d+) years
  • (\d+) to (\d+) months
  • (\d+)months to (\d+) years
  • ..and so on

But we encounter several examples which are not captured by our pre-defined regexes. For example:

  • Not recommended for children below 3 years.
  • Recommended for kids 5 years & up.
  • Suitable for children not below 10 months and not recommended for ages above 8 years.
  • Age 8+
  • ...and so on.

The complexity of the regexes will increase over time, and it will become difficult to maintain them and track down bugs in individual regexes. We can apply two possible solutions here:

  1. Distant Supervision using Regex
    • Use simple regexes to tag a few examples (say 20-30% coverage) with BIOE encoding; when a CRF is trained on these, it will be able to capture a few more patterns.
    • With the predicted patterns verified by humans, we can tag more examples for further training and so on.
  2. Human annotation
    • Instead of tagging with regexes, let manual annotators point out the exact phrase in the text that contains the age range.
    • Then we can use string matching to identify the position in the text where the phrase is located and tag the phrase using BIOE encoding.

For this example we will go with the second approach since we have some manual tagging already available.

Let's say that the annotator has tagged 'Age recommendation - 6 months to 5 years' as the phrase containing the age range; then we can generate the following BIOE-encoded training data:

[('Melissa', 'O'), ('&', 'O'), ('Doug', 'O'), ('Deluxe', 'O'), ... ('Age', 'B'), ('recommendation', 'I'), ('-', 'I'), ('6', 'I'), ('months', 'I'), ('to', 'I'), ('5', 'I'), ('years', 'E')]

The annotation is done in such a way that the phrase tagged by the individual does not contain any other number apart from min and max age.

The age range filters displayed on the website have the following formats:

  • 2 - 5 years
  • 2 months - 5 years
  • 2 years & up
  • 2 - 7 months
  • 2 months & up
  • 0 - 6 months

In order to evaluate our model on test data, we need to normalize both the predicted age range and the actual age range into a common format. We have chosen (min_age, max_age) as this common format.

Given a small enough phrase or sentence containing the age range in it and no other numbers, the following code extracts the age range into the format (min_age, max_age):
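
A simplified sketch of such a function (the helpers 'get_tokens', 'to_replace' and 'word_numbers' are described below; the exact implementation details here are approximate):

import re
from nltk.stem import WordNetLemmatizer    # assumes the NLTK 'wordnet' corpus is available

lemmatizer = WordNetLemmatizer()

# punctuation to be replaced with spaces before tokenizing (simplified)
to_replace = r'[,;:/\(\)\[\]"\']'
# ages written in words, at least for 1 to 9
word_numbers = {'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5,
                'six': 6, 'seven': 7, 'eight': 8, 'nine': 9}

def get_tokens(text):
    text = re.sub(to_replace, ' ', text.lower())
    return [lemmatizer.lemmatize(tok) for tok in text.split()]

def get_age_range(phrase):
    tokens = get_tokens(phrase)
    if 'all' in tokens and 'age' in tokens:
        return (-1, 100)                                   # 'all ages' -> no specific range
    ages = []
    for idx, tok in enumerate(tokens):
        if re.match(r'^\d+(\.\d+)?\+?$', tok):
            num = float(tok.rstrip('+'))
        elif tok in word_numbers:
            num = float(word_numbers[tok])
        else:
            continue
        # look at the next couple of tokens to decide whether the number is in months or years
        period = 'year'
        for nxt in tokens[idx + 1:idx + 3]:
            if nxt.startswith('month'):
                period = 'month'
                break
            if nxt.startswith('year') or nxt.startswith('yr'):
                break
        ages.append(num / 12.0 if period == 'month' else num)
    if len(ages) == 0:
        return (-1, -1)                                    # no numbers found at all
    if len(ages) == 1:
        return (ages[0], 100)                              # only the minimum age is given
    return (min(ages), max(ages))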

For example, given the phrase 'Age recommendation - 6 months to 5 years' as input, the above function returns (0.5, 5) as output.

The function 'get_tokens' normalizes the text by replacing the pattern specified in 'to_replace' with spaces, lowercasing and lemmatizing, and then splitting the resulting string on spaces.

The logic is to extract all the numbers from the text as well as any mention of 'month' or 'year'. In many instances the age is written in words, e.g. 1 as 'one' or 2 as 'two'; to handle these (at least for 1 to 9), we have the 'word_numbers' pattern.

If there is only one number then we have only the min_age, and the max_age is set to 100 by default. If the period is 'month', we divide the number by 12, else we keep it as it is. If there are two numbers present then we have both the min and max age; all we need then is logic to assign one of the numbers as the min age and the other as the max age.

Lastly, we also need to normalize 'month' and 'year' into a common unit, i.e. years, so any age in months is converted into years by dividing by 12.

The same above method is used to normalize the predicted and the actual age ranges into the format (min_age, max_age).

There are two edge cases to handle. The first is where the sentence does not contain any numbers; in this case we should return (-1, -1). The other is where the sentence does not contain any number but contains the phrase 'all age'; here we should return (-1, 100).

Given a phrase tagged by the annotator, the following function locates the phrase in the text and does the BIOE encoding on it.
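
A minimal sketch of such a function, reusing the get_tokens helper from the sketch above:

def bioe_encode(sentence, tagged_phrase):
    sent_tokens = get_tokens(sentence)
    phrase_tokens = get_tokens(tagged_phrase)
    labels = ['O'] * len(sent_tokens)
    n = len(phrase_tokens)
    # slide over the sentence and tag every occurrence of the annotated phrase
    for i in range(len(sent_tokens) - n + 1):
        if sent_tokens[i:i + n] == phrase_tokens:
            labels[i] = 'B'
            for j in range(i + 1, i + n - 1):
                labels[j] = 'I'
            if n > 1:
                labels[i + n - 1] = 'E'
    return list(zip(sent_tokens, labels))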

For example, for the above string, the normalized and tokenized sentence would be:

['melissa', '&', 'doug', 'deluxe', 'picnic', 'basket', 'fill', 'and', 'spill', 'soft', 'baby', 'toy', '-', 'weight', '-', '7', 'oz.', 'assembly', 'required', '-', 'no', 'assembly', 'required', u'dimension', '-', '6.7', 'l', 'x', '11', 'w', 'x', '11.5', 'h', 'age', 'recommendation', '-', '6', u'month', 'to', '5', u'year']

The tokenized tagged phrase would be:

['age', 'recommendation', '-', '6', 'month', 'to', '5', 'year']

And the labels per word would be:

['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B', 'I', 'I', 'I', 'I', 'I', 'I', 'E']

Note that there could be multiple phrases in the text that contain an age range. All of them could express the same age range, similarly or differently, or they could be different age ranges due to inconsistencies between multiple sellers of the same product. For training, we will tag all such phrases in the text with the BIOE encoding, but during prediction we will use a logic that outputs only a single age range.

Next we come to the part of training the CRF. To train CRF, we need to define feature functions. For each word in the text, we define a list of potential features that could help the CRF algorithm to learn which words should be tagged as 'B', which as 'I', which as 'E' and which as 'O'.

The advantage of the CRF algorithm is that it gives the flexibility to use feature information about any word in the sentence relative to the current word; that is why the 'word2features' function takes as input the whole sentence along with the index of the current word.
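
A simplified sketch of such a feature function; the feature set here is approximate, modeled on the features that appear in the weight listings later in this post (current, previous and next words, BOS/EOS markers, and patterns such as number ranges or a trailing '+'):

import re

def word2features(sentence, i):
    word = sentence[i]
    # features of the current word; 'is_range' and 'has_plus' capture patterns
    # such as '0-12' and '8+' that frequently occur in age-range phrases
    features = [
        'curr_word=' + word,
        'curr_word.is_range=%s' % bool(re.match(r'^\d+-\d+$', word)),
        'curr_word.has_plus=%s' % word.endswith('+'),
    ]
    if i > 0:
        prev = sentence[i - 1]
        features += ['prev_word=' + prev,
                     'prev_word.has_plus=%s' % prev.endswith('+')]
    else:
        features.append('BOS')                             # beginning of sentence
    if i < len(sentence) - 1:
        nxt = sentence[i + 1]
        features += ['next_word=' + nxt,
                     'next_word.has_plus=%s' % nxt.endswith('+')]
    else:
        features.append('EOS')                             # end of sentence
    return features

def sent2features(sentence):
    return [word2features(sentence, i) for i in range(len(sentence))]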

Observe that the features are selected in such a way that they are relevant for age range. Then we can define the function to train the CRF algorithm:
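
A rough sketch of the training function using python-crfsuite (the regularization constants and model file name are as described below; the rest is approximate):

import pycrfsuite

def train_crf(X_train, Y_train, model_file='age_range.crfsuite'):
    trainer = pycrfsuite.Trainer(verbose=False)
    for xseq, yseq in zip(X_train, Y_train):
        trainer.append(xseq, yseq)                         # xseq: feature lists per word, yseq: BIOE tags
    trainer.set_params({
        'c1': 1.0,                                         # coefficient for L1 regularization
        'c2': 0.01,                                        # coefficient for L2 regularization
    })
    trainer.train(model_file)                              # 'lbfgs' is the default training algorithm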

We are saving the model in the file named 'age_range.crfsuite'. The CRF uses Elastic-Net regularization, with 'c1', the coefficient for L1 regularization, set to 1.0 and 'c2', the coefficient for L2 regularization, set to 0.01. The model is trained using the L-BFGS optimization algorithm.

In order to train and test the model, we need to split the data into training and testing sets. We use 20% of the data for testing and the remaining for training.
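
For example, with scikit-learn's train_test_split (assuming 'tagged_sentences' holds the BIOE-encoded sentences from the earlier step):

from sklearn.model_selection import train_test_split

# tagged_sentences: list of (token, label) sequences produced by the BIOE encoding step
X = [sent2features([token for token, label in sent]) for sent in tagged_sentences]
Y = [[label for token, label in sent] for sent in tagged_sentences]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
train_crf(X_train, Y_train)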

The following function is used to predict the BIOE tags for test sentences and then extract all the phrases corresponding to those tags.
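
A simplified sketch of such a function, again with python-crfsuite and the feature helpers above:

import pycrfsuite

def predict_phrases(sentence_tokens, model_file='age_range.crfsuite'):
    tagger = pycrfsuite.Tagger()
    tagger.open(model_file)
    tags = tagger.tag(sent2features(sentence_tokens))
    # collect every maximal B...I...E run as a candidate age-range phrase
    phrases, current = [], []
    for token, tag in zip(sentence_tokens, tags):
        if tag == 'B':
            current = [token]
        elif tag in ('I', 'E') and current:
            current.append(token)
            if tag == 'E':
                phrases.append(current)
                current = []
        else:
            current = []
    return phrases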

Once we extract the phrases containing the age ranges in a sentence, we need to normalize these phrases into (min_age, max_age) format. Also for each sentence we need to output only a single (min_age, max_age).

So if multiple (min_age, max_age) ranges are predicted for a sentence, the roll-up logic is to take the maximum of the min_age values and the minimum of the max_age values as the output. This ensures we consider only the intersection of all predicted age ranges.

For example, if we predict the age ranges (5, 15) and (7, 20) for the same product, we output (7, 15) as the predicted age range. If the multiple age ranges do not intersect at all, i.e. they are conflicting, then we output (-1, -1).
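
In code, this roll-up logic might look like the following sketch:

def rollup_age_ranges(ranges):
    ranges = [r for r in ranges if r != (-1, -1)]
    if not ranges:
        return (-1, -1)
    min_age = max(r[0] for r in ranges)                    # tightest lower bound
    max_age = min(r[1] for r in ranges)                    # tightest upper bound
    if min_age > max_age:
        return (-1, -1)                                    # conflicting, non-intersecting ranges
    return (min_age, max_age)

rollup_age_ranges([(5, 15), (7, 20)])                      # -> (7, 15)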

Once we have all the nuts and bolts, we train the CRF model on the training data, use it to predict the age ranges for the test data and compute the prediction accuracy. Out of around 15,400 tagged examples, 12,300 were used for training and the remaining for testing. The overall accuracy on the test set comes to around 91%.

Given this sentence as input to the tagging model:

'Melissa & Doug Deluxe Picnic Basket Fill and Spill Soft Baby Toy - Weight - 7 oz. Assembly required - no assembly required. Dimensions - 6.7" l x 11" w x 11.5" h. Age recommendation - 6 months to 5 years.'

The tagger correctly predicts the phrase that contains the age range value:

Output is:

[['age', 'recommendation', '-', '6', u'month', 'to', '5', u'year']]

We can also analyze the feature weights of the CRF model. To check which transitions are most likely and which are most unlikely:
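
With python-crfsuite, the transition weights can be read from the model dump roughly as follows (a sketch along the lines of the standard crfsuite examples):

from collections import Counter
import pycrfsuite

tagger = pycrfsuite.Tagger()
tagger.open('age_range.crfsuite')
info = tagger.info()                                       # parsed dump of the trained model

def print_transitions(trans_features):
    for (label_from, label_to), weight in trans_features:
        print('%-6s -> %-7s %f' % (label_from, label_to, weight))

print('Top likely transitions:')
print_transitions(Counter(info.transitions).most_common(15))

print('Top unlikely transitions:')
print_transitions(Counter(info.transitions).most_common()[-15:])

Note that with only four tags there are just a handful of possible transitions, which is why the two listings below end up containing the same entries.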

The output is:

Top likely transitions:
O      -> O       3.047245
I      -> E       1.471798
I      -> I       1.111391
B      -> I       0.902568
E      -> O       0.407930
B      -> E       0.403408
O      -> B       0.063766
I      -> B       -1.520888
B      -> B       -1.681932
B      -> O       -2.687051
E      -> E       -4.548171
E      -> I       -5.533213
I      -> O       -6.430414
O      -> E       -7.561383
O      -> I       -10.419938

Top unlikely transitions:
O      -> O       3.047245
I      -> E       1.471798
I      -> I       1.111391
B      -> I       0.902568
E      -> O       0.407930
B      -> E       0.403408
O      -> B       0.063766
I      -> B       -1.520888
B      -> B       -1.681932
B      -> O       -2.687051
E      -> E       -4.548171
E      -> I       -5.533213
I      -> O       -6.430414
O      -> E       -7.561383
O      -> I       -10.419938

The output is quite intuitive and self-explanatory: O->O, B->I, I->I and I->E transitions are highly likely, whereas O->I, O->E, I->O and so on are highly unlikely. The more interesting analysis is to understand which features for which tags are highly important or highly unimportant.
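
The per-state feature weights can be inspected in the same way, reusing the 'info' object and Counter import from the transition analysis above (again a sketch):

def print_state_features(state_features):
    for (attr, label), weight in state_features:
        print('%f %-6s %s' % (weight, label, attr))

print('Top positive:')
print_state_features(Counter(info.state_features).most_common(50))

print('Top negative:')
print_state_features(Counter(info.state_features).most_common()[-50:])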

The output is:

Top positive:
7.210141 B      curr_word=none
5.574125 O      BOS
4.729685 I      prev_word=not
4.687832 O      EOS
3.884477 O      prev_word=none
3.777790 I      next_word=3-year-old
3.653495 E      prev_word=all
3.554828 E      curr_word=age
3.466775 E      curr_word=up
3.431339 E      curr_word=year
3.246470 O      prev_word=-recommended
3.137251 B      curr_word=not
3.109862 E      curr_word=month
3.089459 E      curr_word=yr
3.073495 E      curr_word=old
3.040106 B      curr_word=minimum
2.840459 E      prev_word=or
2.803774 B      curr_word=mario
2.708541 O      next_word=years-
2.631536 O      prev_word=-for
2.622266 O      prev_word=not
2.542330 B      curr_word=0-12
2.533440 B      curr_word=applicable
2.516693 O      next_word=olditem
2.487639 O      next_word=encourage
2.466949 B      curr_word=recommended
2.424408 B      curr_word=suitable
2.410407 I      prev_word=it
2.403337 B      next_word=range
2.364317 I      next_word=yr
2.338016 I      next_word=plus
2.309131 E      curr_word.is_range=True
2.284358 B      curr_word=birth
2.275808 B      curr_word=all
2.187448 I      curr_word=-
2.185317 B      next_word=chocking
2.185054 E      curr_word=older
2.170236 I      next_word=old
2.116336 I      curr_word=to
2.101092 E      curr_word=above
2.077766 O      next_word=yearspackage
2.073384 B      prev_word=setting
2.062467 B      curr_word=suit
2.062067 B      curr_word=above
2.040886 E      curr_word.has_plus=True
2.028837 B      curr_word=great
2.021485 O      prev_word=iphone
2.017083 B      next_word=month
2.012065 I      next_word=month
2.003435 O      next_word=yr

Top negative:
-1.119860 E      next_word=and
-1.125889 O      prev_word=item
-1.156638 O      curr_word=suitable
-1.181094 I      prev_word=and
-1.184761 O      curr_word=older
-1.211514 O      curr_word=s
-1.238000 E      next_word=12
-1.239147 O      prev_word=scary
-1.242104 E      next_word=yearspackage
-1.252317 E      next_word=thomas
-1.267934 O      prev_word=ministry
-1.279357 O      curr_word.is_range=True
-1.280520 O      curr_word=recommend
-1.282875 O      next_word=under
-1.290126 I      next_word=suitable
-1.303816 O      next_word=caution
-1.307437 O      curr_word=8
-1.319744 O      curr_word=age
-1.323526 B      prev_word=of
-1.343475 E      next_word=to
-1.378808 E      curr_word=and
-1.407234 O      curr_word=y
-1.445527 O      prev_word=friend
-1.450950 I      prev_word.has_plus=True
-1.455522 E      prev_word=for
-1.459328 O      curr_word=mo
-1.493661 O      next_word=wooden
-1.496767 B      prev_word=6
-1.501034 O      curr_word=yr
-1.533773 O      prev_word=mode
-1.552368 E      next_word=of
-1.571740 B      prev_word=level
-1.594461 B      prev_word=for
-1.614495 B      prev_word=the
-1.675542 E      next_word=old
-1.717014 O      curr_word=over
-1.765325 E      next_word=group
-1.768317 O      prev_word=entertain
-1.847484 I      next_word=recommended
-1.891994 B      prev_word=suggest
-2.065938 O      next_word=chocking
-2.139197 O      prev_word=apply
-2.180246 O      curr_word=intended
-2.293753 B      prev_word=suggested
-2.305247 O      curr_word=above
-2.330819 O      curr_word=month
-2.397254 O      curr_word=recommended
-2.659820 E      next_word=year
-3.678541 O      curr_word.has_plus=True
-7.152727 O      curr_word=none

For the 'B' tag, the words 'none' or 'not' are important because these are negations specifying that the item is not recommended below a certain age. Similarly, the phrase 'all age' is deemed quite important for the 'E' tag. Some other features, such as the presence of 'year', 'month', 'old', 'minimum' or 'recommended', can be intuitively related to sentences containing age ranges.

In the next post, we will work with multiple attributes per product type, as in the laptop example.

Categories: MACHINE LEARNING, PROBLEM SOLVING
