Stokastik

Machine Learning, AI and Programming

Dynamic Programming in NLP - Longest Common Subsequence

In this second part of my series on Dynamic Programming in NLP, I will show how to solve the Longest Common Subsequence problem using DP, and then use modified versions of the algorithm to compute the similarity between two strings. LCS is a common programming question asked in many technical interviews. Given two strings (sequences of words, characters etc.) S1 and S2, return the length of the longest sequence of words common to both strings, appearing in the same relative order as in the originals.

For example, if

S1 = "can i transfer my wallet balance to my bank account"

S2 = "can i transfer my bank balance to my wallet"

Then the longest set of common words in a sequence (or longest common subsequence) is "can i transfer my balance to my". Although the words "wallet" and "bank" are common to both strings, they cannot fit into the longest common subsequence.

Finding the length of the longest common subsequence between two strings is not a difficult programming exercise, given that we already know the DP technique is useful here. As with all DP problems, the question is: what do we memoize or cache in this problem that can be re-used across different calculations?

Let H[i][j] be the length of the longest common subsequence between the prefixes S1[:i] and S2[:j], i.e. the first 'i' words of S1 and the first 'j' words of S2. Then we can define the following recursive relation:

H[p][q] = \begin{cases} H[p-1][q-1] + 1 & \text{if } S_1[p] = S_2[q] \\ \max(H[p-1][q],\, H[p][q-1]) & \text{otherwise} \end{cases}

with the base case H[p][q] = 0 whenever p = 0 or q = 0, i.e. when either prefix is empty.

Imagine that we have a binary matrix G of size NxM, where N is the number of words in string S1 and M is the number of words in string S2. Then by definition, G[i][j] = 1 if S1[i] = S2[j] else it is 0.

Starting at the bottom right corner (N-1, M-1), from each point (i, j) in the above matrix, we can go in three possible directions, left (i, j-1), or up (i-1, j) or the diagonal between them (i-1, j-1) to reach (0,0). We will obtain the same results if we reverse the direction, i.e. traverse from (0,0) to (N-1, M-1).

The LCS problem is the same as finding the maximum number of 1's that we can collect along any path that follows the above rule of traversal. The 1's collected along such a best path correspond to the words of the longest common subsequence, and their count gives us its length.
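
As a small illustration (not part of the original derivation), a minimal sketch that builds the matrix G for the two example strings:

s1 = "can i transfer my wallet balance to my bank account"
s2 = "can i transfer my bank balance to my wallet"
tokens1, tokens2 = s1.split(), s2.split()
# G[i][j] = 1 when the i-th word of s1 equals the j-th word of s2
G = [[int(w1 == w2) for w2 in tokens2] for w1 in tokens1]
for row in G:
    print(row)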

The Python program to do the same:

import collections

def lcs(s1, s2):
    tokens1, tokens2 = s1.split(), s2.split()
    # cache[i][j] holds the LCS length of the first i+1 words of s1
    # and the first j+1 words of s2 (index -1 stands for the empty prefix)
    cache = collections.defaultdict(dict)
    for i in range(-1, len(tokens1)):
        for j in range(-1, len(tokens2)):
            if i == -1 or j == -1:
                # an empty prefix has an empty LCS
                cache[i][j] = 0
            else:
                if tokens1[i] == tokens2[j]:
                    cache[i][j] = cache[i - 1][j - 1] + 1
                else:
                    cache[i][j] = max(cache[i - 1][j], cache[i][j - 1])
    return cache[len(tokens1) - 1][len(tokens2) - 1]

s1 = "can i transfer my wallet balance to my bank account"
s2 = "can i transfer my bank balance to my wallet"
print(lcs(s1, s2))
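
For the example strings above, this prints 7, the number of words in the longest common subsequence "can i transfer my balance to my".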

The LCS measure can be used as a similarity metric between strings in various NLP tasks. For example, in Question Answering, a new unseen question can be matched against existing questions, and the answer to the best-matching question can be returned as a probable answer.

Or in spelling correction, a probably misspelled word is looked up against a dictionary, and based on the LCS similarity, the best-matching correct word is chosen.
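
For word-level strings the lcs function above applies directly; for spelling correction one would run the same DP over characters instead. A minimal character-level sketch (the dictionary list and the misspelled word below are hypothetical, for illustration only):

def lcs_chars(w1, w2):
    # same DP recurrence as above, applied to characters instead of words
    n, m = len(w1), len(w2)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if w1[i - 1] == w2[j - 1]:
                H[i][j] = H[i - 1][j - 1] + 1
            else:
                H[i][j] = max(H[i - 1][j], H[i][j - 1])
    return H[n][m]

dictionary = ["receive", "recipe", "believe"]
word = "recieve"
# pick the dictionary word with the highest normalized LCS similarity
best = max(dictionary, key=lambda w: lcs_chars(word, w) / max(len(word), len(w)))
print(best)  # "receive"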

One can use the following metric to normalize the similarity score between two strings:

\text{Similarity}(S_1, S_2) = \frac{\text{LCS}(S_1, S_2)}{\max(n, m)}, where n, m are the lengths (in words) of the strings S1 and S2 respectively.
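
A minimal sketch of this normalization, assuming the lcs function defined above (the name lcs_similarity is my own):

def lcs_similarity(s1, s2):
    # normalize the LCS length by the word count of the longer string
    n, m = len(s1.split()), len(s2.split())
    return lcs(s1, s2) / max(n, m)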

Observe that, instead of just using the number of common words between two strings as a similarity measure, the LCS measure also takes word (or character) order into account, which a plain intersection of words cannot. For example, if I have three strings:

s1 = "How can i go from Bangalore to Chennai ?" ,   s2 = "How can i go from Chennai to Bangalore ?"

s3 = "How do i travel from Bangalore to Chennai ?"

Now if we use the number of common words divided by max(n, m) as a metric, then s3 would be equally similar to s1 and s2, i.e. 6/8 = 0.75. But if we use the LCS metric, then similarity(s3, s1) = 0.75, while similarity(s3, s2) = 4/8 = 0.5, which makes sense since the strings s3 and s1 are semantically more similar than s3 and s2.
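
Using the lcs_similarity sketch from above, and stripping the trailing "?" (which is not counted as a word here), we can reproduce these numbers:

s1 = "How can i go from Bangalore to Chennai"
s2 = "How can i go from Chennai to Bangalore"
s3 = "How do i travel from Bangalore to Chennai"
print(lcs_similarity(s3, s1))  # 0.75
print(lcs_similarity(s3, s2))  # 0.5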

We can enhance the LCS similarity measure by taking the contiguity of the longest common subsequence into account. What this means in simple terms is that, given three strings s1, s2 and s3, if the words in the LCS of (s3, s2) are "closer" to each other than the words in the LCS of (s3, s1), then s2 is probably a better match than s1.

For example, if:

s1 = "If I buy a HP laptop, can i get it in the price range of 30K to 35K" ,

s2 = "Which laptop is the best in the price range 30K to 35K"

s3 = "I wanted to buy a laptop, which is the best one in the price range 30K to 35K INR"

Both s1 and s2 have an LCS of length 11 words with s3, but the LCS of s2 and s3 is much "closer" (in terms of word positions) than the LCS of s1 and s3.

In order to quantify the "contiguousness" of the LCS, extract the LCS and the positions of its words within the strings, and then count the lengths of the contiguous sub-strings. For example, the contiguous sub-strings in s1 are:

["I buy a",  "laptop",  "in the price range",  "30K to 35K"]

Similarly, the contiguous sub-strings in s2 are :

["Which",  "is the best in the price range 30K to 35K"]

The lengths of the contiguous sub-strings are, for s1: [3, 1, 4, 3], and for s2: [1, 10].

One way to compute a contiguousness metric is the formula:

\sum_{i=1}^{k} w_i \log(w_i)

where w_i is the length of the i-th contiguous sub-string, k is the number of contiguous sub-strings, and log is the natural logarithm.

Thus for s1, we have, Contiguousness(s1) = 3*log(3) + 1*log(1) + 4*log(4) + 3*log(3) = 12.137

For s2, Contiguousness(s2) = 1*log(1) + 10*log(10) = 23.026

To compute the above "contiguousness" metric, one needs to compute the longest common subsequences themselves and not only the length of the LCS (as the code above does). Since there can be multiple LCS's between two strings S1 and S2, we compute the maximum "contiguousness" score among all LCS's.

Below is the Python function that returns all possible longest common subsequences between two strings.

import collections

def lcs(str1, str2):
    # cached[i][j] holds the list of all LCS's, each stored as a list of
    # (word, position-in-str2) tuples, for the prefixes str1[:i+1] and str2[:j+1]
    cached = collections.defaultdict(dict)
    tokens1, tokens2 = str1.split(), str2.split()
    for i in range(-1, len(tokens1)):
        for j in range(-1, len(tokens2)):
            if i == -1 or j == -1:
                # an empty prefix has a single, empty subsequence
                cached[i][j] = [[]]
            else:
                if tokens1[i] == tokens2[j]:
                    # extend every LCS of the smaller prefixes with the matched word
                    out = [x + [(tokens1[i], j)] for x in cached[i - 1][j - 1]]
                else:
                    a, b = cached[i - 1][j], cached[i][j - 1]
                    if len(a[0]) == len(b[0]):
                        # equal lengths: keep subsequences from both sides, without duplicates
                        out = a + [x for x in b if x not in a]
                    else:
                        out = a if len(a[0]) > len(b[0]) else b
                cached[i][j] = out
    return cached[len(tokens1) - 1][len(tokens2) - 1]
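
Calling this on the example strings from earlier should return the single LCS as a list of (word, position-in-S2) tuples, along the lines of:

s1 = "can i transfer my wallet balance to my bank account"
s2 = "can i transfer my bank balance to my wallet"
print(lcs(s1, s2))
# [[('can', 0), ('i', 1), ('transfer', 2), ('my', 3), ('balance', 5), ('to', 6), ('my', 7)]]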

Then, using these LCS's, we can write a function that computes the contiguousness of each LCS returned above and finally returns the maximum contiguousness score.

import math

def seq_contiguity(subseqs):
    score = 0
    for subseq in subseqs:
        # positions of the LCS words within the second string
        seq = [y for x, y in subseq]
        val, ent = 1, 0.0
        for idx in range(1, len(seq)):
            if seq[idx] == seq[idx - 1] + 1:
                # the run of contiguous positions continues
                val += 1
            else:
                # the run ends: add its contribution w*log(w)
                ent += val * math.log(val)
                val = 1
        ent += val * math.log(val)
        score = max(score, ent)
    return score
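
Note that seq_contiguity measures contiguity using word positions in the second argument of lcs. A hypothetical usage on the laptop strings above (assuming s1, s2, s3 hold those strings, lower-cased and punctuation-stripped so that e.g. "laptop," matches "laptop"):

print(seq_contiguity(lcs(s3, s1)))  # smaller score: the LCS words are scattered across s1
print(seq_contiguity(lcs(s3, s2)))  # larger score: the LCS is nearly contiguous in s2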

This was just one possible enhancement to LCS.

In the next post, we shall look at Tree Kernels, which are similar in concept to LCS. Tree kernels are a technique for finding the similarity between two strings and have been widely used in question-answer similarity matching.

Categories: AI, MACHINE LEARNING, PROBLEM SOLVING
