
Leetcode : Longest Duplicate Substring

Problem Statement

Given a string S, find the longest substring that occurs at least twice in S (the two occurrences may overlap). If S has no duplicated substring, the answer is the empty string "".

Solution:

The brute force approach for this problem would be to scan all substrings of length L for each possible value of L from 1 to N-1. Store the substrings in a set(), and whenever we encounter a substring that is already present in the set, it is a duplicated substring of length L. Repeat this for increasing L until we find some L' for which there is no duplicate substring. The answer is then the duplicate found at length L'-1.
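For concreteness, here is a minimal sketch of this brute-force scan (the helper names below are my own, not part of the final solution):

def has_dup_brute(S, length):
    # Return a duplicated substring of the given length, or None.
    seen = set()
    for i in range(len(S)-length+1):
        sub = S[i:i+length]
        if sub in seen:
            return sub
        seen.add(sub)
    return None

def longest_dup_brute(S):
    # Increase the length until no duplicate of that length exists.
    res = ""
    for length in range(1, len(S)):
        out = has_dup_brute(S, length)
        if out is None:
            break
        res = out
    return res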

Scanning the entire string for all substrings of a fixed length L takes O(N) window operations, since we examine N-L+1 starting positions (each operation additionally costs O(L) for slicing and hashing the substring). Since we store all substrings of length L in the set, the maximum (memory) size of the set is O((N-L+1)*L). Since L = O(N), the time complexity of this approach is O(N^2) and the space complexity is also O(N^2). For N = 10^5, the time and space complexity would give "Time Limit Exceeded" and "Memory Limit Exceeded" errors respectively on the online judge.

We can reduce the time complexity by observing that if there is some length L' with no duplicate substring, then for any length L > L' we cannot find a duplicate substring either, because any duplicate of length L would contain a duplicate of length L' inside it. This monotonicity allows us to use binary search instead of linear search on the length, reducing the search time complexity to O(N*log N) window operations. The space complexity is still O(N^2). The Python code for this enhancement is as follows:

class Solution(object):
    def has_dup(self, S, length):
        # Return any duplicated substring of the given length, or None.
        encounters = set()
        
        for i in range(len(S)-length+1):
            if S[i:i+length] in encounters:
                return S[i:i+length]
            encounters.add(S[i:i+length])

        return None
                        
    def longestDupSubstring(self, S):
        # Binary search on the length of the duplicated substring.
        left, right = 1, len(S)-1
        res = ""
        
        while left <= right:
            mid = (left + right) // 2
            out = self.has_dup(S, mid)
            if out is None:
                right = mid - 1   # no duplicate this long, try shorter
            else:
                res = out         # remember it and try longer
                left = mid + 1
        
        return res

One way we can avoid storing all the substrings of length L in the set is to use some kind of hash to represent each substring. Since the hashes are of constant size, the space complexity then reduces to O(N). One such hash function is the sum of the ASCII values of the letters in the substring, but it has a very high collision rate: for example, 'abcd' and 'dcab' hash to the same value although they are not duplicates.
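A quick check of that collision (a throwaway snippet, not part of the solution):

print(sum(ord(c) for c in 'abcd'))   # 394
print(sum(ord(c) for c in 'dcab'))   # 394 -- same hash, different strings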

A better hash function should take the order of the letters in the substring into account. We can use Python's built-in hash() method, but the issue of collisions would still be there. One way to tackle this is to use the substrings themselves as a second-level filter: group the starting indices of the substrings by their hash() value, and then, within each group that has more than one index, check the actual substrings for duplicates.

Below is the Python code for the hash()-based implementation:

class Solution(object):
    def has_dup(self, S, length):
        # Group starting indices by the hash of their substring.
        encounters = dict()
        
        for i in range(len(S)-length+1):
            g = hash(S[i:i+length])
                
            if g not in encounters:
                encounters[g] = []
            encounters[g].append(i)
        
        # Second-level filter: within each hash bucket, compare the
        # actual substrings to rule out collisions.
        for g, pos_list in encounters.items():
            if len(pos_list) > 1:
                v = set()
                for i in pos_list:
                    if S[i:i+length] in v:
                        return S[i:i+length]
                    v.add(S[i:i+length])
        return None
                        
    def longestDupSubstring(self, S):
        left, right = 1, len(S)-1
        res = ""
        
        while left <= right:
            mid = (left + right) // 2
            out = self.has_dup(S, mid)
            if out is None:
                right = mid - 1
            else:
                res = out
                left = mid + 1
        
        return res

But now notice that computing hash() for a substring of length L takes O(L) time. It also has the disadvantage that the hash cannot be updated in O(1) time as the window slides to the right: for each new window we have to re-compute the hash of the entire substring. Thus we need a hash function with the following properties:

  • Very low collision rate.
  • Should take the ordering of the letters into consideration.
  • Should be able to update in O(1) time for each shift in the sliding window.

One such hash function could be as follows:

  • For a substring of length L, concatenate the ASCII values of the letters into a single number.
    • For example, if the substring is 'abcd', then the integer formed by concatenating the ASCII values is 979899100 in base 10.
    • Note that this is a number and not a string.
  • Since the resulting number can be arbitrarily large, take it modulo some large prime number p.
    • For example, if p=32416190071, then 979899100 % p = 979899100.
    • This is the hash value of the string 'abcd'.
    • The hash value never exceeds the prime p.
  • When the window slides one position to the right, we remove the contribution of the outgoing first letter (its value times the weight of the leading position), shift the remaining letters up by one power of the base, and add the value of the incoming letter, as sketched right after this list.
    • All operations are modulo the prime p.
    • The update is O(1) because it involves only a constant number of arithmetic operations.
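As a concrete illustration of the O(1) update described in the last bullet, here is a minimal sketch of a rolling-hash scan (the function name and parameter names are my own; d is the base and p the modulus):

def rolling_hashes(S, L, d, p):
    # Yield (start_index, hash) for every window of length L in S.
    h = 0
    for i in range(L):                  # hash of the first window, O(L)
        h = (h*d + ord(S[i])) % p
    yield 0, h
    top = pow(d, L-1, p)                # weight of the window's leading letter
    for i in range(1, len(S)-L+1):
        # Drop the outgoing letter, shift by one power of d, and append
        # the incoming letter; all arithmetic is modulo p.
        h = ((h - ord(S[i-1])*top)*d + ord(S[i+L-1])) % p
        yield i, h

After the O(L) setup, each window costs a constant number of operations, so scanning all windows of length L is O(N).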

More formally, let K[i] denote the ASCII value of the letter S[i] in the string S. Then

H(S[i:j]) = (K[i]*d^(j-i) + K[i+1]*d^(j-i-1) + ... + K[j]) mod p, where d is the base of the letters.

If the letters are from '0' to '9', then d=10; if the letters are from 'a' to 'z', then d=26.

H(S[i+1:j+1]) = (d*(H(S[i:j]) - K[i]*d^(j-i)) + K[j+1]) mod p

This is because when we remove the letter S[i] from the last window, we have to remove the multiplicative factor d^(j-i) associated with it. For example, to strip the leading 1 from 123456, we subtract 100000 = 10^5 from 123456.

This rolling-hash technique is the basis of the Rabin-Karp substring search algorithm.

We can pre-compute the hashes of all the prefixes H(S[0:i]) as follows:

d, q = len(set(S)), 32416190071   # base and modulus (q plays the role of the prime p above)

hashes = [0]*len(S)
for i in range(len(S)):
    if i == 0:
        hashes[0] = ord(S[0])
    else:
        hashes[i] = (d*hashes[i-1] + ord(S[i]))%q
# hashes[i] is the hash H(S[0:i]) of the prefix ending at index i (inclusive)

Here we have taken 'd' to be the number of unique letters in the string, and the 'ord' method returns the ASCII value of a letter.

To compute the hash H(S[i:j]) using the pre-computed prefix hashes:

H(S[i:j]) = (H(S[0:j]) - H(S[0:i-1])*d^(j-i+1)) mod p

The prefix S[0:i-1] is j-i+1 letters shorter than S[0:j], so its hash must be scaled by d^(j-i+1) before the subtraction.
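As a quick sanity check of this identity, one can compare it against a direct recomputation (throwaway helpers of my own, using the 'hashes', 'd' and 'q' from the snippet above):

def substring_hash(hashes, i, j, d, q):
    # Hash of S[i..j] (inclusive), recovered from the prefix hashes.
    if i == 0:
        return hashes[j]
    return (hashes[j] - hashes[i-1]*pow(d, j-i+1, q)) % q

def direct_hash(S, i, j, d, q):
    # The same hash computed from scratch, for comparison.
    h = 0
    for k in range(i, j+1):
        h = (h*d + ord(S[k])) % q
    return h

# For any 0 <= i <= j < len(S):
# substring_hash(hashes, i, j, d, q) == direct_hash(S, i, j, d, q)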

The updated Python code is then as follows:

class Solution(object):
    def has_dup(self, S, length, hashes, d, q):
        encounters = dict()
        h = 1
        
        # h = d^length mod q, the scaling factor for the prefix hash.
        for i in range(length): 
            h = (h*d)%q
        
        for i in range(len(S)-length+1):
            # Hash of S[i:i+length] from the prefix hashes. Python's %
            # already returns a non-negative value for a positive modulus.
            g = (hashes[i+length-1]-hashes[i-1]*h)%q if i > 0 else hashes[i+length-1]
                
            if g not in encounters:
                encounters[g] = []
            encounters[g].append(i)
        
        # Second-level filter: exact comparison within each hash bucket.
        for g, pos_list in encounters.items():
            if len(pos_list) > 1:
                v = set()
                for i in pos_list:
                    if S[i:i+length] in v:
                        return S[i:i+length]
                    v.add(S[i:i+length])
        return None
                        
    def longestDupSubstring(self, S):
        # With a single distinct letter d=1, the hash degenerates, but the
        # answer is simply the whole string minus its last character.
        if len(set(S)) == 1:
            return S[:len(S)-1]
        
        d, q = len(set(S)), 32416190071
        
        # Pre-compute prefix hashes: hashes[i] = hash of S[0..i].
        hashes = [0]*len(S)
        for i in range(len(S)):
            if i == 0:
                hashes[0] = ord(S[0])
            else:
                hashes[i] = (d*hashes[i-1] + ord(S[i]))%q
        
        left, right = 1, len(S)-1
        res = ""
        
        while left <= right:
            mid = (left + right) // 2
            out = self.has_dup(S, mid, hashes, d, q)
            if out is None:
                right = mid - 1
            else:
                res = out
                left = mid + 1
        
        return res

The higher the collision rate of the hash function, the more time is spent on the exact substring matching. The collision rate of the above hash function depends on the prime p: if p is very large, the collision rate is very low, because the probability of two different substrings hashing to the same value drops as the range of possible hash values, 0 to p-1, grows.

When the collision rate is very low, the space used stays close to O(N), since almost every hash bucket holds a single position. In fact, using p = 2^63 - 1 (a popular modulus, although this particular value is not actually prime), we did not need the exact substring-search step at all, and the running time dropped from 3800 ms to 1470 ms on the online judge. But there is no guarantee that all future test cases will pass without the exact substring check.

class Solution(object):
    def has_dup(self, S, length, hashes, d, q):
        # With a very large modulus we trust the hash and skip the exact
        # substring comparison entirely.
        encounters = set()
        h = 1
        
        # h = d^length mod q, the scaling factor for the prefix hash.
        for i in range(length): 
            h = (h*d)%q
        
        for i in range(len(S)-length+1):
            g = (hashes[i+length-1]-hashes[i-1]*h)%q if i > 0 else hashes[i+length-1]
                
            if g in encounters:
                return S[i:i+length]
            encounters.add(g)

        return None
                        
    def longestDupSubstring(self, S):
        if len(set(S)) == 1:
            return S[:len(S)-1]
        
        d, q = len(set(S)), 2**63-1
        
        hashes = [0]*len(S)
        for i in range(len(S)):
            if i == 0:
                hashes[0] = ord(S[0])
            else:
                hashes[i] = (d*hashes[i-1] + ord(S[i]))%q
        
        left, right = 1, len(S)-1
        res = ""
        
        while left <= right:
            mid = (left + right) // 2
            out = self.has_dup(S, mid, hashes, d, q)
            if out is None:
                right = mid - 1
            else:
                res = out
                left = mid + 1
        
        return res
