Stokastik

Machine Learning, AI and Programming

LeetCode : Unique Letters

Problem Statement

Solution :

Let's try to build a 'bad' solution first. By 'bad', I mean the approach may not be the most optimal but will return correct results every time. One such approach is to list down all possible substrings and count the unique letters in each of them and then take their sum. This approach is perfectly reasonable approach but why it is not optimal ?

For e.g. let our string S be represented as a n-tuple of letters :

[x0, x1, ..., xn-1], where each xi belongs to one of the letters from A to Z.

Then the number of possible sub-strings is = 0.5*n*(n-1) = O(n2), because the number of sub-strings of length 1 is 'n', the number of sub-strings of length 2 = n-1 and so on. Thus :

n + (n-1) + (n-2) + ... + 1 = 0.5*n*(n-1) = O(n2)

For each such substring, we count the number of unique letters. For a string of length 'd', the complexity of counting unique letters is O(d). Thus the total time complexity for all possible sub-strings is given by the sum :

S = 1*n + 2*(n-1) + 3*(n-2) + ... + n*1

The maximum value in the above summation, occurs for sub-strings of length n/2, which is (n/2)2 = O(n2).

Thus S <= n*(n/2)2 = O(n3).

Thus the above approach has an asymptotic run-time complexity of O(n3). Too bad for such a reasonable solution !!!

Can we do better than that ? Where can we optimize ? Can we somehow use dynamic programming technique to compute number of unique letters for each sub-string, so that each computation takes a constant O(1) time complexity instead of O(d) as seen earlier ?

Let us denote a sub-string 'xixi+1...xj' as [i, j].

Note that once we have computed the value of UNIQ([i, j]) for the sub-string [i, j], then in order to compute the unique letters for the sub-string [i, j + 1], we can use the fact that on addition of the extra letter 'xj+1' to the sub-string [i, j], the value of UNIQ([i, j]) can either increase, decrease or remain the same. i.e.

UNIQ([i, j + 1]) = UNIQ([i, j]) + 1, if the letter 'xj+1' is never seen in the sub-string [i, j]

UNIQ([i, j + 1]) = UNIQ([i, j]) - 1, if the letter 'xj+1' is seen once in the sub-string [i, j]

UNIQ([i, j + 1]) = UNIQ([i, j]) if the letter 'xj+1' is seen more than once in the sub-string [i, j], because they already cancel out.

Let us define an array H[i], which returns the last seen position of the character at index 'i'. If the character is not seen before index 'i', then H[i] = -1. Let's take an example.

Consider the sub-string 'ABCDBCB', then H[0]=H[1]=H[2]=H[3]=-1 because the first 4 letters 'ABCD' are unique and thus they do not occur anywhere before. But H[4]=1, because the letter at index 4, which is 'B' was also present at index 1, similarly H[5]=2 and H[6]=4, because last seen position of letter 'B' was at index 4 in the string. Using this data structure, we can determine UNIQ([i, j + 1]) from UNIQ([i, j]) in constant time.

i.e. if H[j+1] < i, then the letter 'xj+1' is never seen in the sub-string [i, j], and thus UNIQ([i, j + 1]) = UNIQ([i, j]) + 1

But if H[j+1] >= i, there are two possibilities, one that the letter 'xj+1' is seen only once in the sub-string [i, j] or it is seen more than once. To verify whether it is seen more than once in the sub-string [i, j], just check whether H[H[j+1]] < i or not.

If H[j+1] >= i and H[H[j+1]] < i, then the letter 'xj+1' is seen only once in the sub-string [i, j], and thus UNIQ([i, j + 1]) = UNIQ([i, j]) - 1, else UNIQ([i, j + 1]) = UNIQ([i, j])

Now for each sub-string, the time complexity for UNIQ is O(1), achieved through dynamic programming. Thus the total time complexity is :

S = n + (n-1) + (n-2) + ... + 1 = O(n2), instead of O(n3) by the earlier approach.

Can we still do better ? Let's try.

Observe that we can break down the summation of the unique letters for all sub-strings as the summation of unique letters for all sub-strings ending with each of the letters, i.e. assuming that our string is 'x0x1x2', then the summation of unique letters for all sub-strings is given by :

Q = UNIQ(x0) + UNIQ(x1) + UNIQ(x2) + UNIQ(x0x1) + UNIQ(x1x2) + UNIQ(x0x1x2)

Let G(xi) be the summation of unique letters for sub-strings ending with the letter at the i-th position :

G(x0) = UNIQ(x0)

G(x1) = UNIQ(x1) + UNIQ(x0x1)

G(x2) = UNIQ(x2) + UNIQ(x1x2) + UNIQ(x0x1x2)

Then we can write Q = G(x0) + G(x1) + G(x2).

Observe that each G(xi) is obtained by adding 'xi' at the end of each of the sub-strings in G(xi-1) and then taking the UNIQ function over each, plus a fixed term UNIQ(xi) which is anyway equals to 1.

Let's assume that we have computed all G's till G(xi-1) and we need to compute G(xi). Now if 'xi' has never been seen in [0, i - 1], then addition of 'xi' to each of the sub-strings in G(xi-1), i.e.

[0, i - 1], [1, i - 1], [2, i - 1], ..., [i - 1, i - 1]

will contribute an additional factor of 1 for each of the terms, i.e. UNIQ([0, i]) = UNIQ([0, i - 1]) + 1, UNIQ([1, i]) = UNIQ([1, i - 1]) + 1, and so on.

Thus in such a case : G(xi) = G(xi-1) + i + 1, because there are i terms in G(xi-1) and +1 due to UNIQ(xi).

What if 'xi' is already present in [0, i - 1] ? Then as seen earlier, there could be two scenarios :

  1. 'xi' is seen only once in [0, i - 1]
  2. 'xi' is seen more than once in [0, i - 1]

The first case is valid if H[i] >=0 and H[H[i]] = -1 and the second case is valid if both H[i] >=0 and H[H[i]] >= 0.

Assuming the first case is true, let p = H[i], then addition of 'xi' to each of the sub-strings :

[p + 1, i - 1], [p + 2, i - 1], [p + 3, i - 1], ..., [i - 1, i - 1]

contributes an addition of +1 similar to before because 'xi' is never seen after the index 'p' according to the definition of H. The number of such terms is (i - p - 1).

For the remaining terms, i.e.

[0, i - 1], [1, i - 1], [2, i - 1], ..., [p, i - 1]

Addition of 'xi', will decrease the number of unique letters by 1, because now 'xi' is not unique anymore. The number of such terms is (p + 1).

Thus in such a case :

G(xi) = G(xi-1) + (i - p - 1)*1 + (p + 1)*(-1) + 1 = G(xi-1) + i - 2p - 1

Assuming the second case is true, i.e. both H[i] >=0 and H[H[i]] >= 0, let :

p = H[i] and q = H[H[i]] = H[p], thus quite obviously q < p.

In such a case, the terms contributing +1 remains the same as the above case and the number of such terms is (i - p - 1). For the terms contributing (-1), now instead of all substrings starting from 0th index, we will have only the sub-strings starting from the (q+1)th index, i.e. The terms contributing (-1) are :

[q + 1, i - 1], [q + 2, i - 1], [q + 3, i - 1], ..., [p, i - 1]

The number of such terms is (p - q).

Now, for the remaining terms, i.e.

[0, i - 1], [1, i - 1], [2, i - 1], ..., [q, i - 1]

They do not contribute anything at all on addition of 'xi', because 'xi' is already a non-unique in [0, i - 1]. The number of such terms is (q + 1). Thus we can write the expression for G(xi) as :

G(xi) = G(xi-1) + (i - p - 1)*1 + (p - q)*(-1) + (q + 1)*0 + 1 = G(xi-1) + i - 2p + q

Thus now instead of enumerating all possible sub-strings, we can use dynamic programming to derive G(xi) from G(xi-1) in constant time O(1) and then we can compute UNIQ([0, n - 1]) in linear time as :

UNIQ([0, n - 1]) = G(x0) + G(x1) + G(x2) + ... + G(xn-1).

Although the logic described above looks a bit complicated (especially given that we are doing two levels of dynamic programming), but the Python code turns out to be surprisingly easy.

import collections

class Solution(object):
    def uniqueLetterString(self, S):
        if len(S) == 0:
            return 0

        chr_pos_before, last_char_pos = [-1]*len(S), collections.defaultdict(int)

        for idx in range(len(S)):
            char = S[idx]

            if char in last_char_pos:
                chr_pos_before[idx] = last_char_pos[char]
                last_char_pos[char] = idx
            else:
                last_char_pos[char] = idx

        unique, res = [0]*len(S), 0

        for end in range(len(S)):
            if end == 0:
                unique[end] = 1
            else:
                count = unique[end - 1]

                last_pos = chr_pos_before[end]

                if last_pos == -1:
                    count += end

                else:
                    if chr_pos_before[last_pos] >= 0:
                        last_last_pos = chr_pos_before[last_pos]
                        count += end - 2 * last_pos + last_last_pos
                    else:
                        count += end - 2 * last_pos - 1

                unique[end] = count

            res += unique[end]

        return res

The run-time complexity of the above code is O(n), where n is the length of the original string.

Categories: PROBLEM SOLVING

Tags: , ,