Stokastik

Machine Learning, AI and Programming

LeetCode : Concatenated Words

Problem Statement

Given a list of words, find all the concatenated words in the list, i.e. all words that can be formed entirely by concatenating at least two shorter words (repetitions allowed) from the same list.

Solution :

This is an interesting problem, not because it is difficult or tricky to find an efficient solution, but because there are multiple approaches, and each approach depends on the problem definition and the test cases. Identifying words where the space between them has been mistakenly omitted is a classic problem in text processing, since this is one way by which search engines such as Google provide "Did You Mean ?" results.

Let's begin with an approach we have seen earlier while exploring spelling correction techniques. The idea is pretty simple. Given a string S of length m, represented as :

S = [x0, x1, x2, ..., xm-1], where xi is the character at position 'i' in the string.

Let us denote the sub-string from position i to position j as [i, j]. Then H([i, j]) = 1, if either the sub-string [i, j] is present in the valid set of words Q (our input to this problem) or it can be broken down into two or more words from the set Q. Checking whether [i, j] is present in Q is easy, but to check whether [i, j] can be broken down into two or more valid words from the set Q, we can iterate through all possible k, such that :

H([i, j]) = 1 iff H([i, k]) = 1 and H([k + 1, j]) = 1 for some k such that i ≤ k < j.

We can use dynamic programming to compute the values of H. The solutions for a given set of words Q are those words for which H([0, m-1]) = 1, but given the constraint of the problem that the string S must be composed of at least two words from the set Q, there must additionally be an index k such that :

H([0, k]) = 1 and H([k+1, m-1]) = 1

Assuming that the average length of a word is 'd', the run-time complexity of computing H for a single word is O(d³). Thus the total run-time complexity for n words is O(nd³). Can we do better than this ?
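To make the recurrence concrete, here is a minimal sketch of this dynamic programming approach in Python, assuming the set of valid words is available as a Python set Q (the function and variable names are illustrative, not part of the original solution) :

def can_segment(S, Q):
    m = len(S)
    # H[i][j] is True when the sub-string S[i..j] is a word in Q or can be
    # split into words from Q.
    H = [[False] * m for _ in range(m)]

    # Consider sub-strings in increasing order of length.
    for length in range(1, m + 1):
        for i in range(0, m - length + 1):
            j = i + length - 1
            if S[i:j + 1] in Q:
                H[i][j] = True
                continue
            # Try every split point k between i and j.
            for k in range(i, j):
                if H[i][k] and H[k + 1][j]:
                    H[i][j] = True
                    break

    # S is a concatenated word only if it splits into at least two words from Q.
    return any(H[0][k] and H[k + 1][m - 1] for k in range(0, m - 1))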

Note that if we were required to identify all possible segmentations of the string into valid words from the set Q (as was required in our spelling correction problem), then this method would be appropriate. But given that we only need to decide whether the string is composed of two or more words from the set Q, the definition changes a bit, and with that we can alter our approach to reduce the time complexity.

One way is to solve the problem recursively. Let's say that for an index k, the substring [0, k] is present in Q; then we recursively check whether the suffix [k+1, m-1] is present in Q or can itself be broken down into valid words.

i.e. define a recursive function F(s), which returns 1 if the suffix [s, m-1] can be segmented into one or more words from Q, then :

F(s) = 1 iff [s, s + k - 1] is in Q and F(s + k) = 1, for some k with 1 ≤ k ≤ m - s, with the base case F(m) = 1.

Then the string is a valid concatenated string if there is some k, 1 ≤ k < m, such that [0, k-1] is in Q and F(k) = 1, i.e. the segmentation uses at least two words. We can compute F using a depth first search (DFS).

For each possible starting position 's' in a string of length d, we need to check at most 'd - s' positions for a valid word from the set Q, and there are at most O(d) starting positions in the string. Thus, with memoization over the starting positions, the time complexity of the DFS for each string is O(d²), and the total run-time complexity is O(nd²).
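Below is a rough sketch of this DFS with memoization, again assuming the valid words are in a Python set Q (the names here are illustrative only) :

def is_concatenation_of_words(S, Q):
    m = len(S)
    memo = {}

    # F(s) is True when the suffix S[s:] can be segmented into one or more
    # words from Q.
    def F(s):
        if s == m:
            return True
        if s in memo:
            return memo[s]
        result = False
        for k in range(s + 1, m + 1):
            if S[s:k] in Q and F(k):
                result = True
                break
        memo[s] = result
        return result

    # Require the first word to be a proper prefix of S, so that the
    # segmentation uses at least two words.
    return any(S[:k] in Q and F(k) for k in range(1, m))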

Can we reduce the time complexity of search if we had used a Trie data structure instead ?

Let's assume that we have constructed a prefix-tree out of all the words in the set Q. A prefix-tree is constructed in a way such that each node of the tree is associated with a character X such that X is the first character for all sub-strings in the sub-tree rooted at this node. Let's create a prefix-tree out of the following words :

['CAT', 'DOG', 'CATTLE', 'CATERPILLAR', 'DONKEY', 'DONALD', 'FOX']

A Prefix Tree

Each word in the list can be obtained by following one of the paths from the root and concatenating the characters along that path. The green nodes represent leaf nodes, which implies that a valid word ends at that node. Note that each node object holding one of the characters takes up some memory in RAM, and with many strings and very few common prefixes, the total memory used can be large.

We can compress the above prefix tree by merging all nodes with a single child into a single node with the value for that node equal to the concatenation of the values for each of the nodes merged. Observe that the prefix tree property still holds good after this merging step.

A Compressed Prefix Tree

The prefix tree can be built efficiently by using a hash-map that maps from a key for a particular node to the node pointer object in the tree, so that we do not have to iterate over the nodes to find a node with a value that has a matching prefix with the string being inserted. The keys for the hash-map are the prefixes of the strings already in the tree.

Following is a Python class to construct a Prefix Tree :

import collections

class PrefixTree(object):
    def __init__(self, val):
        self.val = val
        self.children = []

    def add_word_to_tree(self, root, word, node_map):
        # Walk down the tree one character at a time, re-using existing nodes
        # (looked up through node_map) and creating new nodes where needed.
        for pos in range(1, len(word) + 1):
            prefix = word[:pos]
            if prefix in node_map:
                root = node_map[prefix]

            else:
                node = PrefixTree(word[pos - 1])
                root.children.append(node)
                node_map[prefix] = node
                root = node

        # A '~' node marks the end of a valid word.
        node = PrefixTree('~')
        root.children.append(node)
        node_map[word + '~'] = node


    def compress_tree(self, root):
        # BFS over the tree, merging every node that has a single non-terminal
        # child with that child, so chains of single-child nodes collapse into one node.
        queue = collections.deque([root])

        while len(queue) > 0:
            node = queue.popleft()

            if len(node.children) == 1 and node.children[0].val != '~':
                node.val += node.children[0].val

                new_children = []
                for child in node.children:
                    new_children += child.children

                node.children = new_children
                # Re-examine the merged node, since it may again have a single child.
                queue.append(node)

            else:
                for child in node.children:
                    queue.append(child)

With 'n' strings of an average length of 'd', the tree construction complexity is O(nd).
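As a quick illustration, the class above could be used to build and compress the tree for the example word list as follows (this snippet assumes the PrefixTree class and the import collections statement defined above) :

words = ['CAT', 'DOG', 'CATTLE', 'CATERPILLAR', 'DONKEY', 'DONALD', 'FOX']

node_map = {}
tree = PrefixTree('')

for word in words:
    tree.add_word_to_tree(tree, word, node_map)

tree.compress_tree(tree)

# After compression the single-child chain C -> A -> T collapses into one node,
# so the node reached through the key 'C' now holds the value 'CAT'.
print(node_map['C'].val)   # CAT
print(node_map['D'].val)   # DO, since DOG, DONKEY and DONALD share only 'DO'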

We can do a search over the prefix-tree to check if a word is composed of two or more words in Q or not in the following way :

For example, if the word is "cattledonald", then we check whether there is a node present in the tree keyed by the first letter of the word. If no such node exists, the word cannot be present in the tree. Otherwise we move to the node with the matching prefix. In this case the character 'c' is present and takes us to the node with value 'cat' in the compressed tree.

After we find that the prefix "cat" matches, we go into the sub-tree rooted at the node "cat", and the remaining string to be matched becomes "tledonald". Since the prefix "cat" is itself a valid word, we also start a fresh search from the root of the tree for the suffix "tledonald".

But we have not yet finished exploring the sub-tree rooted at "cat". We find that the node "tle" matches with a prefix for "tledonald" and so we also search the sub-tree rooted at the node "tle", but with the sub-string "donald". We recursively parse the prefix-tree in this manner.

Following is the Python code for recursively searching the Prefix Tree with concatenated words :

def is_concatenated(self, word, original_word, prefix_tree, word_set):
    # The remaining suffix is itself a valid word (different from the word we
    # started with), so the original word is a valid concatenation.
    if word in word_set and word != original_word:
        return True

    # No word in the tree starts with this character.
    if word[0] not in prefix_tree:
        return False

    else:
        out, pref = False, word[0]
        prefix = ''

        while pref in prefix_tree:
            # Extend the matched prefix by the value of the (compressed) node.
            prefix += prefix_tree[pref].val
            suffix = word[len(prefix):]

            if len(suffix) == 0:
                break

            # If the matched prefix is a valid word, try to segment the suffix too.
            if prefix in word_set and word[:len(prefix)] == prefix:
                out = out or self.is_concatenated(suffix, original_word, prefix_tree, word_set)

            # Move to the child node keyed by the next character of the word.
            pref = prefix + suffix[0]

        return out

The time complexity of searching for concatenated words given a string of length d is O(d). Thus the total time complexity of searching all n words is O(nd).

The entire Python code is as follows (it relies on the PrefixTree class and the import collections statement defined earlier) :

class Solution(object):
    def is_concatenated(self, word, original_word, prefix_tree, word_set):
        if word in word_set and word != original_word:
            return True

        if word[0] not in prefix_tree:
            return False

        else:
            out, pref = False, word[0]
            prefix = ''

            while pref in prefix_tree:
                prefix += prefix_tree[pref].val
                suffix = word[len(prefix):]

                if len(suffix) == 0:
                    break

                if prefix in word_set and word[:len(prefix)] == prefix:
                    out = out or self.is_concatenated(suffix, original_word, prefix_tree, word_set)

                pref = prefix + suffix[0]

            return out

    def findAllConcatenatedWordsInADict(self, words):
        words = [word for word in words if len(word) > 0]
        word_set = set(words)

        # First pass : words that are a concatenation of exactly two valid words.
        temp = []
        for word in words:
            for pos in range(1, len(word)):
                if word[:pos] in word_set and word[pos:] in word_set:
                    temp.append(word)
                    break

        # Remaining words, for which the prefix tree is needed.
        words_set = word_set.difference(set(temp))
        out = []

        if len(words_set) > 0:
            node_map = {}
            tree = PrefixTree('')

            for word in words_set:
                tree.add_word_to_tree(tree, word, node_map)

            tree.compress_tree(tree)

            # Second pass : words that are a concatenation of more than two valid words.
            for word in words_set:
                if self.is_concatenated(word, word, node_map, word_set):
                    out.append(word)

        return out + temp
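As a hypothetical usage example (the input list below is only for illustration), the solution above can be run as :

solver = Solution()

words = ["cat", "cats", "catsdogcats", "dog", "dogcatsdog",
         "hippopotamuses", "rat", "ratcatdogcat"]

# The concatenated words here are 'catsdogcats', 'dogcatsdog' and 'ratcatdogcat';
# the order of the returned list may differ since sets are unordered.
print(solver.findAllConcatenatedWordsInADict(words))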

Note that we first try to discover words that are concatenations of exactly two words from the set, since these are easy and efficient, O(nd), to find. Then, with the remaining words, we build the prefix tree, compress it, and search for words that are concatenations of more than two words.

Instead of finding the two-word concatenations before constructing the prefix tree, we could also build the prefix tree with all the words and then find all concatenated words (two or more) with it. But there could be inputs like :

["a", "aa", "aaa", "aaaa", "aaaaa", ...., "a"*10000]

which are handled very efficiently by the first pass (concatenation of two words), with a best case complexity of only O(n).

The worst case run-time complexity of the above algorithm is O(nd).

Categories: PROBLEM SOLVING
