Solution:

One naive approach is to estimate the value of M from the fact that the maximum value of M possible is:

Q=(n1*len(s1))/(n2*len(s2))

Let s^{(x)} denote the string s repeated 'x' times.

Using binary search between 0 to Q, check if the string s2^{(n2*m)} is present in the string s1^{(n1) }for some m. If s2^{(n2*m)} is present in s1^{(n1)}, then search to the right half of m else to the left half of m. Continue searching until we find some M for which any value m' > M, s2^{(n2*m')} is NOT present in s1^{(n1)}. Searching for whether the string s2^{(n2*m)} is present in s1^{(n1)} is of the order O(n1*len(s1)) because in the worst case we need to scan the entire string s1^{(n1)}, which is of length n1*len(s1). The overall run-time is thus (factoring in the time to do binary search):

O(n1*len(s1)*log(Q))

If we denote n1*len(s1) by N, then the run-time is O(N*logN). Since N is of the order of 10^{8}, the run-time can be really huge and so this method is not effecient at all. Can we come up with an effecient approach ?

One of the question that we should ask is that, can we compute how many times the string s2 occurs in the string s1^{(n1)} without needing to scan the entire string. One observation is that if the string s2 occurs between the indices [i, j] in s1^{(n1)} then there exists some 'p' for which s2 will also occur between the indices [i+p, j+p], [i+2p, j+2p], ... [i+kp, j+kp] in s1^{(n1)} where j + kp <= n1*len(s1)-1 for some factor 'k'.

'p' is known as the period. The period 'p' is in-fact multiple of len(s1). Thus if we find two occurrences of s2 such that they start/end at the same character position in s1, then we have found the period 'p' because (j + kp) mod len(s1) = j mod len(s1), since p is a multiple of len(s1).

For e.g. if s1='nlhqgllunmelayl' and n1=1000 and s2='lnl', then one occurrence of s2 would be [6,11], but observe that s2 will also occur at [21,26], [36,41] ... and so on. Thus period 'p' is 15 = len(s1).

Also if there are more than one occurrences of s2 between [i, j] and [i+p, j+p] then these occurrences will also have a period of 'p'. Observe that the maximum value of the period 'p' is len(s1)*len(s2), because the maximum number of occurrences of s1 required for s2 to be present in it is len(s1)*len(s2)

For e.g. s1="aaabaa" and s2="bbb", then we need al-least 3 occurrences of s1 for s2 to be present, if at all s2 should be present.

To find the number of occurrences of s2 in s1^{(n1)} , find the minimum value of 'p' first in O(len(s1)*len(s2)) time complexity and then for all j' between j and j+p, find the respective k', such that j'+k'p <= n1*len(s1)-1:

k' = (n1*len(s1)-1-j')/p

Add up all these k' to get the number of times s2 occurs in s1^{(n1)}. The number of such k's would be O(len(s1)).

The python code for the solution is as follows:

class Solution(object): def get_period(self, s1, s2): end_positions, end_pos_map = [], dict() i, j = 0, 0 n, m = len(s1), len(s2) period = 0 while True: if s1[i%n] == s2[j%m]: i += 1 j += 1 if j % m == 0: if (i-1) % n in end_pos_map: period = i-1-end_pos_map[(i-1)%n] break end_positions.append(i-1) end_pos_map[(i-1)%n] = i-1 else: i += 1 return end_positions, period def getMaxRepetitions(self, s1, n1, s2, n2): if n1 == 0 or len(set(s2).difference(set(s1))) > 0: return 0 positions, period = self.get_period(s1, s2) max_pos, sums = n1*len(s1)-1, len(positions) for x in positions: sums += (max_pos-x)/period return sums/n2

The complexity of this approach is O(len(s1)*len(s2)), which is much better than what we had got earlier because len(s1) or len(s2) is of O(10^{2}) only.

- Building a classification pipeline with C++11, Cython and Scikit-Learn
- Leetcode : Longest Valid Parentheses

Categories: PROBLEM SOLVING

Tags: Count Repetitions, Dynamic Programming, Leetcode, Modulus, Period, String Distance