Stokastik

Machine Learning, AI and Programming

LeetCode : Maximum Gap

Problem Statement

Solution :

There were many different approaches that were coming to my mind while beginning to solve the problem, but all of them were non-linear space/time complexities. One thing that I realized was that, if there was some generic O(N) approach that enables us to find difference between any two consecutive numbers in the sorted array, then sorting the numbers itself would be linear. But that is not the case as we know. For example, consider adding the numbers to a min-heap. Although "heapifying" the integers is O(N), but computing the successive differences requires popping out the root of the heap each time (before taking the difference with the minimum of its child nodes), which is O(N*logN).

Luckily the numbers are 32-bit integers, which means that we can look at some integer sorting algorithms that has an average time complexity of O(N). For example, counting sort, bucket sort or radix sort might come in handy while solving this problem. The problem with counting sort is that, for 32-bit positive integers, the range of numbers is [0, 231 - 1], which means that I need to keep an auxiliary array of size 231 irrespective of whether my input array has only 2 elements or 231 elements. The space complexity is not linear in the size of the array.

Bucket sort is a generalization of the counting sort algorithm. The entire range of numbers in counting sort i.e. [0, 231 - 1] is projected onto a smaller range of size N, by using some kind of hashing technique, thus creating N buckets in which to put our N numbers.

Hashing constraint : H(m) < H(n) if m < n

The hashing algorithm might not be perfect and more than one number might be put inside the same bucket while some buckets might be empty. If a bucket contains more than one element, then the numbers in that bucket is sorted separately, using QuickSort. If the hashing is perfect and each bucket contains one element, then the run-time is O(N), while in the worst case where the hashing is bad, all numbers might go into the same bucket and thus run time is same as the worst case of QuickSort for N numbers.

Bucket Sort (Source : Wikipedia)

Thus the worst case complexity of Bucket Sort is same as the worst case complexity of QuickSort i.e. O(N2). Can we use the bucket sort logic to compute the maximum gap without requiring to sort the integers, in O(N) complexity ?

Given an array A of N, 32-bit integers, the maximum gap between any two integers in the sorted array would be at-least :

gap = \text{ceiling}(\frac{\text{max}(A)\;+\;\text{min}(A)}{N})

For, example, if an array contains the following numbers :

A = [23, 78, 12, 100, 45, 6, 99]

Then min(A) = 6 and max(A) = 100 and N = 7. Then the maximum gap between any two consecutive integers in the sorted array should be at-least 106/7 = 16.

The sorted array A :

sorted A = [6, 12, 23, 45, 78, 99, 100]

The maximum gap is 33 (78 - 45), which is greater than 16. If the numbers in the array are uniformly distributed in the same range [min, max] = [6, 100], such as in :

sorted A' = [6, 21, 37, 52, 68, 84, 100]

The maximum difference between any two consecutive numbers is 16. If we reduce or increase any of the numbers between 6 and 100, the maximum difference would either remain same (16) or would increase, but it would never decrease. Verify that.

Now let's say that we distribute the numbers into N buckets. The hashing function used is :

H(n) = \text{int}(\frac{n\;-\;\text{min}(A)}{\text{gap}}), where 'gap' is defined as above

The above function, assigns each number to one of the N buckets.

In case the numbers are uniformly distributed between min(A) and max(A), then each bucket will contain only one number and we can iterate over the buckets and compute the maximum difference easily. But suppose the numbers are not uniformly distributed, then some buckets will contain more than one number and some bucket will be empty.

Do we need to sort and scan each bucket, in case it contains more than one element ?

Observe that numbers which are "close" to each other, would go into the same bucket, meaning that the difference between consecutive numbers in sorted form would be very small. In fact the maximum difference between any two numbers in that bucket will always be less than 'gap', i.e. the maximum difference of numbers in each bucket will never give the correct answer.

For example, consider the array A = [19, 6, 10000, 14, 15, 7, 11], then the distribution of the numbers into buckets is very skewed.

The sub-array [6, 7, 11, 14, 15, 19] will go into the 1st bucket whereas [10000] will go into the last bucket. But none of the consecutive number difference in the first bucket is at-least (6 + 10000)/7 = 1430 (maximum is 4).

Thus we can never find the maximum gap in the first bucket.

Thus we do not need to sort and scan each bucket separately but we only need to track the minimum and maximum number in each bucket.

To compute the maximum gap, we only need to scan each bucket in the order and take the difference between the maximum number in one bucket with the minimum number in the next bucket.

Following is the code in python :

import math

class Solution(object):
    def maximumGap(self, nums):
        if len(nums) < 2:
            return 0

        min_num, max_num = float("Inf"), -float("Inf")

        for num in nums:
            min_num = min(min_num, num)
            max_num = max(max_num, num)

        num_buckets = len(nums)

        min_diff = int(math.ceil(float(min_num + max_num) / num_buckets))

        buckets = [(float("Inf"), -float("Inf"))] * num_buckets

        for num in nums:
            bucket_num = int((num - min_num) / min_diff)
            w = buckets[bucket_num]
            a, b = min(w[0], num), max(w[1], num)
            buckets[bucket_num] = (a, b)

        max_diff = buckets[0][1] - buckets[0][0]

        last = 0

        for idx in range(1, len(buckets)):
            if buckets[idx][0] != float("Inf"):
                max_diff = max(max_diff, buckets[idx][0] - buckets[last][1])
                last = idx

        return max_diff

Note that some buckets might be empty and we skip those buckets in the last for loop.

The run time and space complexity is O(N).

Categories: PROBLEM SOLVING

Tags: , , , ,