Stokastik

Machine Learning, AI and Programming

Sorting numbers in parallel on GPU - Bitonic Sort

Sorting is one of the most common tasks in programming. There are many efficient sorting algorithms, and the built-in libraries of most programming languages implement them in a time-efficient manner. Merge Sort, Quick Sort and Heap Sort are generic algorithms for sorting real numbers. They have an average-case run-time complexity of O(N*logN) for sorting N real numbers. Each has its own use case depending on the size of the data, the memory available and so on. For integers, O(N) time complexity algorithms such as Radix Sort and Counting Sort are also available.

In this post we look at another sorting algorithm known as Bitonic Sort. Although Bitonic Sort has O(N*log²N) complexity, it can be easily parallelized using threads on the CPU or implemented in CUDA for the GPU. With a parallel implementation on the GPU (given enough parallel processors), the number of sequential steps drops to O(log²N). Before going into the GPU implementation, we explain a bit about how Bitonic Sort works:

A sequence a0, a1, ..., an-1 of length n is a bitonic sequence if it first increases and then decreases, or first decreases and then increases, i.e. there is some i for which a0 < a1 < ... < ai and ai > ai+1 > ... > an-1, or a0 > a1 > ... > ai and ai < ai+1 < ... < an-1. For example, 2, 4, 9, 7, 1 is a bitonic sequence. A bitonic sequence is also allowed to wrap around, as in 2, 4, 9, 5, 0, 1. Although this sequence increases, then decreases, then increases again, if you start reading it from the 5th position, i.e. from 0, and wrap around from the end back to the start, it becomes an increasing-then-decreasing sequence.

Given a bitonic sequence of length 2n, if we partition it into two equal parts of length n each, say A and B, then if we compare each element ai in A with the corresponding element bi in B and swap them whenever ai > bi, both A and B will individually be bitonic sequences, A will contain the smallest n numbers and B the largest n numbers.

Let's say our bitonic sequence of length 8 is [2, 5, 7, 6, 4, 3, 1, 0].

Then if we divide the sequence into 2 equal parts, we obtain A = [2, 5, 7, 6] and B = [4, 3, 1, 0]. Now if we swap ai with bi whenever ai > bi, we obtain A = [2, 3, 1, 0] and B = [4, 5, 7, 6]. Observe that both A and B are bitonic after the swaps, and that A contains the smallest 4 numbers and B the largest 4 numbers.

Recursively applying this divide-and-swap step to A and B separately eventually gives us a sorted array of 8 numbers, as sketched below.
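
To make that step concrete, here is a minimal recursive sketch (not part of the final implementation below, and assuming the length is a power of 2) that sorts a bitonic sequence stored in arr[lo, lo+len) in increasing order:

#include <utility>

// Sort a bitonic sequence of power-of-2 length 'len' in increasing order.
void bitonic_merge(int *arr, int lo, int len) {
    if (len < 2) return;
    int half = len / 2;
    for (int i = lo; i < lo + half; i++) {
        if (arr[i] > arr[i + half]) {
            std::swap(arr[i], arr[i + half]);   // push the smaller value into the left half
        }
    }
    bitonic_merge(arr, lo, half);               // left half is now bitonic, recurse
    bitonic_merge(arr, lo + half, half);        // right half is now bitonic, recurse
}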

Sorting a bitonic sequence in increasing order

The above steps sort an array only if the initial array is a bitonic sequence, but our input array could be anything, not necessarily a bitonic sequence. Given two arrays C and D of length n each, we can obtain a bitonic sequence of length 2n by sorting C in increasing order and D in decreasing order (or the other way around) and concatenating them.

Thus to obtain a sorted array of length 2n we need a bitonic sequence of length 2n, but to obtain a bitonic sequence of length 2n we need two sorted arrays of length n each. This is again a recursive process.

Thus to sort an array of length 8:

  • Starting with sorted arrays of length 1 (there are 8 sorted arrays of length 1).
  • Create bitonic sequences of length 2 each (there would be 4 bitonic sequences of length 2).
    • For example, if the initial array is [3,1,5,7,6,0,9,8], then trivially there are 8 sorted arrays of length 1.
    • Merge consecutive elements to get length 2 bitonic sequences: [[3,1], [5,7], [6,0], [9,8]]
    • [3,1] is a bitonic sequence of length 2, so are [5,7], [6,0] and [9,8].
  • Create sorted arrays of length 2 from the above bitonic sequences (there would be 4 sorted arrays of length 2).
    • The arrays are sorted so that the direction alternates: consecutive sorted arrays are alternately increasing and decreasing.
    • With 4 sorted arrays of length 2 each, the directions are: increasing (indices 0, 1) -> decreasing (2, 3) -> increasing (4, 5) -> decreasing (6, 7).
    • [[1,3], [7,5], [0,6], [9,8]] is the outcome from last step.
  • From 4 sorted arrays of length 2 each, create 2 bitonic sequences of length 4 each.
    • Again merge consecutive arrays: [[1,3,7,5], [0,6,9,8]]
    • [1,3,7,5] is a 4 length bitonic sequence so is [0,6,9,8].
  • Create sorted arrays of length 4 each from the above bitonic sequences (there would be 2 sorted arrays of length 4).
    • [[1,3,5,7], [9,8,6,0]] is the outcome from last step.
  • From 2 sorted arrays of length 4 each, create 1 bitonic sequence of length 8.
    • Again merge consecutive arrays: [[1,3,5,7,9,8,6,0]]
    • [1,3,5,7,9,8,6,0] is a length 8 bitonic sequence.
  • Create 1 sorted array of length 8 from the above bitonic sequence.
    • [0,1,3,5,6,7,8,9] is the final outcome from last step.

To analyze the running time of the above algorithm, first consider the cost of creating a sorted array of length n from a bitonic sequence of length n (which itself consists of one sorted array of length n/2 followed by another of length n/2 sorted in the opposite order).

Let P(n) be the time complexity of obtaining a sorted array of length n from a bitonic sequence of length n:

P(n) = 2*P(n/2) + O(n), where the O(n) term is the cost of the compare-and-swap pass over the two halves.

P(n) = O(n*logn)

For a generic array of n numbers, let T(n) be the time complexity of sorting the n numbers, then we write:

T(n) = 2*T(n/2) + P(n)

because we need to obtain two sorted arrays of length n/2 each (one in increasing order and the other in decreasing order), which is the same as obtaining a bitonic sequence of length n. Turning that bitonic sequence into a single sorted array then costs P(n) as above.

Thus T(n) = O(n*log²n).

To see why, expand the recurrence: level i of the recursion has 2^i sub-problems, each costing P(n/2^i) = O((n/2^i)*log(n/2^i)), so every level costs at most O(n*logn) and there are O(logn) levels. Equivalently, the sort runs in O(logn) stages, where stage i consists of i compare-and-swap passes; summing 1 + 2 + ... + logn gives O(log²n) passes of O(n) work each.

Bitonic Sort Schematic Diagram (For 16 numbers, there are 4 stages and each stage has 1,2,3,4 sub-stages respectively).

For this post we will assume that we are working with arrays of sizes that are powers of 2.

Here is a simple C++11 program that does bitonic sort for a generic array of integers:

#include <iostream>
#include <math.h>
#include <time.h>
#include <stdlib.h>
#include <random>
#include <vector>
#include <chrono>
#include <deque>
#include <algorithm>
#include <iterator>

bool check_correctness(int *orig_arr, int *arr_sorted, const int n) {
    std::sort(orig_arr, orig_arr+n);
    
    for (int i = 0; i < n; i++) {
        if (orig_arr[i] != arr_sorted[i]) {
            return false;
        }
    }
    return true;
}

void swap(int *arr, const int skip, const int oflag, const int order, const int n) {
    for (int i = 0; i < n/2; i++) {
        // map compare position i to the pair of indices (j, k) that are 'skip' apart
        int j = i*2 - i%skip;
        int k = j + skip;
        
        // direction flag: blocks of 'oflag' compare positions alternate between
        // increasing and decreasing order
        int f = ((int) (i/oflag)) % 2 == 0 ? order:-order;
        
        int x, y;
        
        if (f == 1) {
            x = arr[j] < arr[k] ? arr[j]:arr[k];
            y = arr[j] > arr[k] ? arr[j]:arr[k];
        }
        else {
            x = arr[j] > arr[k] ? arr[j]:arr[k];
            y = arr[j] < arr[k] ? arr[j]:arr[k];
        }
        
        arr[j] = x;
        arr[k] = y;
    }
}

void bitonic_sort(int *arr, const int order, const int n) {
    int skip = 1;
    
    while (skip < n) {
        int k = skip;
        
        while (k > 0) {
            swap(arr, k, skip, order, n);
            k /= 2;
        }
            
        skip *= 2;
    }
}

void random_vector(int *arr, const int n, const int min_val=0, const int max_val=1000) {
    static std::random_device rd;
    static std::mt19937 mte(rd());
    std::uniform_int_distribution<int> dist(min_val, max_val);
    
    for (int i = 0; i < n; i++) {
        arr[i] = dist(mte);
    }
}

int main(void) {
    int n = 1 << 25;
    
    int *arr, *temp;
    arr = new int[n];
    
    random_vector(arr, n, 0, 10000);
    
    temp = new int[n];
    std::copy(arr, arr+n, temp);
    
    auto t1 = std::chrono::high_resolution_clock::now();
    bitonic_sort(arr, 1, n);
    auto t2 = std::chrono::high_resolution_clock::now();
    
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>( t2 - t1 ).count();

    std::cout << duration << std::endl;
    
    t1 = std::chrono::high_resolution_clock::now();
    std::cout << check_correctness(temp, arr, n) << std::endl;
    t2 = std::chrono::high_resolution_clock::now();
    
    duration = std::chrono::duration_cast<std::chrono::milliseconds>( t2 - t1 ).count();

    std::cout << duration << std::endl;
    
    return 0;
}

Note that we could also have written a more top-down recursive approach to the above problem, but we have deliberately chosen a non-recursive bottom-up approach because we will be parallelizing the same swap method using threads as well as on the GPU using CUDA.
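
That top-down recursive version might look roughly like the following sketch (our own illustration, assuming power-of-2 lengths; these helper names are not from the repository):

#include <utility>

// Compare-and-swap across the two halves of a bitonic range, then recurse.
void bitonic_merge_dir(int *arr, int lo, int len, bool increasing) {
    if (len < 2) return;
    int half = len / 2;
    for (int i = lo; i < lo + half; i++) {
        if ((arr[i] > arr[i + half]) == increasing) {
            std::swap(arr[i], arr[i + half]);
        }
    }
    bitonic_merge_dir(arr, lo, half, increasing);
    bitonic_merge_dir(arr, lo + half, half, increasing);
}

// Sort arr[lo, lo+len): sort the halves in opposite directions to form a
// bitonic sequence, then merge it in the requested direction.
void bitonic_sort_recursive(int *arr, int lo, int len, bool increasing) {
    if (len < 2) return;
    int half = len / 2;
    bitonic_sort_recursive(arr, lo, half, true);
    bitonic_sort_recursive(arr, lo + half, half, false);
    bitonic_merge_dir(arr, lo, len, increasing);
}

Calling bitonic_sort_recursive(arr, 0, n, true) sorts arr[0, n) in increasing order.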

We are sorting 2^25 (approx. 33 million) integers selected uniformly at random from the range 0 to 10000. The for loop inside the swap method is written very carefully; it took several hours of thinking to come up with the indices to swap as a function of the looping index i.
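
As a quick illustration of that index arithmetic, here is a small stand-alone sketch (purely illustrative, not part of the sorting code) that prints the (j, k) compare pairs generated by the formula for n = 8 at each compare distance:

#include <iostream>

int main() {
    const int n = 8;
    for (int skip = 1; skip <= n/2; skip *= 2) {
        std::cout << "distance " << skip << ": ";
        for (int i = 0; i < n/2; i++) {
            int j = i*2 - i%skip;    // lower index of the compare pair
            int k = j + skip;        // upper index, 'skip' positions away
            std::cout << "(" << j << "," << k << ") ";
        }
        std::cout << std::endl;
    }
    return 0;
}

For distance 1 this prints (0,1) (2,3) (4,5) (6,7), for distance 2 it prints (0,2) (1,3) (4,6) (5,7), and for distance 4 it prints (0,4) (1,5) (2,6) (3,7), i.e. the n/2 loop indices always map onto pairs that are exactly 'skip' positions apart within consecutive blocks.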

When the same code is written in C++11 using threads:

#include <thread>

void swap_thread(int *arr, const int skip, const int oflag, const int order, const int start, const int end) {
    for (int i = start; i < end; i++) {
        int j = i*2 - i%skip;
        int k = j + skip;
        
        int f = ((int) (i/oflag)) % 2 == 0 ? order:-order;
        
        int x, y;
        
        if (f == 1) {
            x = arr[j] < arr[k] ? arr[j]:arr[k];
            y = arr[j] > arr[k] ? arr[j]:arr[k];
        }
        else {
            x = arr[j] > arr[k] ? arr[j]:arr[k];
            y = arr[j] < arr[k] ? arr[j]:arr[k];
        }
        
        arr[j] = x;
        arr[k] = y;
    }
}


void swap(int* arr, const int skip, const int oflag, const int order, const int n) {
    int n_threads = 1 << 5;
    std::vector<std::thread> threads(n_threads);
    
    int batch_size = n/(2*n_threads);

    for (int i = 0; i < n_threads; i++) {
        int start = batch_size*i;
        int end = std::min(batch_size*(i+1), n/2);
        threads[i] = std::thread(swap_thread, arr, skip, oflag, order, start, end);
    }

    for (int i = 0; i < n_threads; i++) {
        threads[i].join();
    }
}

void bitonic_sort(int *arr, const int order, const int n) {
    int skip = 1;
    
    while (skip < n) {
        int k = skip;
        
        while (k > 0) {
            swap(arr, k, skip, order, n);
            k /= 2;
        }
            
        skip *= 2;
    }
}

int main(void) {
    int n = 1 << 25;
    
    int *arr, *temp;
    arr = new int[n];
    
    random_vector(arr, n, 0, 10000);
    
    temp = new int[n];
    std::copy(arr, arr+n, temp);
    
    auto t1 = std::chrono::high_resolution_clock::now();
    bitonic_sort(arr, 1, n);
    auto t2 = std::chrono::high_resolution_clock::now();
    
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>( t2 - t1 ).count();

    std::cout << duration << std::endl;
    
    t1 = std::chrono::high_resolution_clock::now();
    std::cout << check_correctness(temp, arr, n) << std::endl;
    t2 = std::chrono::high_resolution_clock::now();
    
    duration = std::chrono::duration_cast<std::chrono::milliseconds>( t2 - t1 ).count();

    std::cout << duration << std::endl;
    
    return 0;
}

Care must be taken not to choose too many threads, because creating and destroying threads is an expensive operation. Thus we are limiting the number of threads to 32 for 2^25 integers.

We experimented with different thread counts and found that 32 gives the best performance on our machine.
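
If you prefer not to hard-code the thread count, one option (our assumption here, not what the code above does) is to derive it from the hardware and fall back to a default:

#include <thread>

// Returns the number of hardware threads, or 'fallback' if it cannot be determined.
unsigned pick_thread_count(unsigned fallback = 32) {
    unsigned hw = std::thread::hardware_concurrency();  // may return 0 if unknown
    return hw > 0 ? hw : fallback;
}

A thread pool that reuses the same threads across all the swap passes would also avoid the repeated creation and destruction cost, at the price of extra code.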

Finally we come to the GPU implementation with CUDA and C++11. If you are not familiar with CUDA syntax, I would recommend going through my previous blog post for reference:

#define BLOCK_SIZE 1024

__global__ void swap(int *arr, const int skip, const int oflag, const int order, const int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    
    if (i < n/2) {
        int j = i*2 - i%skip;
        int k = j + skip;

        int f = ((int) (i/oflag)) % 2 == 0 ? order:-order;

        int x, y;

        if (f == 1) {
            x = arr[j] < arr[k] ? arr[j]:arr[k];
            y = arr[j] > arr[k] ? arr[j]:arr[k];
        }
        else {
            x = arr[j] > arr[k] ? arr[j]:arr[k];
            y = arr[j] < arr[k] ? arr[j]:arr[k];
        }

        arr[j] = x;
        arr[k] = y;
    }
}

void bitonic_sort(int *arr, const int order, const int n) {
    int skip = 1;
    
    while (skip < n) {
        int k = skip;
        
        while (k > 0) {
            swap<<<(n/2 + BLOCK_SIZE - 1)/BLOCK_SIZE, BLOCK_SIZE>>>(arr, k, skip, order, n);
            cudaDeviceSynchronize();
            
            k /= 2;
        }
            
        skip *= 2;
    }
}

int main(void) {
    int n = 1 << 25;
    
    int *arr, *temp;
    cudaMallocManaged(&arr, n*sizeof(int));
    
    random_vector(arr, n, 0, 10000);
    
    temp = new int[n];
    std::copy(arr, arr+n, temp);
    
    auto t1 = std::chrono::high_resolution_clock::now();
    bitonic_sort(arr, 1, n);
    auto t2 = std::chrono::high_resolution_clock::now();
    
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>( t2 - t1 ).count();

    std::cout << duration << std::endl;
    
    t1 = std::chrono::high_resolution_clock::now();
    std::cout << check_correctness(temp, arr, n) << std::endl;
    t2 = std::chrono::high_resolution_clock::now();
    
    duration = std::chrono::duration_cast<std::chrono::milliseconds>( t2 - t1 ).count();

    std::cout << duration << std::endl;
    
    cudaFree(arr);

    return 0;
}

Note that there are minimal changes from the single-threaded C++ CPU implementation. The major changes are:

  • The method 'swap' is now a kernel function because this will now run on GPU.
  • Instead of the for-loop in the swap method, we use the thread index as the loop iterator, launching (n/2 + BLOCK_SIZE - 1)/BLOCK_SIZE blocks of BLOCK_SIZE threads so that each compare-and-swap gets its own thread.
    • Thus all swaps in a pass are done in an embarrassingly parallel manner, requiring only O(1) parallel steps (given enough GPU threads) as compared to O(N) sequential steps for N integers.
    • Thus the parallel running time of the bitonic sort is reduced from O(N*log²N) to O(log²N).

Here are the final comparison numbers for the different sorting approaches. Observe that we also include numbers for a Python implementation to show how fast C++ is in general compared to Python:

Approach                                           Time Taken (in ms)
Python built-in sort                               20552
C++11 built-in sort                                8278
C++11 single-threaded Bitonic Sort                 81840
C++11 multi-threaded (32 threads) Bitonic Sort     12350
NVIDIA Tesla P100 CUDA implementation              245

Thus, as expected, our GPU implementation performs the best. The C++11 built-in sort is about 2.5 times faster than Python's built-in sort for 33 million integers. Interestingly, the single-threaded C++ bitonic sort on the CPU is much slower than even Python's built-in sort, but the multi-threaded C++11 implementation of bitonic sort is faster than Python's built-in sort.

Thus the bottom line is that we are now able to sort 33 million integers in only about 250 milliseconds (roughly 34x faster than the built-in CPU sort). That's quite incredible!!!

Important: the above code only works when the array size is a power of 2. If it is not, we can either pad the array so that its size becomes a power of 2, or break it into multiple parts whose sizes are powers of 2, sort the parts in parallel and then merge the sorted parts in O(n) time.
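
Here is a rough sketch of the padding idea (an illustration under our own assumptions, not code from the repository): grow the input to the next power of 2 and fill the tail with INT_MAX, so that after an increasing sort the padding ends up at the back and the first n elements are the answer:

#include <climits>
#include <vector>

// Copy arr[0, n) into a vector whose size is the next power of 2 >= n,
// padding the tail with INT_MAX so it sorts to the end.
std::vector<int> pad_to_power_of_two(const int *arr, int n) {
    int m = 1;
    while (m < n) m *= 2;
    std::vector<int> padded(arr, arr + n);
    padded.resize(m, INT_MAX);
    return padded;
}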

All of the above code is shared on my GitHub repository.
