Stokastik

Machine Learning, AI and Programming

Constructing Binary Search Trees on the GPU

Binary Search Trees (BSTs) are one of the most important data structures for efficient searching (lookup) and sorting problems. In machine learning, BSTs or their variants are widely used for tasks such as fast nearest neighbor lookups (KD-Trees), Decision Trees, etc. BSTs have an average-case time complexity of O(logN) for searching, where N is the number of values to search over, but this can go up to O(N) in the worst case. When the tree is balanced, as in AVL Trees or Red-Black Trees, we can achieve O(logN) complexity for search and insert operations even in the worst case.

Constructing a standard BST is an easy problem. The tree can be built recursively as follows:

  • If a value is less than or equal to the value at the current node, then go towards the left sub-tree.
  • If a value is greater than the value at the current node, then go towards the right sub-tree.

The following C++ program can be used to generate a binary search tree from an array of numbers. For this exercise we choose 2^25 (~33 million) uniformly sampled real numbers between 0 and 10K.

#include <iostream>
#include <math.h>
#include <time.h>
#include <stdlib.h>
#include <random>
#include <vector>
#include <chrono>
#include <deque>
#include <algorithm>
#include <iterator>
#include <set>
#include <tuple>

void random_vector(float *arr, const int n, const float min_val=0.0, const float max_val=1000.0) {
    static std::random_device rd;
    static std::mt19937 mte(rd());
    std::uniform_real_distribution<float> dist(min_val, max_val);
    
    for (int i = 0; i < n; i++) {
        arr[i] = dist(mte);
    }
}

struct bstree {
    float val;
    int index;
    bstree *left;
    bstree *right;
};

bstree *insert_into_tree(bstree *tree, float val, int index) {
    if (tree == NULL) {
        tree = (bstree*)malloc(sizeof(bstree));
        
        tree->val = val;
        tree->index = index;
        tree->left = NULL;
        tree->right = NULL;
    }
    else {
        if (val <= tree->val) {
            tree->left = insert_into_tree(tree->left, val, index);
        }
        else {
            tree->right = insert_into_tree(tree->right, val, index);
        }
    }
    return tree;
}

bstree *bs_tree(float *arr, const int n) {
    bstree *tree = NULL;
    for (int i = 0; i < n; i++) {
        tree = insert_into_tree(tree, arr[i], i);
    }
    
    return tree;
}
    
int main(void) {
    int n = 1 << 25;
    
    float *arr = new float[n];
    
    random_vector(arr, n, 0, 10000);
    
    auto t1 = std::chrono::high_resolution_clock::now();
    bstree *tree = bs_tree(arr, n);
    auto t2 = std::chrono::high_resolution_clock::now();
    
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>( t2 - t1 ).count();

    std::cout << duration << std::endl;

    return 0;
}

The above tree construction algorithm is the simplest one, but it does not guarantee that the tree is balanced. If the initial set of numbers is already sorted, the constructed tree degenerates into a linear chain and the search and insert operations become O(N). This is why we sample the numbers from a random distribution. For a balanced tree, please refer to this code in Python for an AVL Tree.

Constructing a BST with N real numbers takes O(N*logN) time in the average case. Had we used integers instead of real numbers in the above code, the complexity would be higher, almost O(N^2) in the worst case. This is because N = 2^25 while the integers in the array would lie in the range [0, 10^4], so each integer would be repeated about N/10^4 times on average. When numbers are repeated many times, the tree structure becomes almost linear.
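To see why repeated keys degrade the tree, note that with the insertion rule above, equal keys always descend into the left subtree, so k copies of the same value form a left chain of length k. The following is a small illustrative sketch (my own, not part of the original code), using a hypothetical tree_depth helper on the bstree struct defined above:

int tree_depth(bstree *tree) {
    // Depth of the (possibly unbalanced) tree rooted at 'tree'.
    if (tree == NULL) return 0;
    int l = tree_depth(tree->left);
    int r = tree_depth(tree->right);
    return 1 + (l > r ? l : r);
}

// Example: inserting {5, 5, 5, 5} via insert_into_tree produces a tree of
// depth 4 (a pure left chain), instead of the ~log2(4) = 2 levels we would
// expect from distinct keys.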

For 33 million real numbers constructing the BST takes around 63 seconds on an Azure server.

Now let's try to implement the BST construction algorithm on a GPU using CUDA and C++. Looking at the above C++ approach, it is clear that the tree construction algorithm, as written, is inherently sequential. But with the help of auxiliary arrays, we can convert the BST construction into a parallel algorithm. For this conversion, we use 3 auxiliary arrays:

  • left_child
  • right_child
  • parent

The array 'left_child' stores the index of the left child. For example, if the current root index is 2, then left_child[2] contains the index of the left child of node 2. If an index 'i' is a leaf node, then left_child[i] is -1.

Similarly, the array 'right_child' stores the index of the right child. For example, if the current root index is 5, then right_child[5] contains the index of the right child of node 5. If an index 'i' is a leaf node, then right_child[i] is -1.

The array parent[i] contains the index of the parent of node i. If an index 'i' is the root node, then parent[i] would be -1.
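To make the representation concrete, here is a small worked example (my own, not from the original post), following the convention above that the root's parent is -1. (In the code further below, every entry of 'parent' is instead initialized to the root index, and the root is recognized by parent[i] == i.)

// arr = {5.0, 3.0, 8.0}, with index 0 chosen as the root:
//
//        5 (index 0)
//       /          \
//   3 (index 1)   8 (index 2)
//
int left_child[3]  = { 1, -1, -1};   // node 1 (value 3) is the left child of node 0
int right_child[3] = { 2, -1, -1};   // node 2 (value 8) is the right child of node 0
int parent[3]      = {-1,  0,  0};   // the root (node 0) has no parent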

Figure: BST with the left_child, right_child and parent arrays.

The tree construction algorithm proceeds as follows:

  1. At each level 'k' of the tree, iterate over all the indices of the numbers in the array.
  2. For each index 'i' in the array, check whether arr[i] <= arr[parent_{k-1}[i]].
    • If it is true, then assign the parent of 'i' to left_child_{k-1}[parent_{k-1}[i]], i.e. parent_k[i] = left_child_{k-1}[parent_{k-1}[i]].
    • parent_{k-1}[i] is the parent of index 'i' at the (k-1)-th level of the tree.
    • left_child_{k-1}[j] denotes the left child of index 'j' at the (k-1)-th level of the tree.
    • (If left_child_{k-1}[parent_{k-1}[i]] is -1, i.e. the parent has no left child yet, then 'i' keeps its current parent.)
  3. Similarly, for each index 'i', check whether arr[i] > arr[parent_{k-1}[i]].
    • If it is true, then assign the parent of 'i' to right_child_{k-1}[parent_{k-1}[i]], i.e. parent_k[i] = right_child_{k-1}[parent_{k-1}[i]].
  4. Let us denote the parent_k[i] computed in one of the last two steps (either 2 or 3) by 'x', i.e. x = parent_k[i].
    • If arr[i] <= arr[x], then left_child_k[x] = i.
    • Else right_child_k[x] = i.

Note that at each level 'k' of the BST, the computation for each index 'i' is independent of all the other indices and can therefore be parallelized. Also note that in step 4 of the above algorithm, left_child_k[x] always ends up with the last index 'i' in the input array satisfying arr[i] <= arr[x] (and similarly for right_child_k[x] = i), because each assignment overwrites the value previously written to left_child_k[x] or right_child_k[x]. The indices that do not end up occupying the child slot simply keep 'x' as their parent and get attached at deeper levels in subsequent iterations.

The following C++ program demonstrates the above algorithm. In order to verify that the algorithm in fact produces a correct BST, we generate an in-order traversal of the BST and compare it with the sorted array of numbers. If they match, then the BST is correct.

struct bstree {
    int *left_child;
    int *right_child;
    int *parent;
    bool flag;
};

// One level-wise construction pass: every node that is not yet settled moves
// its candidate parent one level down the tree and, if the corresponding
// child slot of the new parent is free, claims it.
bstree construct_binary_tree(float *arr, bstree g, bstree g1, const int n) {
    // Start the new level from a copy of the previous level's arrays.
    std::copy(g.left_child, g.left_child+n, g1.left_child);
    std::copy(g.right_child, g.right_child+n, g1.right_child);
    std::copy(g.parent, g.parent+n, g1.parent);
    
    g1.flag = false;
    
    for (int i = 0; i < n; i++) {
        int y = g.parent[i];
        
        // A node participates only while it has no children and its parent
        // pointer does not point back to itself.
        if (g.left_child[i] == -1 && g.right_child[i] == -1 && i != g.parent[i]) {
            g1.flag = true;
            int x;
            
            if (arr[i] <= arr[y]) {
                // Descend into the left child of the current parent if it
                // exists, otherwise stay attached to the current parent.
                x = (g.left_child[y] != -1) ? g.left_child[y]:y;
                g1.parent[i] = x;
            }
            else {
                x = (g.right_child[y] != -1) ? g.right_child[y]:y;
                g1.parent[i] = x;
            }
            
            // Claim the free child slot of the new parent x.
            if (i != x) {
                if (arr[i] <= arr[x]) {
                    if (g1.left_child[x] == -1) {
                        g1.left_child[x] = i;
                    }
                }
                else {
                    if (g1.right_child[x] == -1) {
                        g1.right_child[x] = i;
                    }
                }
            }
        }
    }
            
    return g1;
}

bstree bs_tree(float *arr, int root_index, bstree g, bstree g1, const int n) {
    g.flag = true;
    
    while (g.flag) {
        g = construct_binary_tree(arr, g, g1, n);
    }
    
    return g;
}

float *traversal(float *arr, int *left_child, int *right_child, int root_index, const int n) {
    int *stack = new int[n];
    float *out = new float[n];
    
    stack[0] = root_index;
    int p = 1;
    int i = 0;

    std::set<int> visited;

    while (p > 0) {
        int curr_root = stack[p-1];
        
        if (left_child[curr_root] != -1 && visited.find(left_child[curr_root]) == visited.end()) {
            stack[p++] = left_child[curr_root];
        }
        else {
            if (visited.find(curr_root) == visited.end()) {
                out[i++] = arr[curr_root];
                visited.insert(curr_root);
            }
            
            if (right_child[curr_root] != -1 && visited.find(right_child[curr_root]) == visited.end()) {
                stack[p++] = right_child[curr_root];
            }
            else {
                p -= 1;
            }
        }
    }
    
    return out;
}

float *inorder_traversal(float *arr, bstree g, bstree g1, const int n) {
    // Pick a random index as the root; initially every node's candidate parent is the root.
    int root_index = rand() % static_cast<int>(n);
    std::fill(g.parent, g.parent+n, root_index);
    
    auto t1 = std::chrono::high_resolution_clock::now();
    g = bs_tree(arr, root_index, g, g1, n);
    auto t2 = std::chrono::high_resolution_clock::now();
    
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>( t2 - t1 ).count();

    std::cout << duration << std::endl;
    return traversal(arr, g.left_child, g.right_child, root_index, n);
}

bool check_correctness(float *arr, float *pred_arr, const int n) {
    std::sort(arr, arr+n);
    for (int i = 0; i < n ; i++) {
        if (arr[i] != pred_arr[i]) {
            return false;
        }
    }
    return true;
}
    
int main(void) {
    int n = 1 << 25;
    
    float *arr, *temp;
    arr = new float[n];
    
    random_vector(arr, n, 0, 10000);
    
    temp = new float[n];
    std::copy(arr, arr+n, temp);
    
    bstree g, g1;
    
    g.left_child = new int[n];
    g.right_child = new int[n];
    g.parent = new int[n];
    
    g1.left_child = new int[n];
    g1.right_child = new int[n];
    g1.parent = new int[n];
    
    std::fill(g.left_child, g.left_child+n, -1);
    std::fill(g.right_child, g.right_child+n, -1);
    
    auto t1 = std::chrono::high_resolution_clock::now();
    float *pred = inorder_traversal(arr, g, g1, n);
    auto t2 = std::chrono::high_resolution_clock::now();
    
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>( t2 - t1 ).count();

    std::cout << duration << std::endl;
    
    t1 = std::chrono::high_resolution_clock::now();
    std::cout << check_correctness(temp, pred, n) << std::endl;
    t2 = std::chrono::high_resolution_clock::now();
    
    duration = std::chrono::duration_cast<std::chrono::milliseconds>( t2 - t1 ).count();

    std::cout << duration << std::endl;

    return 0;
}

The method 'bs_tree' constructs the BST, whereas the method 'traversal' does an in-order traversal of the BST and produces a sorted sequence of numbers. The 'traversal' method uses an explicit stack instead of recursion (to avoid stack overflow on deep trees) to do the traversal and populate the final output array.
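For comparison, a plain recursive in-order traversal would look like the sketch below (my own illustration, not part of the original code); its recursion depth equals the height of the tree, which is what motivates the stack-based version above for large, possibly unbalanced trees.

void inorder_recursive(float *arr, int *left_child, int *right_child,
                       int node, float *out, int &i) {
    if (node == -1) return;
    // Visit the left subtree, then the node itself, then the right subtree.
    inorder_recursive(arr, left_child, right_child, left_child[node], out, i);
    out[i++] = arr[node];
    inorder_recursive(arr, left_child, right_child, right_child[node], out, i);
}

// Usage: int i = 0; inorder_recursive(arr, g.left_child, g.right_child, root_index, out, i);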

The above code is now in a form from which we can easily write our CUDA implementation. We only need to modify the 'construct_binary_tree' method and move the for loop into a CUDA kernel, as shown in the following program:

#define BLOCK_SIZE 1024

struct bstree {
    int *left_child;
    int *right_child;
    int *parent;
    bool *flag;
};

// One level-wise construction pass on the GPU: each thread handles one array index 'i'.
__global__ void populate_child_parent(float *arr, int *i_left_child, int *i_right_child, int *i_parent, int *o_left_child, int *o_right_child, int *o_parent, bool *flag, const int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    
    if (i < n) {
        int y = i_parent[i];

        if (i_left_child[i] == -1 && i_right_child[i] == -1 && i != y) {
            flag[0] = true;
            int x, p;

            if (arr[i] <= arr[y]) {
                p = i_left_child[y];
                x = (p != -1) ? p:y;
                o_parent[i] = x;
            }
            else {
                p = i_right_child[y];
                x = (p != -1) ? p:y;
                o_parent[i] = x;
            }

            if (i != x) {
                if (arr[i] <= arr[x]) {
                    if (o_left_child[x] == -1) {
                        o_left_child[x] = i;
                    }
                }
                else {
                    if (o_right_child[x] == -1) {
                        o_right_child[x] = i;
                    }
                }
            }
        }
    }
}

__global__ void copy_arr(int *in_arr, int *out_arr, const int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    
    if (i < n) {
        out_arr[i] = in_arr[i];
    }
}

void copy_array(int *arr1, int *arr2, const int n) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    copy_arr<<<(n + BLOCK_SIZE - 1)/BLOCK_SIZE, BLOCK_SIZE, 0, stream>>>(arr1, arr2, n);
    cudaDeviceSynchronize();
    cudaStreamDestroy(stream);
}

bstree construct_binary_tree(float *arr, bstree g, bstree g1, const int n) {
    copy_array(g.left_child, g1.left_child, n);
    copy_array(g.right_child, g1.right_child, n);
    copy_array(g.parent, g1.parent, n);
    
    g1.flag[0] = false;
    
    populate_child_parent<<<(n + BLOCK_SIZE - 1)/BLOCK_SIZE, BLOCK_SIZE>>>(arr, g.left_child, g.right_child, g.parent, g1.left_child, g1.right_child, g1.parent, g1.flag, n);
    cudaDeviceSynchronize();
            
    return g1;
}

bstree bs_tree(float *arr, int root_index, bstree g, bstree g1, const int n) {
    g.flag[0] = true;
    
    while (g.flag[0]) {
        g = construct_binary_tree(arr, g, g1, n);
    }
    
    return g;
}
    
int main(void) {
    int n = 1 << 25;
    
    float *arr, *temp;
    cudaMallocManaged(&arr, n*sizeof(float));
    
    random_vector(arr, n, 0, 10000);
    
    temp = new float[n];
    std::copy(arr, arr+n, temp);
    
    bstree g, g1;
    
    cudaMallocManaged(&g.left_child, n*sizeof(int));
    cudaMallocManaged(&g.right_child, n*sizeof(int));
    cudaMallocManaged(&g.parent, n*sizeof(int));
    cudaMallocManaged(&g.flag, sizeof(bool));
    
    cudaMallocManaged(&g1.left_child, n*sizeof(int));
    cudaMallocManaged(&g1.right_child, n*sizeof(int));
    cudaMallocManaged(&g1.parent, n*sizeof(int));
    cudaMallocManaged(&g1.flag, sizeof(bool));
    
    std::fill(g.left_child, g.left_child+n, -1);
    std::fill(g.right_child, g.right_child+n, -1);
    
    auto t1 = std::chrono::high_resolution_clock::now();
    float *pred = inorder_traversal(arr, g, g1, n);
    auto t2 = std::chrono::high_resolution_clock::now();
    
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>( t2 - t1 ).count();

    std::cout << duration << std::endl;
    
    t1 = std::chrono::high_resolution_clock::now();
    std::cout << check_correctness(temp, pred, n) << std::endl;
    t2 = std::chrono::high_resolution_clock::now();
    
    duration = std::chrono::duration_cast<std::chrono::milliseconds>( t2 - t1 ).count();

    std::cout << duration << std::endl;
    
    cudaFree(arr);
    
    cudaFree(g.left_child);
    cudaFree(g.right_child);
    cudaFree(g.parent);
    cudaFree(g.flag);
    
    cudaFree(g1.left_child);
    cudaFree(g1.right_child);
    cudaFree(g1.parent);
    cudaFree(g1.flag);

    return 0;
}

Note that in the above CUDA implementation, copying the left_child, right_child and parent arrays from the previous level's iteration to the current one is also pushed into separate kernels, each launched on its own CUDA stream, for speed optimization.
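Since each call to 'copy_array' above creates, synchronizes, and destroys its own stream, the three copies still run one after another. A possible variant (a sketch of my own, not the code from the repository) that issues all three copy kernels into different streams so they can overlap, and synchronizes only once, could look like this:

void copy_all_async(bstree g, bstree g1, const int n) {
    cudaStream_t s[3];
    int *src[3] = {g.left_child,  g.right_child,  g.parent};
    int *dst[3] = {g1.left_child, g1.right_child, g1.parent};

    // Launch the three copy kernels asynchronously, one per stream.
    for (int k = 0; k < 3; k++) {
        cudaStreamCreate(&s[k]);
        copy_arr<<<(n + BLOCK_SIZE - 1)/BLOCK_SIZE, BLOCK_SIZE, 0, s[k]>>>(src[k], dst[k], n);
    }

    // Wait for all three copies to finish before the next tree-level kernel.
    for (int k = 0; k < 3; k++) {
        cudaStreamSynchronize(s[k]);
        cudaStreamDestroy(s[k]);
    }
}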

Copying an array of size N on the CPU is O(N), whereas on the GPU each thread copies a single element, so with P threads executing in parallel the work per thread is O(N/P) (here the block size P is 1024). In the ideal case this gives up to a 1024x speedup for the copies over the sequential CPU version, although in practice the achievable speedup is limited by memory bandwidth.

To give an estimate of how fast the above CUDA implementation of BST construction is: it takes only around 925 milliseconds to construct the BST on the GPU with 33 million real numbers, as compared to 63 seconds on the CPU with the un-parallelized code, i.e. a speedup of around 70x!

All of the above code is in my Github repository.

Note that the parallelization above is only for the BST construction part and not for sorting the entire array. For sorting on the GPU, please refer to my earlier post on sorting using Bitonic Sort on the GPU. It is difficult to parallelize the in-order traversal with the above algorithm, but one can look at Dynamic Parallelism in CUDA for the in-order traversal.
