Working with deep learning algorithms on GPUs, especially the speedups gained for CNNs and tensor multiplications on a GPU as compared to a CPU, got me fascinated about learning how GPUs work internally and how serial algorithms can be parallelized on a GPU using the CUDA programming paradigm.

I have always been a proponent of faster code and always try out ways to improve the speed of existing code, either by improving the algorithmic complexity or by using compiled languages such as C/C++/Java variants in place of scripting languages such as R or Python. As a machine learning engineer, code speedups of 10-20x help run experiments faster and thus enable creating prototypes or production-ready models in minutes or hours instead of days.

In this series of posts I will explore how we can think about some simple and common algorithms in terms of parallel implementations in C++11 and CUDA on GPU-enabled machines. In this first post of the series I will show common matrix operations that we frequently encounter in machine learning and data science:

- Matrix Transpose
- Matrix-Matrix Multiplication
- Matrix-Matrix Convolutions

These operations are the backbone of many ML algorithms:

- In linear regression, to find the analytical solution for the weights we use W = (X^{T}X)^{-1}X^{T}Y, where X is the input feature matrix and Y are the labels.
- In neural networks, we heavily use matrix multiplications to compute the weighted inputs Z = XW to a hidden layer, where X is the output from the previous layer and W is the weight matrix between the previous layer and the current layer.
- In CNN, we need to perform convolution of the input image array (or intermediate feature matrix) with 3x3 or 5x5 kernels to generate features for successive layers.
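To make the convolution operation concrete before we get to the GPU versions, here is a minimal single-channel "valid" 2D convolution on the CPU (a sketch for illustration only; as is common in deep learning, the kernel is not flipped, so strictly speaking this is a cross-correlation):

```cpp
#include <cassert>
#include <vector>

// Minimal single-channel "valid" 2D convolution (no kernel flip, as in CNNs).
// input: n x m, kernel: kh x kw, output: (n-kh+1) x (m-kw+1).
std::vector<std::vector<int>> conv2d(const std::vector<std::vector<int>> &input,
                                     const std::vector<std::vector<int>> &kernel) {
    int n = input.size(), m = input[0].size();
    int kh = kernel.size(), kw = kernel[0].size();
    std::vector<std::vector<int>> out(n - kh + 1, std::vector<int>(m - kw + 1, 0));
    for (int i = 0; i + kh <= n; i++)
        for (int j = 0; j + kw <= m; j++)
            for (int a = 0; a < kh; a++)
                for (int b = 0; b < kw; b++)
                    out[i][j] += input[i + a][j + b] * kernel[a][b];
    return out;
}
```

For a 3x3 input and a 2x2 kernel this produces a 2x2 feature map, which is exactly the shape reduction we will deal with when convolving images with 3x3 or 5x5 kernels.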

Before diving into the CUDA/C++11 implementations for the matrix operations, I would like to give a brief overview of the important concepts regarding programming using CUDA on the GPU. These concepts are easily available over the internet.

Let's begin with a simple CUDA program that takes two 1-D vectors and sums them element-wise to produce a 3rd vector.

The idea is that since each component of the output vector is computed independently, we can parallelize the computation to reduce the run-time.

```cpp
#include <iostream>
#include <cstdlib>

// define number of threads in a block
#define BLOCK_SIZE 32

// kernel function runs on the device
__global__ void sum_vector(const int *vec1, const int *vec2, int *output, int n) {
    // Get the index of the element to sum. Within a block the index of the
    // current thread is threadIdx.x; each block has blockDim.x threads and the
    // index of the current block is blockIdx.x, so the global index of the
    // current thread is blockIdx.x * blockDim.x + threadIdx.x.
    int thread = blockIdx.x * blockDim.x + threadIdx.x;

    // When the number of threads does not evenly divide the number of elements
    // in the array, the global thread index can go beyond n.
    if (thread < n) {
        output[thread] = vec1[thread] + vec2[thread];
    }
}

int main(void) {
    int n = 1000;
    int *vec1, *vec2, *output;

    // allocate Unified Memory -- accessible from both CPU and GPU
    cudaMallocManaged(&vec1, n*sizeof(int));
    cudaMallocManaged(&vec2, n*sizeof(int));
    cudaMallocManaged(&output, n*sizeof(int));

    // fill the vectors with random integer values
    // (writing into the managed buffers in place; re-assigning the pointers
    // returned by cudaMallocManaged would leak the allocations)
    for (int i = 0; i < n; i++) {
        vec1[i] = rand() % 1000;
        vec2[i] = rand() % 1000;
    }

    // The number of blocks of threads required is the ceiling of the number of
    // elements in the array divided by the number of threads per block.
    int num_blocks = (n+BLOCK_SIZE-1)/BLOCK_SIZE;

    // call the kernel with the number of blocks and the threads per block
    sum_vector<<<num_blocks, BLOCK_SIZE>>>(vec1, vec2, output, n);
    cudaDeviceSynchronize();

    // free allocated unified memory
    cudaFree(vec1);
    cudaFree(vec2);
    cudaFree(output);
    return 0;
}
```

To run the above code, save it in a file named 'sum_1d_vectors.cu', then compile and run it using the following commands:

nvcc sum_1d_vectors.cu -o sum_1d_vectors
./sum_1d_vectors

Note that there are many things that differ from a standard C program.

- '__global__' keyword before the 'sum_vector' function denotes that this function runs on the GPU device and is called a kernel function.
- 'cudaMallocManaged' in the main method is used to allocate memory for the variables that can be accessed by both CPU and GPU.
- In GPU, operations are performed in parallel by threads. Threads are arranged in blocks and grids (see below diagram).
- A grid contains blocks in a 1-D or 2-D or 3-D layout.
- A block contains threads in a 1-D or 2-D or 3-D layout.

- While calling the 'sum_vector' kernel method, we need to pass the following <<<num_blocks, BLOCK_SIZE>>>, where 'num_blocks' is the number of blocks and BLOCK_SIZE is the number of threads per block.
- We have fixed the number of threads per block to 32 in the above code.
- The number of blocks required is thus computed to be the number of elements in the array divided by 32.
- In case the number of elements in the array is not a multiple of 32, we need to take the ceiling function which is written as above (n+BLOCK_SIZE-1)/BLOCK_SIZE.

- The kernel function call is asynchronous, i.e. control immediately returns to CPU after call to the 'sum_vector' method.
- 'cudaDeviceSynchronize' ensures that if there are any CPU operations that require the output variable, then those operations wait till the kernel has completed executing all threads and populating the output variable.

- Allocated memory is freed using the 'cudaFree' method call on the variable.
- Inside the kernel function, the index of the current thread can be accessed by the formula:
- blockIdx.x * blockDim.x + threadIdx.x
- blockIdx.x is the index of the current block, blockDim.x is the number of threads per block and threadIdx.x is the index of the thread in the current block.
- The sum of the m-th elements of the two vectors is computed by the m-th thread.

- The kernel function does not return anything, so its return type is always 'void'. The output variable needs to be passed as a parameter and populated within the kernel.
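To build intuition for the block/thread indexing described above, the kernel launch can be emulated on the CPU with two nested loops, one over blocks and one over threads (a sketch only; on the GPU these iterations run concurrently rather than sequentially):

```cpp
#include <cassert>
#include <vector>

// CPU emulation of sum_vector<<<num_blocks, block_size>>>: each (block, thread)
// pair maps to one global index, exactly as blockIdx.x * blockDim.x + threadIdx.x.
std::vector<int> emulate_sum_vector(const std::vector<int> &vec1,
                                    const std::vector<int> &vec2, int block_size) {
    int n = vec1.size();
    int num_blocks = (n + block_size - 1) / block_size;  // ceiling division
    std::vector<int> output(n);
    for (int blockIdx = 0; blockIdx < num_blocks; blockIdx++) {
        for (int threadIdx = 0; threadIdx < block_size; threadIdx++) {
            int thread = blockIdx * block_size + threadIdx;  // global index
            if (thread < n)  // guard for the last, partially filled block
                output[thread] = vec1[thread] + vec2[thread];
        }
    }
    return output;
}
```

Note how the `thread < n` guard skips the surplus threads of the last block, just as in the kernel.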

Now that we are equipped with the basic concepts of a CUDA program, let's write our matrix transpose program. To begin with, we will write a single-threaded CPU version in C++11 for reference:

```cpp
#include <iostream>
#include <math.h>
#include <time.h>
#include <stdlib.h>
#include <random>
#include <vector>
#include <chrono>

std::vector<std::vector<int>> random_matrix(const int num_rows, const int num_cols,
                                            const int min_val=0, const int max_val=1000) {
    std::vector<std::vector<int>> my_arr;

    static std::random_device rd;
    static std::mt19937 mte(rd());
    std::uniform_int_distribution<int> dist(min_val, max_val);

    for (int i = 0; i < num_rows; i++) {
        std::vector<int> my_arr_col;
        for (int j = 0; j < num_cols; j++) {
            my_arr_col.push_back(dist(mte));
        }
        my_arr.push_back(my_arr_col);
    }
    return my_arr;
}

void transpose(int *odata, const int *idata, const int n, const int m) {
    for (int i = 0; i < n*m; i++) {
        int y = i/m;
        int x = i % m;
        odata[n*x + y] = idata[i];
    }
}

int main(void) {
    int n = 2000;
    int m = 5000;

    int *idata, *odata;
    idata = (int *)malloc(n*m*sizeof(int));
    odata = (int *)malloc(n*m*sizeof(int));

    std::vector<std::vector<int>> my_arr = random_matrix(n, m, 0, 100);

    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++) {
            idata[m*i + j] = my_arr[i][j];
        }
    }

    auto t1 = std::chrono::high_resolution_clock::now();
    transpose(odata, idata, n, m);
    auto t2 = std::chrono::high_resolution_clock::now();

    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count();
    std::cout << duration << std::endl;

    free(idata);
    free(odata);
    return 0;
}
```

Compile and run the above program as follows:

g++ -std=c++11 matrix_transpose.cpp -o matrix_transpose
./matrix_transpose

Few observations from the above program are:

- Note that we are transforming the 2D matrix of integers into a 1D array in the main function. This is because when we take this program to a GPU, it is easier to work with flattened 1D arrays than with multi-dimensional arrays in the kernel.
- Elements in a 2D matrix are arranged in row-major format in C/C++.
- Any index (i, j) in a 2D matrix is equivalent to the index m*i+j in a 1D array where 'm' is the number of columns of the matrix.

- The rest of the code is pretty intuitive. Note that we are timing the execution of the 'transpose' function. This will be used for benchmarking our timings for the GPU code.
- Generate a 2000x5000 matrix with random integers from uniform distribution (0,100)
- Convert this matrix to a 1D array using the m*i+j formula above.
- Transpose the 1D array using the following logic:
- If an index of an element in the original array is 'k', then the corresponding row in 2D format would be k/m and corresponding column would be k mod m, where m is the number of columns of the original matrix.
- Thus k -> 2D -> (k/m, k mod m) -> transpose -> (k mod m, k/m) -> 1D -> n*(k mod m) + k/m where n is the number of rows in the original matrix.
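The index arithmetic above can be sanity-checked with a small helper (not part of the original programs, just a verification of the formula):

```cpp
#include <cassert>

// For an n x m row-major matrix, flat index k lives at row k/m, column k%m.
// After transposing, that element lands at (k%m, k/m) in the m x n output,
// i.e. at flat index n*(k%m) + k/m.
int transposed_index(int k, int n, int m) {
    return n * (k % m) + k / m;
}
```

A useful property to check: applying the mapping twice (with the dimensions swapped the second time) must return the original index, since transposing a matrix twice gives back the matrix.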

Now we can move to our CUDA & C++11 implementation for the matrix transpose program:

```cpp
#include <iostream>
#include <math.h>
#include <stdlib.h>
#include <random>
#include <vector>
#include <chrono>
// random_matrix is the same helper as in the CPU version above

#define TILE_DIM 32
#define BLOCK_ROWS 8

__global__ void transposeNaive(int *odata, const int *idata, const int n, const int m) {
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        if (x < m && (y+j) < n) {
            odata[x*n + (y+j)] = idata[(y+j)*m + x];
        }
    }
}

bool check_correctness(int *odata, const int *idata, const int n, const int m) {
    for (int i = 0; i < n*m; i++) {
        int y = i/m;
        int x = i % m;
        if ((n*x + y) >= n*m || odata[n*x + y] != idata[i]) {
            return false;
        }
    }
    return true;
}

int main(void) {
    int n = 2000;
    int m = 5000;

    dim3 dimGrid((m + TILE_DIM - 1)/TILE_DIM, (n + TILE_DIM - 1)/TILE_DIM, 1);
    dim3 dimBlock(TILE_DIM, BLOCK_ROWS, 1);

    int *idata, *odata;
    cudaMallocManaged(&idata, n*m*sizeof(int));
    cudaMallocManaged(&odata, n*m*sizeof(int));

    std::vector<std::vector<int>> my_arr = random_matrix(n, m, 0, 100);

    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++) {
            idata[m*i + j] = my_arr[i][j];
        }
    }

    auto t1 = std::chrono::high_resolution_clock::now();
    transposeNaive<<<dimGrid, dimBlock>>>(odata, idata, n, m);
    cudaDeviceSynchronize();
    auto t2 = std::chrono::high_resolution_clock::now();

    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count();
    std::cout << duration << std::endl;
    std::cout << check_correctness(odata, idata, n, m) << std::endl;

    cudaFree(idata);
    cudaFree(odata);
    return 0;
}
```

To compile and run the above code, use the following commands:

nvcc -std=c++11 matrix_transpose_cuda.cu -o matrix_transpose_cuda
./matrix_transpose_cuda

Let's analyse the above code and break down the important components in the above CUDA code:

- In the CUDA program for matrix transpose, we are dividing a 2D matrix of dimensions nxm into tiles of size 32x32.
- Each block of threads operates on 1 tile. Each thread in a block operates on one element of the tile.
- Thus the number of blocks equals the number of tiles.
- For a nxm matrix with tiles of size 32x32, the number of tiles along the x-axis (horizontal direction) is the ceiling (m+32-1)/32, because there are m elements along the x-axis.
- Similarly, the number of tiles along the y-axis (vertical direction) is (n+32-1)/32.
- We are using 2D blocks instead of 1D blocks unlike our sum of vectors example.

- dim3 dimGrid((m + 32 - 1)/32, (n + 32 - 1)/32, 1)
- In our earlier sum of vectors example, the grid size was equal to the number of blocks.
- In this case, the blocks are laid out in 2D manner, thus we need to use 3D grid layout.
- The 1st term (m + 32 - 1)/32 denotes the number of blocks along x-axis. The 2nd term (n + 32 - 1)/32 denotes the number of blocks along y-axis.
- Since we have elements along 2 axes, thus the z-axis is set to 1.

- dim3 dimBlock(32, 8, 1)
- Now each block is also 2D layout of threads.
- We are defining each block such that the number of threads along x-axis to be 32, number of threads along y-axis to be 8.
- Note that the number of threads along the y-axis is 8 and not 32, even though 32 might have seemed natural since each tile is 32x32.
- This has been deliberately done so that each thread in a block computes the transpose of 4 elements (32/8 = 4) along the y-axis.
- It is often more efficient to re-use threads for computation than to pay the cost of scheduling additional threads.

- In the 'transposeNaive' kernel function, we obtain the current thread index as before. Note that now we access both horizontal and vertical indices for a thread.
- In each block of threads, the element corresponding to the location (y, x) is given by idata[y*m + x].
- Note that the row index is given by 'y' and column index by 'x' because 'y' is the index along vertical direction and 'x' along horizontal.
- We are using the additional loop over 'j' to allow each thread in a block to cover 4 elements along the y-axis.
- Thread at location (y, x) will transpose elements at (y, x), (y+8, x), (y+16, x) and (y+24, x).
- In this way threads from (y, x) to (y+7, x) will cover all 32 elements in a block from (y, x) to (y+31, x).

- Since during transpose, the location (y+j, x) in the original matrix of size nxm will become (x, y+j) in the transposed matrix of size mxn, thus the 1D index corresponding to (x, y+j) in the transposed matrix of size mxn would be x*n + (y+j).
- odata[x*n + (y+j)] = idata[(y+j)*m + x]

- It is always a good idea to verify that the transposed matrix through the parallel CUDA program is actually correct.
- For that we are using 'check_correctness' method.
- In this method, we transpose the matrix using the standard CPU code (as earlier) and verify that each element in the CUDA transposed matrix matches.

Now it's time to compare how the CUDA matrix transpose program performs as compared to the single-threaded CPU version. To profile the various function calls of the CUDA program, we run the following command:

nvprof ./matrix_transpose_cuda

Note however that the above method is not optimized. Consecutive threads in a block should access consecutive global memory addresses, because the hardware can then coalesce these accesses into fewer memory transactions. The reads of the input matrix satisfy this (they follow row-major order), but as we will see below, the writes do not.

Recall that when we are transforming a 2D matrix of dimensions nxm into a 1D array, we are flattening the 2D array in the below sequence:

[(0,0) (0,1) ... (0,m-1) (1,0) (1,1) ... (1,m-1) (2,0) .... (n-1,m-1)] - This is row-major order

Now the threads in a block are laid out in such a way that the (i, j)-th thread accesses the (i, j)-th element from the above array. When we read the 'idata' input matrix as:

idata[(y+j)*m + x]

the threads are reading the elements (for a fixed j) in the same order as they appear in the 'idata' array. This leads to memory coalescing and gives very high throughput in reading the data.

But note however that when we are writing the output to the 'odata' array, we are writing it in the following way:

odata[x*n + (y+j)]

i.e. for a fixed j, this is a column major order of writing (because we are transposing the array). The elements of the 2D matrix are written in the following sequence:

[(0,0) (1,0) ... (n-1,0) (0,1) (1,1) ... (n-1,1) (0,2) .... (n-1, m-1)] - This is column-major order

i.e. now the (i, j)-th thread is writing data to the (j, i)-th location in global memory. This will not lead to global memory coalescing because consecutive threads in a block write to memory locations that are separated by 'n' positions (the row length of the transposed output).

Thread (0,0) writes to location (0,0) in odata but thread (0,1) writes to location (1,0), which is separated by 'n' positions in global memory:

**(0,0)** (0,1) ... (0,n-1) **(1,0)** (1,1) ... (1,n-1) **(2,0)**

To overcome the above issue, we can leverage something known as shared memory. Shared memory is like a cache for all threads in a block. Each block has its own shared memory (local cache), which is independent from that of other blocks. All threads in a block can read from and write to their own block's shared memory.

Shared memory is useful in cases where we need to cache some computation for a block of threads so that other threads in the same block can re-use those values, e.g. computing a 1D stencil where the output at index i is the sum of the inputs from indices i-k to i+k for some k > 0.
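As a concrete reference for that stencil example, here is a CPU version (a sketch with a hypothetical radius k; in a CUDA kernel, each block would first stage its slice of the input, plus a halo of k elements on each side, into shared memory, since each input element is read by up to 2k+1 different threads):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// 1D stencil: output[i] = sum of input[i-k] .. input[i+k], clamped at the borders.
// On a GPU each input element is read by up to 2k+1 threads, which is exactly
// the re-use pattern that makes caching the inputs in shared memory pay off.
std::vector<int> stencil_1d(const std::vector<int> &input, int k) {
    int n = input.size();
    std::vector<int> output(n, 0);
    for (int i = 0; i < n; i++) {
        int lo = std::max(0, i - k), hi = std::min(n - 1, i + k);
        for (int j = lo; j <= hi; j++) output[i] += input[j];
    }
    return output;
}
```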

Kernel method leveraging shared memory for matrix transpose:

```cpp
__global__ void transposeSharedMem(int *odata, const int *idata, const int n, const int m) {
    __shared__ int tile[TILE_DIM*TILE_DIM];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;

    // copy this block's tile of the input into shared memory
    // (bounds check added for matrices whose dimensions are not multiples of TILE_DIM)
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        if (x < m && (y+j) < n) {
            tile[(threadIdx.y+j)*TILE_DIM + threadIdx.x] = idata[(y+j)*m + x];
        }
    }

    __syncthreads();

    // swap the block indices: the tile lands at the transposed block position
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        if (x < n && (y+j) < m) {
            odata[(y+j)*n + x] = tile[threadIdx.x*TILE_DIM + threadIdx.y+j];
        }
    }
}
```

We are declaring a shared array tile[TILE_DIM*TILE_DIM] for each block of threads. The functionality of this shared array is to copy the contents of the input array corresponding to the elements lying inside the current thread block.

We need to call '__syncthreads()' before accessing the elements of the shared array, because some thread may try to read an element of the shared array that has not yet been written by another thread. Thus we need to wait for all threads to finish populating the shared array.

Next we write to the output array in the coalesced sequence (0,0) (0,1) ... etc. but read from the shared array in the transposed, i.e. column-major, order (0,0) (1,0) (2,0) ... etc. This does not hurt the read performance because here we are reading from shared memory, not from global memory.

Also observe the x-index and the y-index of the odata array. The x-index is computed as:

x = **blockIdx.y** * TILE_DIM + threadIdx.x

Because in the output array the blocks are transposed (blockIdx.y, blockIdx.x) -> (blockIdx.x, blockIdx.y).

But reading from the shared memory in a column major order has a major drawback which we have not discussed above. This is due to the shared memory bank conflict.

The shared memory is arranged in memory banks. There are 32 memory banks available to each block of threads. The addresses are assigned to memory banks in round-robin fashion. For a 32x32 tile this implies that each column is assigned to a particular memory bank, i.e. column 0 is assigned to bank 0, column 1 to bank 1 and so on.

While reading in row-major order, a warp of 32 threads reads the shared memory from 32 different banks (within a row, all addresses fall into different memory banks), so there are no bank conflicts. But in the transpose operation above, while writing to the output array, we read the shared array 'tile' in column-major order. All addresses in a column belong to the same memory bank, so the 32 reads of a column are serialised.

This leads to performance degradation. One solution is to use 33 columns instead of 32. This makes sure that the addresses in each column are assigned to different memory banks, and there is no bank conflict.
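The effect of the padding can be checked with plain index arithmetic: with a row stride of 32 every element of a column falls into the same bank, while a stride of 33 spreads a column across all 32 banks (assuming 4-byte elements and 32 banks, which is the common configuration on NVIDIA GPUs):

```cpp
#include <cassert>
#include <set>

// Bank of a 4-byte word in shared memory: its word address modulo 32.
// 'stride' is the row stride of the tile in words (32 unpadded, 33 padded).
int bank_of(int row, int col, int stride) {
    return (row * stride + col) % 32;
}
```

With stride 33, row*33 ≡ row (mod 32), so walking down a column visits every bank exactly once.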

```cpp
__global__ void transposeSharedMem(int *odata, const int *idata, const int n, const int m) {
    // pad each row with one extra element to avoid bank conflicts
    __shared__ int tile[TILE_DIM * (TILE_DIM+1)];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        if (x < m && (y+j) < n) {
            tile[(threadIdx.y+j) * (TILE_DIM+1) + threadIdx.x] = idata[(y+j)*m + x];
        }
    }

    __syncthreads();

    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        if (x < n && (y+j) < m) {
            odata[(y+j)*n + x] = tile[threadIdx.x * (TILE_DIM+1) + threadIdx.y+j];
        }
    }
}
```

But it may not be an apples-to-apples comparison if we compare a single-threaded CPU implementation with a multi-threaded GPU implementation. Better would be to compare a multi-threaded CPU implementation with the GPU implementation.

Thus we write a multi-threaded C++11 implementation for the CPU:

```cpp
#include <iostream>
#include <math.h>
#include <time.h>
#include <stdlib.h>
#include <random>
#include <vector>
#include <chrono>
#include <thread>
#include <algorithm>
// random_matrix is the same helper as defined earlier

void transpose(int *odata, const int *idata, const int n, const int m,
               const int start, const int end) {
    for (int i = start; i < end; i++) {
        int y = i/m;
        int x = i % m;
        odata[n*x + y] = idata[i];
    }
}

bool check_correctness(int *odata, const int *idata, const int n, const int m) {
    for (int i = 0; i < n*m; i++) {
        int y = i/m;
        int x = i % m;
        if ((n*x + y) >= n*m || odata[n*x + y] != idata[i]) {
            return false;
        }
    }
    return true;
}

int main(void) {
    int n = 2000;
    int m = 5000;
    int n_threads = 50;

    int *idata, *odata;
    std::vector<std::thread> threads(n_threads);

    idata = (int *)malloc(n*m*sizeof(int));
    odata = (int *)malloc(n*m*sizeof(int));

    std::vector<std::vector<int>> my_arr = random_matrix(n, m, 0, 100);

    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++) {
            idata[m*i + j] = my_arr[i][j];
        }
    }

    auto t1 = std::chrono::high_resolution_clock::now();

    // divide the n*m elements equally among the threads
    int k = (n*m + n_threads - 1)/n_threads;

    for (int i = 0; i < n_threads; i++) {
        int start = k*i;
        int end = std::min(k*(i+1), n*m);
        threads[i] = std::thread(transpose, odata, idata, n, m, start, end);
    }

    for (int i = 0; i < n_threads; i++) {
        threads[i].join();
    }

    auto t2 = std::chrono::high_resolution_clock::now();

    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count();
    std::cout << duration << std::endl;
    std::cout << check_correctness(odata, idata, n, m) << std::endl;

    free(idata);
    free(odata);
    return 0;
}
```

To compile and run the above code:

g++ -std=c++11 -pthread matrix_transpose_threaded.cpp -o matrix_transpose_threaded
./matrix_transpose_threaded

Most of the code above is similar to the single threaded implementation.

We are spinning up 50 threads and the original array (after transformation from 2D to 1D) is equally divided among the 50 threads. Each thread transposes only its portion of the sub-array in parallel, i.e. with 2000*5000 integers, each thread transposes 2*10^{5} elements.

Thread 0 transposes indices 0 to 2*10^{5} - 1, thread 1 transposes indices 2*10^{5} to 4*10^{5} - 1 and so on.
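The start/end computation for each thread can be captured in a small helper mirroring the chunking logic in the code above:

```cpp
#include <algorithm>
#include <cassert>
#include <utility>

// Split 'total' elements into 'n_threads' contiguous chunks of size
// ceil(total / n_threads); the i-th thread gets the half-open range [start, end).
std::pair<int, int> chunk_range(int total, int n_threads, int i) {
    int k = (total + n_threads - 1) / n_threads;  // ceiling division
    int start = k * i;
    int end = std::min(k * (i + 1), total);
    return {start, end};
}
```

The std::min caps the last chunk so that the final thread does not run past the end of the array when the sizes do not divide evenly.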

It may be worthwhile to ponder why we are using only 50 threads. One can obtain the maximum possible number of threads for a process programmatically (on Linux, for example, from /proc/sys/kernel/threads-max). On my machine the maximum possible number of threads is 32754.

We note that with 50 threads, the time required to transpose the 2000x5000 matrix was around 50 milliseconds, whereas with 32K threads it was around 1.6 seconds. Thus more threads do not always mean better performance.

Here is the table of final performance numbers with 2000x5000 matrix:

| Approach | Time in Milliseconds |
| --- | --- |
| Single Threaded CPU | 250 |
| Multi-Threaded CPU (50 Threads) | 50 |
| CUDA NVIDIA Tesla P100 (without shared memory) | 27 |
| CUDA NVIDIA Tesla P100 (with shared memory and avoiding memory bank conflicts) | 21 |

Thus we see that although the multi-threaded C++ program performs about 5x better than the single-threaded program, the optimized GPU CUDA code on our NVIDIA Tesla P100 machine performs about 12x better than the single-threaded program and about 2.4x better than the multi-threaded CPU program.

Next we move on to our matrix multiplication application. Matrix multiplication is probably the most common matrix operation in both machine learning as well as deep learning. Thus this needs special attention. We begin with our naive C++11 implementation i.e. single-threaded CPU program:

```cpp
#include <iostream>
#include <math.h>
#include <time.h>
#include <stdlib.h>
#include <random>
#include <vector>
#include <chrono>
// random_matrix is the same helper as defined earlier

void multiply(const int *mat_1, const int *mat_2, int *mat_prod,
              const int n, const int m, const int p) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < p; j++) {
            int sum = 0;
            for (int k = 0; k < m; k++) {
                sum += mat_1[i*m+k] * mat_2[k*p+j];
            }
            mat_prod[i*p+j] = sum;
        }
    }
}

int main(void) {
    int n = 2000;
    int m = 1000;
    int p = 5000;

    int *mat_1, *mat_2, *mat_prod;

    mat_1 = (int *)malloc(n*m*sizeof(int));
    mat_2 = (int *)malloc(m*p*sizeof(int));
    mat_prod = (int *)malloc(n*p*sizeof(int));

    std::vector<std::vector<int>> my_arr_1 = random_matrix(n, m, 0, 10);
    std::vector<std::vector<int>> my_arr_2 = random_matrix(m, p, 0, 10);

    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++) {
            mat_1[m*i + j] = my_arr_1[i][j];
        }
    }

    for (int i = 0; i < m; i++) {
        for (int j = 0; j < p; j++) {
            mat_2[p*i + j] = my_arr_2[i][j];
        }
    }

    auto t1 = std::chrono::high_resolution_clock::now();
    multiply(mat_1, mat_2, mat_prod, n, m, p);
    auto t2 = std::chrono::high_resolution_clock::now();

    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count();
    std::cout << duration << std::endl;

    free(mat_1);
    free(mat_2);
    free(mat_prod);
    return 0;
}
```

The above code is pretty straightforward if you know C/C++ or have followed this post from the beginning:

- Initialize two random matrices 'mat_1' of dimensions (n x m) 2000x1000 and 'mat_2' of dimensions (m x p) 1000x5000. Fill them with integers sampled from uniform distribution (0,10).
- As before convert the 2D matrices to 1D contiguous arrays, i.e. index (i, j) is equivalent to index i*m+j where 'm' is the number of columns.
- Multiply the two matrices using the standard O(n*m*p) complexity algorithm.
- Observe that we are not using any sub-O(n^{3}) algorithm (such as Strassen's algorithm, which is O(n^{2.81})) or cache-oblivious algorithms to optimize the multiply method.

Now since we have our benchmark algorithm ready, we move to the CUDA implementation:

```cpp
#define TILE_DIM 32

__global__ void multiplyNaive(const int *mat_1, const int *mat_2, int *mat_prod,
                              const int n, const int m, const int p) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (y < n && x < p) {
        int sum = 0;
        for (int i = 0; i < m; i++) {
            sum += mat_1[y*m+i] * mat_2[i*p+x];
        }
        mat_prod[y*p+x] = sum;
    }
}

bool check_correctness(const int *mat_1, const int *mat_2, int *mat_prod,
                       const int n, const int m, const int p) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < p; j++) {
            int sum = 0;
            for (int k = 0; k < m; k++) {
                sum += mat_1[i*m+k] * mat_2[k*p+j];
            }
            if (sum != mat_prod[i*p+j]) {
                return false;
            }
        }
    }
    return true;
}

int main(void) {
    int n = 2000;
    int m = 1000;
    int p = 5000;

    dim3 dimGrid((p + TILE_DIM - 1)/TILE_DIM, (n + TILE_DIM - 1)/TILE_DIM, 1);
    dim3 dimBlock(TILE_DIM, TILE_DIM, 1);

    int *mat_1, *mat_2, *mat_prod;

    cudaMallocManaged(&mat_1, n*m*sizeof(int));
    cudaMallocManaged(&mat_2, m*p*sizeof(int));
    cudaMallocManaged(&mat_prod, n*p*sizeof(int));

    std::vector<std::vector<int>> my_arr_1 = random_matrix(n, m, 0, 10);
    std::vector<std::vector<int>> my_arr_2 = random_matrix(m, p, 0, 10);

    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++) {
            mat_1[m*i + j] = my_arr_1[i][j];
        }
    }

    for (int i = 0; i < m; i++) {
        for (int j = 0; j < p; j++) {
            mat_2[p*i + j] = my_arr_2[i][j];
        }
    }

    auto t1 = std::chrono::high_resolution_clock::now();
    multiplyNaive<<<dimGrid, dimBlock>>>(mat_1, mat_2, mat_prod, n, m, p);
    cudaDeviceSynchronize();
    auto t2 = std::chrono::high_resolution_clock::now();

    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count();
    std::cout << duration << std::endl;
    std::cout << check_correctness(mat_1, mat_2, mat_prod, n, m, p) << std::endl;

    cudaFree(mat_1);
    cudaFree(mat_2);
    cudaFree(mat_prod);
    return 0;
}
```

The logic for parallel matrix multiplication is quite simple. For the matrix product C=A.B, compute each element of C in a different thread. In CUDA implementation, we divide the matrix C into blocks of size 32x32, i.e. each block contains 32x32 threads.

To compute C[y][x], where y is the row index and x is the column index, the equivalent 1D index would be C[y*p+x], where 'p' is the number of columns in C. We can obtain the thread index corresponding to the y-th row and x-th column (as before in our transpose code):

x = blockIdx.x * blockDim.x + threadIdx.x

y = blockIdx.y * blockDim.y + threadIdx.y

To compute C[y*p+x] we need to iterate over all columns for the y-th row in matrix A and over all rows for the x-th column in matrix B and take their dot product.

sum += A[y*m+i] * B[i*p+x], for all i from 0 to m-1

Note that for each thread corresponding to index (y, x) we need to iterate all columns of A and all rows of B every-time.

From the above discussion on coalescing, recall that consecutive threads in a block are arranged in row-major order, i.e. the threads in a 32x32 block are ordered as:

[(0,0) (0,1) ... (0,31) (1,0) (1,1) .....(31,31)]

Thus for a fixed value of i in the summation formula above, when we go from thread (0,0) to (0,31), the index into matrix A does not change, because A[y*m+i] depends only on the y value; similarly for any row y, from (y,0) to (y,31). Thus every 32 consecutive threads access only a single value from matrix A.

Thus the read bandwidth for A is reduced by a factor of 32.

No such problem arises with matrix B, because for a fixed i, consecutive threads access consecutive elements of B. But recall that for each element (y, x) of C, we need to read from global memory all the rows of B corresponding to column 'x'.

Thus each element of B is read 'n' times from the global memory, where 'n' is the number of rows in C. Similarly each element in A is read 'p' times from the global memory, where 'p' is the number of columns in C.

The above problems can be resolved by leveraging shared memory in the kernel function.

- Shared memory is much faster (~100x) than global memory. Thus we can cache the matrices A and B once in shared memory and then read each element multiple times from shared memory.
- Also we will read the matrix A into shared memory once in the row-major order as shown above (due to global memory coalescing). Then for the summation, we can read the values in column major order from the shared memory.

One possible implementation for the kernel with shared memory:

```cpp
__global__ void multiplySharedMem(const int *mat_1, const int *mat_2, int *mat_prod,
                                  const int n, const int p) {
    __shared__ int aTile[TILE_DIM][M_DIM], bTile[M_DIM][TILE_DIM];

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    aTile[threadIdx.y][threadIdx.x] = mat_1[y*M_DIM+threadIdx.x];
    bTile[threadIdx.y][threadIdx.x] = mat_2[threadIdx.y*p+x];

    __syncthreads();

    if (y < n && x < p) {
        int sum = 0;
        for (int i = 0; i < M_DIM; i++) {
            sum += aTile[threadIdx.y][i] * bTile[i][threadIdx.x];
        }
        mat_prod[y*p+x] = sum;
    }
}
```

Note that the variable M_DIM is actually the common dimension 'm' of the two matrices (m=1000). Since each thread block in C is of size 32x32, thus the relevant sub-matrix in A will be of size 32xm and relevant sub-matrix in B will be of size mx32. Thus the shared memory matrix that will hold A has to be of the size 32xm and for B it is mx32.

The above kernel will work well when m (M_DIM) is small (~50-60), but in most practical scenarios m can be very large (in our example m=1000), and the required tiles will exceed the shared memory available per block, causing the kernel to fail.
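A quick back-of-the-envelope calculation shows why: the two tiles above together need 2 * 32 * m * sizeof(int) bytes of shared memory, far beyond the roughly 48 KB per block that is typical on GPUs such as the P100 (the 48 KB figure is the common default; exact limits vary by architecture):

```cpp
#include <cassert>

// Shared memory needed by aTile (32 x m) and bTile (m x 32), in bytes.
long shared_bytes(int tile_dim, int m) {
    return 2L * tile_dim * m * sizeof(int);
}
```

With m = 1000 this is about 256 KB (assuming 4-byte ints), while a 32x32 pair of tiles needs only 8 KB, which fits comfortably.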

Instead we can divide the common dimension m into k equal parts of size 32 each, i.e. m_{1}, m_{2}, ..., m_{k}, where m_{i}=32 for 1 <= i <= k-1 and m_{k} <= 32. Thus the matrix A is now divided into k equal parts along the x-axis, A_{1}, A_{2}, ... A_{k}, where A_{i} is a sub-matrix of shape n x 32 (n rows and 32 columns). Similarly matrix B is divided into k equal parts along the y-axis, B_{1}, B_{2}, ... B_{k}, where B_{i} is a sub-matrix of shape 32 x p (32 rows and p columns). Thus now:

C_{i} = A_{i}.B_{i}

Observe that C_{i} is an n x p matrix similar to the matrix C, but C_{i} is incomplete. To obtain C from the C_{i}'s, we need to add them all up for i from 1 to k, i.e. C = C_{1} + C_{2} + ... + C_{k}.
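The decomposition can be verified on small matrices: accumulating the partial products over chunks of the common dimension reproduces the full product (a CPU sanity check with a hypothetical chunk size of 2):

```cpp
#include <cassert>
#include <vector>

// Multiply an n x m matrix by an m x p matrix (flat, row-major), restricted to
// the [start, end) slice of the common dimension, accumulating into c.
void multiply_chunk(const std::vector<int> &a, const std::vector<int> &b,
                    std::vector<int> &c, int n, int m, int p, int start, int end) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < p; j++)
            for (int k = start; k < end; k++)
                c[i * p + j] += a[i * m + k] * b[k * p + j];
}
```

Calling this once with [0, m) gives the full product; calling it once per chunk and summing into the same output gives the same result, which is exactly what the streamed kernels below rely on.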

Each C_{i} is computed by the same kernel function as above. But now since for each i, the number of columns in A_{i} and number of rows in B_{i} is 32, thus the shared memory size is 32x32 for both 'aTile' and 'bTile'. Each tile of size 32x32 for C_{i} is computed using 32x32 sized tiles from A_{i} and B_{i} respectively. Thus the kernel function for each C_{i} is:

__global__ void multiplySharedMem(const int *mat_1, const int *mat_2, int *mat_prod, const int n, const int p) {
    __shared__ int aTile[TILE_DIM][TILE_DIM], bTile[TILE_DIM][TILE_DIM];
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    aTile[threadIdx.y][threadIdx.x] = mat_1[y*M_DIM+threadIdx.x];
    bTile[threadIdx.y][threadIdx.x] = mat_2[threadIdx.y*p+x];
    __syncthreads();
    if (y < n && x < p) {
        int sum = 0;
        for (int i = 0; i < TILE_DIM; i++) {
            sum += aTile[threadIdx.y][i] * bTile[i][threadIdx.x];
        }
        mat_prod[y*p+x] = sum;
    }
}

But the above kernel computes only one C_{i}. We need to compute C_{i} for all i from 1 to k, and also sum all the C_{i}'s to obtain the final matrix C.

- One possible way is to call the same kernel function k times in a for loop.
- So far we have been launching the kernel in the default stream. Kernels launched in the same stream (the default stream) execute sequentially, i.e. each C_{i} would be computed one after another.
- To overlap the computations, we launch the kernel for each C_{i} in a different, non-default stream.

The updated code for shared memory matrix multiplication:

#include <iostream>
#include <vector>
#include <chrono>

#define TILE_DIM 32

// random_matrix and check_correctness are the helper functions used earlier in the post.

__global__ void multiplySharedMem(const int *mat_1, const int *mat_2, int *mat_prod, const int n, const int m, const int p, const int start, const int end) {
    __shared__ int aTile[TILE_DIM][TILE_DIM], bTile[TILE_DIM][TILE_DIM];
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int col = threadIdx.x + start;   // column of A loaded by this thread
    int row = threadIdx.y + start;   // row of B loaded by this thread
    // Zero-fill guards against out-of-range reads on boundary blocks and the ragged last partition.
    aTile[threadIdx.y][threadIdx.x] = (y < n && col < m) ? mat_1[y*m+col] : 0;
    bTile[threadIdx.y][threadIdx.x] = (row < m && x < p) ? mat_2[row*p+x] : 0;
    __syncthreads();
    int dims_tile = end-start;
    if (y < n && x < p) {
        int sum = 0;
        for (int i = 0; i < dims_tile; i++) {
            sum += aTile[threadIdx.y][i] * bTile[i][threadIdx.x];
        }
        // Accumulate ('+=' semantics) rather than overwrite, since each stream contributes one C_i.
        // atomicAdd avoids a data race when kernels in different streams run concurrently.
        atomicAdd(&mat_prod[y*p+x], sum);
    }
}

int main(void) {
    int n = 2000;
    int m = 1000;
    int p = 5000;
    dim3 dimGrid((p + TILE_DIM - 1)/TILE_DIM, (n + TILE_DIM - 1)/TILE_DIM, 1);
    dim3 dimBlock(TILE_DIM, TILE_DIM, 1);
    int *mat_1, *mat_2, *mat_prod;
    cudaMallocManaged(&mat_1, n*m*sizeof(int));
    cudaMallocManaged(&mat_2, m*p*sizeof(int));
    cudaMallocManaged(&mat_prod, n*p*sizeof(int));
    // The product matrix must start zeroed since the kernel accumulates into it.
    cudaMemset(mat_prod, 0, n*p*sizeof(int));
    std::vector<std::vector<int>> my_arr_1 = random_matrix(n, m, 0, 10);
    std::vector<std::vector<int>> my_arr_2 = random_matrix(m, p, 0, 10);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++) {
            mat_1[m*i + j] = my_arr_1[i][j];
        }
    }
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < p; j++) {
            mat_2[p*i + j] = my_arr_2[i][j];
        }
    }
    int num_parts = (m+TILE_DIM-1)/TILE_DIM;
    cudaStream_t stream[num_parts];
    auto t1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < num_parts; i++) {
        int start = i*TILE_DIM;
        int end = m < ((i+1)*TILE_DIM) ? m : ((i+1)*TILE_DIM);
        cudaStreamCreate(&stream[i]);
        multiplySharedMem<<<dimGrid, dimBlock, 0, stream[i]>>>(mat_1, mat_2, mat_prod, n, m, p, start, end);
    }
    cudaDeviceSynchronize();
    auto t2 = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count();
    std::cout << duration << std::endl;
    std::cout << check_correctness(mat_1, mat_2, mat_prod, n, m, p) << std::endl;
    for (int i = 0; i < num_parts; i++) {
        cudaStreamDestroy(stream[i]);
    }
    cudaFree(mat_1);
    cudaFree(mat_2);
    cudaFree(mat_prod);
    return 0;
}

- 'num_parts' is the number of partitions of the common dimension 'm' into equal-sized partitions of size 32 (the last partition may be smaller).
- We create 'num_parts' CUDA streams.
- The kernel is launched in a different stream for each partition, and the C_{i}'s are accumulated to obtain the final matrix C.
- Note that the kernel accumulates into 'mat_prod[y*p+x]' instead of simply assigning to it, because we need to sum over all the C_{i}'s.
- We pass the 'start' and 'end' indices to the kernel in order to compute the offsets into matrix A (column offset) and matrix B (row offset).
- If 'm' is not divisible by 32, the last partition has size less than 32. Hence the kernel uses 'dims_tile = end-start' and loops up to dims_tile instead of TILE_DIM.

As before, we compare the CUDA matrix multiplication implementation with a single-threaded CPU program as well as a multi-threaded CPU program. I am not including the multi-threaded program here, but you can find it on my Github profile.

Also as before, in order to avoid memory bank conflicts with shared memory, we use 33 columns for the 'aTile' and 'bTile' shared arrays instead of 32:

__shared__ int aTile[TILE_DIM][TILE_DIM+1], bTile[TILE_DIM][TILE_DIM+1];

The final performance numbers are as follows:

| Approach | Time in Milliseconds |
| --- | --- |
| Single Threaded CPU | 33730 |
| Multi-Threaded CPU (100 Threads) | 3020 |
| CUDA NVIDIA Tesla P100 (without shared memory) | 85 |
| CUDA NVIDIA Tesla P100 (with shared memory) | 57 |
| CUDA NVIDIA Tesla P100 (with shared memory and avoiding memory bank conflict) | 52 |

The gap between the single-threaded CPU implementation and the optimized GPU implementation (shared memory + avoiding memory bank conflicts) is far larger than what we saw for the matrix transpose program: over 650x improvement with the GPU. That's huge!

Matrix Convolution : CUDA GPU Implementation

A simple CUDA C++11 code for doing matrix convolution on GPU devices:

#include <iostream>
#include <vector>
#include <chrono>

#define TILE_DIM 32

__global__ void convolve(const int *mat_1, const int *mat_2, int *mat_conv, const int n, const int m, const int p, const int q, const int stride, const int u, const int v) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int start_row = stride*y;
    int start_col = stride*x;
    if (y < u && x < v && start_row+p <= n && start_col+q <= m) {
        int sum = 0;
        for (int i = start_row; i < start_row+p; i++) {
            int i1 = i-start_row;
            for (int j = start_col; j < start_col+q; j++) {
                int j1 = j-start_col;
                sum += mat_1[i*m+j]*mat_2[i1*q+j1];
            }
        }
        mat_conv[y*v+x] = sum;
    }
}

bool check_correctness(const int *mat_1, const int *mat_2, int *mat_conv, const int n, const int m, const int p, const int q, const int stride, const int u, const int v) {
    for (int i = 0; i < u*v; i++) {
        int r = i/v;
        int c = i % v;
        int start_row = stride*r;
        int start_col = stride*c;
        if (start_row+p <= n && start_col+q <= m) {
            int sum = 0;
            for (int i1 = start_row; i1 < start_row+p; i1++) {
                int i2 = i1-start_row;
                for (int j1 = start_col; j1 < start_col+q; j1++) {
                    int j2 = j1-start_col;
                    sum += mat_1[i1*m+j1]*mat_2[i2*q+j2];
                }
            }
            if (sum != mat_conv[i]) {
                return false;
            }
        } else {
            return false;
        }
    }
    return true;
}

int main(void) {
    int n = 2000;
    int m = 5000;
    int p = 3;
    int q = 3;
    int stride = 1;
    int u = (n-p)/stride + 1;
    int v = (m-q)/stride + 1;
    dim3 dimGrid((v + TILE_DIM - 1)/TILE_DIM, (u + TILE_DIM - 1)/TILE_DIM, 1);
    dim3 dimBlock(TILE_DIM, TILE_DIM, 1);
    int *mat_1, *mat_2, *mat_conv;
    cudaMallocManaged(&mat_1, n*m*sizeof(int));
    cudaMallocManaged(&mat_2, p*q*sizeof(int));
    cudaMallocManaged(&mat_conv, u*v*sizeof(int));
    std::vector<std::vector<int>> my_arr_1 = random_matrix(n, m, 0, 10);
    std::vector<std::vector<int>> my_arr_2 = random_matrix(p, q, 0, 10);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++) {
            mat_1[i*m + j] = my_arr_1[i][j];
        }
    }
    for (int i = 0; i < p; i++) {
        for (int j = 0; j < q; j++) {
            mat_2[i*q + j] = my_arr_2[i][j];
        }
    }
    auto t1 = std::chrono::high_resolution_clock::now();
    convolve<<<dimGrid, dimBlock>>>(mat_1, mat_2, mat_conv, n, m, p, q, stride, u, v);
    cudaDeviceSynchronize();
    auto t2 = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count();
    std::cout << duration << std::endl;
    std::cout << check_correctness(mat_1, mat_2, mat_conv, n, m, p, q, stride, u, v) << std::endl;
    return 0;
}

For those who are unaware of what the convolution operation is, you can refer to one of my older blog posts.

The logic of the code is pretty straightforward and similar to our previous matrix operations codes.

- We have a large matrix 'mat_1' of dimensions 2000x5000 (n x m) and a small kernel matrix 'mat_2' of size 3x3 (p x q). We convolve the kernel matrix over the large matrix and generate a matrix of shape u x v.
- We have an additional 'stride' length parameter.
- The sliding window moves 'stride' cells to the right for the next convolution within a row, and 'stride' cells down for the next row of convolutions.
- u = (n-p)/stride + 1 and v = (m-q)/stride + 1

- Divide the u x v output matrix into tiles of size 32x32.
- Block size is 32x32 threads.

- Using shared memory here is more involved than in the earlier examples, but it is worth a try.

The final performance numbers are as follows:

Single-Threaded CPU program = 11.7 seconds

Above GPU CUDA-C++11 program = 140 milliseconds, roughly an 80x speedup.

All of the above codes are shared on my Github profile.

References available on the internet:

- CUDA Basics
- Optimize data transfers between CPU and GPU using pinned memory.
- Overlapping data transfer with kernel execution by using multiple streams for different tasks.
- Memory Coalescing using warps of threads.
- Using local shared memory and memory banks for optimizing read and write across threads.
- CUDA C Best Practices.
- CUDA Streams and Concurrency
- Unified Memory in CUDA
