We have seen how we can use kernel blocks, in one through three dimensions, in order to specify that CUDA should perform a number of function calls in parallel.

Today we will look at using CUDA threads. These allow us to work around the CUDA limit on the number of blocks, communicate between threads, and in some cases improve performance.

In CUDA, there is a limit on the number of blocks in any one grid dimension. This limit is $2^{16} - 1 = 65,535$.

Consider the following program which computes $N$ square roots:

```
#include <stdio.h>
#include <stdlib.h>

/* 50 thousand */
#define N 50000

/* the kernel fills an array up with square roots */
__global__ void square_roots(double* array) {
    /* find my array index */
    unsigned int idx = blockIdx.x;

    /* compute the square root of this number and store it in the array
       CUDA provides a sqrt function as part of its math library, so this
       call uses that - not the one from math.h which is a CPU function */
    array[idx] = sqrt((double) idx);
}

int main() {
    /* store the square roots of 0 to (N-1) on the CPU
       stored on the heap since it's too big for the stack for large values of N */
    double* roots = (double*) malloc(N * sizeof(double));

    /* allocate a GPU array to hold the square roots */
    double* gpu_roots;
    cudaMalloc((void**) &gpu_roots, N * sizeof(double));

    /* invoke the GPU to calculate the square roots */
    square_roots<<<N, 1>>>(gpu_roots);

    /* copy the data back */
    cudaMemcpy(roots, gpu_roots, N * sizeof(double), cudaMemcpyDeviceToHost);

    /* free the GPU memory */
    cudaFree(gpu_roots);

    /* print out 100 evenly spaced square roots just to see that it worked */
    unsigned int i;
    for (i = 0; i < N; i += (N / 100)) {
        printf("sqrt(%u) = %lf\n", i, roots[i]);
    }

    /* free the CPU memory */
    free(roots);
    return 0;
}
```

This program uses an array for storing the square roots. The CPU invokes the GPU to compute the $N$ square roots, then copies the data back to the CPU.

This program will work as long as $N \le 65,535$.

If we change $N$ to be a larger value, as in this program:

```
/* 10 million */
#define N 10000000
```

The program will no longer work. It compiles and runs, but every element of the array comes back as 0.

The program above *does* actually produce an error, but
we have so far neglected error checking in our CUDA programs.

All CUDA function calls (except kernel launches) return a `cudaError_t` value
which indicates whether or not there was an error in that function.

If the value is equal to `cudaSuccess`, then everything was OK.
Otherwise, there was some type of error, and the program should quit.
The `cudaGetErrorString` function converts the error code into a human-readable message.

In order to check CUDA kernels, we need two other functions:

- `cudaPeekAtLastError`, which we can use to check whether the kernel launch itself was in error.
- `cudaDeviceSynchronize`, which waits for the kernel to finish and returns its status. This is necessary because CUDA kernel launches do not block the CPU. It has not been an issue thus far because CUDA will automatically synchronize when we read GPU memory with a call to `cudaMemcpy`.

The following program includes error checking for every CUDA call.

```
#include <stdio.h>
#include <stdlib.h>

/* 10 million */
#define N 10000000

/* the kernel fills an array up with square roots */
__global__ void square_roots(double* array) {
    /* find my array index */
    unsigned int idx = blockIdx.x;

    /* compute the square root of this number and store it in the array
       CUDA provides a sqrt function as part of its math library, so this
       call uses that - not the one from math.h which is a CPU function */
    array[idx] = sqrt((double) idx);
}

int main() {
    /* error status */
    cudaError_t status;

    /* store the square roots of 0 to (N-1) on the CPU
       stored on the heap since it's too big for the stack for large values of N */
    double* roots = (double*) malloc(N * sizeof(double));

    /* allocate a GPU array to hold the square roots */
    double* gpu_roots;
    status = cudaMalloc((void**) &gpu_roots, N * sizeof(double));
    if (status != cudaSuccess) {
        printf("Error: %s\n", cudaGetErrorString(status));
        return 1;
    }

    /* invoke the GPU to calculate the square roots */
    square_roots<<<N, 1>>>(gpu_roots);

    /* check and see if there was an error on the launch itself */
    status = cudaPeekAtLastError();
    if (status != cudaSuccess) {
        printf("Error: %s\n", cudaGetErrorString(status));
        return 1;
    }

    /* wait for the kernel to complete, and check the resulting status code */
    status = cudaDeviceSynchronize();
    if (status != cudaSuccess) {
        printf("Error: %s\n", cudaGetErrorString(status));
        return 1;
    }

    /* copy the data back */
    status = cudaMemcpy(roots, gpu_roots, N * sizeof(double), cudaMemcpyDeviceToHost);
    if (status != cudaSuccess) {
        printf("Error: %s\n", cudaGetErrorString(status));
        return 1;
    }

    /* free the GPU memory */
    status = cudaFree(gpu_roots);
    if (status != cudaSuccess) {
        printf("Error: %s\n", cudaGetErrorString(status));
        return 1;
    }

    /* print out 100 evenly spaced square roots just to see that it worked */
    unsigned int i;
    for (i = 0; i < N; i += (N / 100)) {
        printf("sqrt(%u) = %lf\n", i, roots[i]);
    }

    /* free the CPU memory */
    free(roots);
    return 0;
}
```

The error checking in the example above takes more code than the actual logic of interacting with CUDA. A common approach is to use a C macro to simplify the error handling.

A macro provides a shortcut for a longer block of code. We can write a macro as follows:

```
/* an error handling macro */
#define CHECK(x) {cudaError_t code = (x);\
    if (code != cudaSuccess) {\
        printf("Error in %s, line %d: %s.\n", __FILE__, __LINE__, cudaGetErrorString(code));\
        exit(code);}}
```

This greatly simplifies the main function code:

```
int main() {
    /* store the square roots of 0 to (N-1) on the CPU
       stored on the heap since it's too big for the stack for large values of N */
    double* roots = (double*) malloc(N * sizeof(double));

    /* allocate a GPU array to hold the square roots */
    double* gpu_roots;
    CHECK(cudaMalloc((void**) &gpu_roots, N * sizeof(double)));

    /* invoke the GPU to calculate the square roots */
    square_roots<<<N, 1>>>(gpu_roots);

    /* check and see if there was an error on the launch itself */
    CHECK(cudaPeekAtLastError());

    /* wait for the kernel to complete, and check the resulting status code */
    CHECK(cudaDeviceSynchronize());

    /* copy the data back */
    CHECK(cudaMemcpy(roots, gpu_roots, N * sizeof(double), cudaMemcpyDeviceToHost));

    /* free the GPU memory */
    CHECK(cudaFree(gpu_roots));

    /* print out 100 evenly spaced square roots just to see that it worked */
    unsigned int i;
    for (i = 0; i < N; i += (N / 100)) {
        printf("sqrt(%u) = %lf\n", i, roots[i]);
    }

    /* free the CPU memory */
    free(roots);
    return 0;
}
```

Another benefit is that the `CHECK` macro automatically adds the current file name and line number to the error message.

It is strongly recommended to check all CUDA calls for errors, rather than letting them fail silently and produce erroneous results.

Given that we cannot run more than 65,535 blocks in one dimension, how can we compute more square roots?

One way to do it is to have multiple threads per block.
In this scheme, we use the *second* parameter in the `<<<blocks, threads>>>`
kernel launch syntax.

CUDA will launch the specified number of threads for each block in the kernel. Each thread can then calculate its global index with the following formula:

$id = blockId \times threadsPerBlock + threadId$

We can launch up to 1,024 threads for each block. How many total threads can we then execute?

The following program implements this:

```
#include <stdio.h>
#include <stdlib.h>

/* number of blocks (max 65,535) */
#define BLOCKS 50000

/* number of threads per block (max 1,024) */
#define THREADS 200

/* N is now the number of blocks times the number of threads
   in this case it is 10 million */
#define N (BLOCKS * THREADS)

/* the kernel fills an array up with square roots */
__global__ void square_roots(double* array) {
    /* find my array index
       this now uses three values:
       blockIdx.x which is the block ID we are on
       blockDim.x which is the number of threads in each block
       threadIdx.x which is the thread ID we are within our block */
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;

    /* compute the square root of this number and store it in the array
       CUDA provides a sqrt function as part of its math library, so this
       call uses that - not the one from math.h which is a CPU function */
    array[idx] = sqrt((double) idx);
}

int main() {
    /* store the square roots of 0 to (N-1) on the CPU
       stored on the heap since it's too big for the stack for large values of N */
    double* roots = (double*) malloc(N * sizeof(double));

    /* allocate a GPU array to hold the square roots */
    double* gpu_roots;
    cudaMalloc((void**) &gpu_roots, N * sizeof(double));

    /* invoke the GPU to calculate the square roots
       now, we don't run N blocks, because that would be more than 65,535 */
    square_roots<<<BLOCKS, THREADS>>>(gpu_roots);

    /* copy the data back */
    cudaMemcpy(roots, gpu_roots, N * sizeof(double), cudaMemcpyDeviceToHost);

    /* free the GPU memory */
    cudaFree(gpu_roots);

    /* print out 100 evenly spaced square roots just to see that it worked */
    unsigned int i;
    for (i = 0; i < N; i += (N / 100)) {
        printf("sqrt(%u) = %lf\n", i, roots[i]);
    }

    /* free the CPU memory */
    free(roots);
    return 0;
}
```

What if we needed more than $65,535 \times 1,024$ total threads?

We could work around this limitation using multi-dimensional grids of blocks.
Recall that the limit of 65,535 applies to *any one grid dimension*.

Because CUDA grids can be three-dimensional, our actual limit of parallelism is $65,535^3 \times 1,024 \approx 2^{58} \approx 2.88 \times 10^{17}$.

This number should be large enough for nearly all cases.
If we *did* need more computations than that, we could launch multiple
CUDA kernels in serial.
They would of course not be done in parallel, but it's unlikely any GPU
would ever have that many cores.

Copyright © 2019 Ian Finlayson | Licensed under a Creative Commons Attribution 4.0 International License.