CUDA Continued

CUDA Error Checking

All CUDA function calls (except kernel launches) return a cudaError_t value which indicates whether or not the function encountered an error.

If the value is equal to cudaSuccess, then everything was OK. Otherwise, there was some kind of error, and the program should quit. The cudaGetErrorString function converts the error code into a human-readable message we can print.

In order to check CUDA kernels, we need two other functions:

• cudaPeekAtLastError, which we can use to check whether the kernel launch itself failed.
• cudaDeviceSynchronize, which waits for the kernel to finish and returns its status. This is necessary because kernel launches do not block the CPU. This has not been an issue thus far because CUDA automatically synchronizes when we read GPU memory with a call to cudaMemcpy.

You can see what checking all of the errors looks like in this program.

A Macro for Error Checking

The error checking in the example above takes more code than the actual logic of interacting with CUDA. A common approach is to use a C macro to simplify the error handling.

A macro provides a shortcut for a longer block of code. We can write a macro as follows:


/* an error handling macro */
#define CHECK(x) {cudaError_t code = (x);\
    if (code != cudaSuccess) {\
        printf("Error in %s, line %d: %s.\n", __FILE__, __LINE__, cudaGetErrorString(code));\
        exit(code);}}


This greatly simplifies the main function code:


int main() {
    /* store the square roots of 0 to (N-1) on the CPU
     * stored on the heap since it's too big for the stack for large values of N */
    double* roots = (double*) malloc(N * sizeof(double));

    /* allocate a GPU array to hold the square roots */
    double* gpu_roots;
    CHECK(cudaMalloc((void**) &gpu_roots, N * sizeof(double)));

    /* invoke the GPU to calculate the square roots */
    square_roots<<<N, 1>>>(gpu_roots);

    /* check and see if there was an error on the call */
    CHECK(cudaPeekAtLastError());

    /* wait for the kernel to complete, and check the resulting status code */
    CHECK(cudaDeviceSynchronize());

    /* copy the data back */
    CHECK(cudaMemcpy(roots, gpu_roots, N * sizeof(double), cudaMemcpyDeviceToHost));

    /* free the GPU memory */
    CHECK(cudaFree(gpu_roots));

    /* print out 100 evenly spaced square roots just to see that it worked */
    unsigned int i;
    for (i = 0; i < N; i += (N / 100)) {
        printf("sqrt(%d) = %lf\n", i, roots[i]);
    }

    /* free the CPU memory */
    free(roots);

    return 0;
}


Another benefit here is that the CHECK macro automatically adds the current file name and line number to the error message.

It is strongly recommended to check all CUDA calls for errors rather than letting them fail silently and produce erroneous results.

Shared Variables

In order for multiple CUDA threads to communicate, CUDA provides shared variables. These variables are declared with the __shared__ qualifier and are shared amongst all threads in a block.

The default behavior when a variable is declared is "private", meaning that each thread has its own copy of that variable.


int x;
__shared__ int y;


Here, each thread has its own copy of x, whereas each block has its own copy of y - which is shared by all of the threads in the block.

Using shared variables, we can improve upon the parallel sum program that we wrote earlier. In the previous example, the GPU calculated the various partial sums in parallel, but the CPU was responsible for computing the final result.

With shared variables, we can store the partial sums in a shared array. This way, all of the threads will have access to the partial sums so they can work together to produce the final result.

This kernel begins the same way as before:


__global__ void sum(int* result) {
    /* this array is shared for all of the threads in the block */
    __shared__ int partials[THREADS];

    /* find our start and end points */
    int start = START + threadIdx.x * ((END - START) / THREADS);
    int end = start + ((END - START) / THREADS) - 1;

    /* the last thread must go all the way to the end */
    if (threadIdx.x == (THREADS - 1)) {
        end = END;
    }

    /* calculate our part into the array of partial sums */
    partials[threadIdx.x] = 0;
    int i;
    for (i = start; i <= end; i++) {
        partials[threadIdx.x] += i;
    }


Next, we perform a reduction where the GPU threads calculate the sum of all of the partial sums using the following procedure:


/* now we perform a reduction on the partial sums
   start i at the number of threads in the block divided by 2 */
i = THREADS / 2;
while (i != 0) {
    /* if we are part of this round */
    if (threadIdx.x < i) {
        /* add the one to our right by i places into this one */
        partials[threadIdx.x] += partials[threadIdx.x + i];
    }

    /* cut i in half */
    i /= 2;
}


This takes $\log_2(N)$ steps, where $N$ is the number of threads.

This reduction only works when the number of threads is a power of two, so that it divides evenly at each step. If that is not the case, we need some extra code to fold the odd values into our sum. This can be handled like so:


/* now we perform a reduction on the partial sums
   start i at the number of threads in the block divided by 2 */
i = THREADS / 2;

/* if there were an odd number at the start, have one thread grab the last one */
if ((THREADS % 2 != 0) && (threadIdx.x == 0)) {
    partials[0] += partials[THREADS - 1];
}

while (i != 0) {
    /* if we are part of this round */
    if (threadIdx.x < i) {
        /* add the one to our right by i places into this one */
        partials[threadIdx.x] += partials[threadIdx.x + i];
    }

    /* if there is an odd one out, have one thread add its sum to slot 0 */
    if ((i > 1) && (i % 2 != 0) && (threadIdx.x == 0)) {
        partials[0] += partials[i - 1];
    }

    /* cut i in half */
    i /= 2;
}


After this process is finished, the final result will be stored in partials[0], so we have one thread write it back to the memory passed in from the CPU:


/* now array slot 0 should hold the final answer -
   have one thread write it to the output cell */
if (threadIdx.x == 0) {
    *result = partials[0];
}
}


All threads could have done this, but there is no need.

Unfortunately, this program does not run correctly!

Block Synchronization

The reason that this program fails is the same reason our first Pthreads sum programs failed - when we have shared variables, we need to provide synchronization of some kind.

The simplest form of synchronization in CUDA is the __syncthreads() function, which acts as a barrier: when a thread reaches the __syncthreads() call, it waits until every thread in its block has reached it before continuing.

Where do we need to put in calls to __syncthreads() to fix the program above?

The fixed program is available here.

Atomic Functions

Recall the OpenMP atomic directive which says that the following operation should be done atomically - as one indivisible step which can't be interrupted by any other thread.

If we want to increment a shared variable in all of the threads, we cannot simply do so like this:


__shared__ int x;  /* shared variables cannot have initializers;
                      assume one thread has already set x to 0 */

/* ... */

/* increment x */
x++;


The reason for this is that x++ is really three steps - a read, an add, and a write back - and if multiple threads interleave these steps, some of the updates will be lost.

We can however do this with an atomic operation. CUDA provides the atomicAdd function for exactly this purpose:


int atomicAdd(int* address, int value);

This function takes the address of a variable and a value to add. It atomically adds that value into the integer stored at the address, writing the result back. It also returns the original value stored at the address, from before the addition.

This allows us to increment a variable safely:

atomicAdd(&x, 1);

Other atomic operations include:

• atomicSub(address, value), subtracts value from the variable at address.
• atomicMin(address, value), sets the variable at address to the minimum of its current value and value.
• atomicMax(address, value), sets the variable at address to the maximum of its current value and value.
• atomicCAS(address, compare, value), checks whether the variable at address is equal to compare; if so, it sets the variable at address to value. It returns the original value of the variable at address. CAS stands for "compare and swap".

Implementing Mutexes in CUDA

We have seen how we can use __syncthreads to do synchronization of CUDA threads. Another synchronization primitive we might want is a mutex.

CUDA does not have mutexes built in, but we can build them with shared variables and atomic operations.

To do this, we will use a shared integer variable to represent the mutex. We will say that a value of 0 represents an unlocked mutex, and a value of 1 represents a locked mutex.


/* create a "mutex" and say that 0 is unlocked and 1 is locked
   (shared variables cannot have initializers, so one thread sets it
   and the block synchronizes before any thread uses it) */
__shared__ int mutex;
if (threadIdx.x == 0) {
    mutex = 0;
}
__syncthreads();


Now we can write functions to lock and unlock the mutex:


__device__ void lock(int* mutex) {
/* compare mutex to 0.
when it equals 0, set it to 1
we will break out of the loop after mutex gets set to 1 */
while (atomicCAS(mutex, 0, 1) != 0) {
/* do nothing */
}
}


We can write another function to unlock a mutex by setting it to 0:


__device__ void unlock(int* mutex) {
atomicExch(mutex, 0);
}


The lock function uses busy-waiting, which is not ideal. On a GPU, however, each individual core represents a far smaller share of the machine than a CPU core does, so busy-waiting is not quite as wasteful.