Home CPSC 425

CUDA Synchronization

Overview

We have seen how we can use threads to get around the limits CUDA places on blocks.

There are other reasons, however, that we might want to use threads in a CUDA program.

In particular, the various threads working on one CUDA block are able to cooperate much like the threads in a CPU program.


Shared Variables

In order for threads to communicate, CUDA provides shared variables. These variables have the __shared__ directive and are shared amongst all threads in a block.

The default behavior when a variable is declared is "private" meaning that the each thread has its own copy of that variable.


int x;
__shared__ int y;

Here, each thread has its own copy of x, whereas each block has its own copy of y - which is shared by all of the threads in the block.


Example: Threaded Sum

Using shared variables, we can improve upon the parallel sum program that we wrote earlier. In the previous example, the GPU calculated the various partial sums in parallel, but the CPU was responsible for computing the final result.

With shared variables, we can store the partial sums in a shared array. This way, all of the threads will have access to the partial sums so they can work together to produce the final result.

The beginning of this kernel begins the same as before:


__global__ void sum(int* result) {
    /* this array is shared for all of the threads in the block */
    __shared__ int partials[THREADS];

    /* find our start and end points */
    int start = ((END - START) / THREADS) * threadIdx.x;
    int end = start + ((END - START) / THREADS) - 1;

    /* the last core must go all the way to the end */
    if (threadIdx.x == (THREADS - 1)) {
        end = END;
    }

    /* calculate our part into the array of partial sums */
    partials[threadIdx.x] = 0;
    int i;
    for (i = start; i <= end; i++) {
        partials[threadIdx.x] += i;
    }

Next, we perform a reduction where the GPU threads calculate the sum of all of the partial sums using the following procedure:


    /* now we perform a reduction on the partial sums
       start i at the number of threads in the block divided by 2 */
    i = THREADS / 2;
    while (i != 0) {
        /* if we are part of this round */
        if (threadIdx.x < i) {
            /* add the one to our right by i places into this one */
            partials[threadIdx.x] += partials[threadIdx.x + i];
        }

        /* cut i in half */
        i /= 2;
    }

This takes $log_2(N)$ steps where $N$ is the number of threads (this also requires that the number of threads be a power of 2).

The code above will only work when the number of threads is a power of two, and divides evenly at each step. If this is not the case, we need to include some code to add the extra values into our sum. This can be handled like so:


    /* now we perform a reduction on the partial sums
       start i at the number of threads in the block divided by 2 */
    i = THREADS / 2;

    /* if there were an odd number at start, grab the last one */
    if (THREADS % 2 != 0) {
        if (threadIdx.x == 0) {
            partials[0] += partials[THREADS - 1];
        }
    }

    while (i != 0) {
        /* if we are not done */
        if (threadIdx.x < i) {
            /* add the one to our right by i places into this one */
            partials[threadIdx.x] += partials[threadIdx.x + i];
        }

        /* if there is an odd one out, add its sum to 0 */
        if ((i > 1) && (i % 2 != 0)) {
            if (threadIdx.x == 0) {
                partials[0] += partials[i - 1];
           }
        }

        /* wait for all threads to do this */
        __syncthreads();

        /* cut i in half */
        i /= 2;
    }

After this process is finished, the final result will be stored in partials[0], so we have one thread write it back to the memory passed in from the CPU:


    /* now array slot 0 should hold the final answer - have one thread write it to the output cell */
    if (threadIdx.x == 0) {
        *result = partials[0];
    }

All threads could have done this, but there is no need.

The full code for this program is given here.

Unfortunately, this program does not run correctly!


Block Synchronization

The reason that this program fails is the same reason our first Pthreads sum programs failed - when we have shared variables, we need to provide synchronization of some kind.

The simplest form of synchronization in CUDA is the __syncthreads() function which works like a barrier. Once one thread reaches the __syncthreads() call, it will wait until all threads have reached it.

Where do we need to put in calls to __syncthreads() to fix the program above?

The fixed program is available here.


Atomic Functions

Recall the OpenMP atomic directive which says that the following operation should be done atomically - as one indivisible step which can't be interrupted by any other thread.

If we want to increment a shared variable in all of the threads, we cannot simply do so like this:


__shared__ int x = 0;

/* ... */

/* increment x */
x++;

The reason for this is that it's possible that the addition happens at the same time in multiple threads and some of the updates will be lost.

We can however do this with an atomic operation. CUDA provides the atomicAdd function for exactly this purpose:


int atomicAdd(int* address, int value);

This function takes the address of variable, and some value. It then adds that value into the integer stored at the address, writing the result back. It also returns the original value of the integer stored at the address - before the addition.

This allows us to increment a variable safely:

atomicAdd(&x, 1);

Other atomic operations include:


Implementing Mutexes in CUDA

We have seen how we can use __syncthreads to do synchronization of CUDA threads. Another synchronization primitive we might want is a mutex.

CUDA does not have mutexes built in, but we can build them with shared variables and atomic operations.

To do this, we will use a shared integer variable to represent the mutex. We will say that a value of 0 represents an unlocked mutex, and a value of 1 represents a locked mutex.


/* create a "mutex" and say that 0 is unlocked and 1 is locked */
__shared__ int mutex = 0;

Now we can write functions to lock and unlock the mutex:


__device__ void lock(int* mutex) {
    /* compare mutex to 0.
       when it equals 0, set it to 1
       we will break out of the loop after mutex gets set to 1 */
    while (atomicCAS(mutex, 0, 1) != 0) {
        /* do nothing */
    }
}

We can write another function to unlock a mutex by setting it to 0:


__device__ void unlock(int* mutex) {
    atomicExch(mutex, 0);
}

The lock function uses busy-waiting which is not ideal. With a GPU, however, each individual core is less important than a CPU core, so busy-waiting isn't quite as bad.

Copyright © 2018 Ian Finlayson | Licensed under a Creative Commons Attribution 4.0 International License.