
Introduction to CUDA

 

Overview

CUDA is a system that allows us to write programs that run on the GPU. CUDA programs are essentially C or C++ programs, extended with a few extra keywords that let us specify parallelism.

Parts of a CUDA program execute on the CPU, while other parts execute on the GPU. CUDA allows us to specify which parts execute on the GPU as well as manage GPU memory.

CUDA was created by NVIDIA and is only supported on NVIDIA GPUs. OpenCL provides a way to write parallel code for multiple types of GPUs, but is more complex than CUDA.


 

GPU Architecture

Graphics processing units were originally created for rendering graphics quickly. This involves a few common operations: transforming vertices, rasterizing triangles, and shading pixels.

These operations must be applied to large numbers of vertices or pixels, opening up the possibility of data parallelism.

These capabilities also turn out to be useful for many other computational tasks.

GPUs are quite different from CPUs: they have many more cores, but each core is simpler and slower than a CPU core. They are built for throughput on data-parallel workloads rather than fast execution of a single thread.


 

CUDA Programming Model

CUDA programs are composed of two different parts: host code, which runs on the CPU, and kernel code, which runs on the GPU.

An important part of CUDA programming is managing memory. CPU and GPU memory are distinct: the GPU has its own memory, completely separate from the CPU's, and it is not part of the CPU's cache hierarchy.

For this reason, we must explicitly pass data between the CPU and GPU when writing CUDA programs.


 

Memory Management

GPU memory is completely separate from CPU memory. In order to allocate memory on the GPU, we call the cudaMalloc function:

cudaError_t cudaMalloc(void** pointer, size_t bytes)

This function takes the address of a pointer and the number of bytes to allocate. It allocates the memory on the GPU and returns a handle to it in the pointer that was passed in.

This pointer cannot be dereferenced directly, as it does not refer to a memory address on the CPU. To store anything in GPU memory, we must use the cudaMemcpy function:

cudaError_t cudaMemcpy(void* destination, const void* source, size_t bytes, cudaMemcpyKind kind)

This function copies the specified number of bytes from the source location to the destination. The "kind" parameter specifies the direction of the copy and can be one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice.

There is not much reason to use "host to host", as the regular memcpy function can be used instead. For transferring memory to or from the device, however, cudaMemcpy is necessary.
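
Putting these together, here is a minimal sketch (not one of the course examples, and it launches no kernel) that allocates GPU memory, copies an array to the GPU and back, and checks the error code returned by cudaMalloc:


#include <stdio.h>

#define N 8

int main() {
    /* an array of N ints on the CPU */
    int cpu_data[N] = {1, 2, 3, 4, 5, 6, 7, 8};

    /* allocate space for N ints on the GPU, checking the returned error code */
    int* gpu_data;
    cudaError_t status = cudaMalloc((void**) &gpu_data, N * sizeof(int));
    if (status != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(status));
        return 1;
    }

    /* copy the array from the CPU to the GPU, then back again */
    cudaMemcpy(gpu_data, cpu_data, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(cpu_data, gpu_data, N * sizeof(int), cudaMemcpyDeviceToHost);

    /* free the GPU memory */
    cudaFree(gpu_data);

    /* the data survives the round trip unchanged */
    printf("%d\n", cpu_data[0]);

    return 0;
}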


 

Blocks

The CUDA programming model divides the work we send to the GPU into "blocks", arranged in a grid of one or more dimensions, typically matching the array being processed. CUDA automatically divides these blocks up amongst the GPU's cores.

Inside the kernel, the blockIdx variable is available, containing the index of the block that the current invocation is responsible for. If we are processing a 1D array, then blockIdx.x gives us the index. For 2D or 3D arrays, blockIdx.y and blockIdx.z give the second and third dimensions.


 

CUDA Hello World

The following is a hello world program in CUDA:


#include <stdio.h>

#define N 16
#define CORES 16

/* this is the GPU kernel function */
__global__ void hello(char* s) {
    /* blockIdx is a struct containing our block id
       if this is a one-dimensional kernel, then x is the block id
       y and z are also available for 2 or 3 dimensional kernels */

    /* capitalize the string by subtracting 32 from each lowercase letter */
    if ((s[blockIdx.x] >= 'a') && (s[blockIdx.x] <= 'z')) {
        s[blockIdx.x] -= 32;
    }
}

/* the main function begins running on the CPU */
int main() {
    /* this is the string data - it is 'hello world', in lower-case */
    char cpu_string[N] = "hello world!";

    /* allocate space on the GPU for the string */
    char* gpu_string;
    cudaMalloc((void**) &gpu_string, N * sizeof(char));

    /* send the character array to the GPU */
    cudaMemcpy(gpu_string, cpu_string, N * sizeof(char), cudaMemcpyHostToDevice);

    /* invoke the GPU to run the kernel in parallel
       we specify CORES cores which each run once */
    hello<<<CORES, 1>>>(gpu_string);

    /* copy the string back from the GPU to the CPU */
    cudaMemcpy(cpu_string, gpu_string, N * sizeof(char), cudaMemcpyDeviceToHost);

    /* free the memory we allocated on the GPU */
    cudaFree(gpu_string);

    /* print the string we got back from the GPU */
    printf("%s\n", cpu_string);

    return 0;
}

GPUs typically do not do input or output, but only computation. So this hello world program has the GPU "compute" the string "HELLO WORLD!" by capitalizing the string stored on the CPU.

Note that there is no loop in the program. Invoking the hello function with 16 blocks automatically launches 16 instances of the function on the GPU cores. Each instance has a different blockIdx.x, so each spot in the array is processed.
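
For comparison, the equivalent serial version on the CPU would need an explicit loop over the characters (a sketch only, using the same N and cpu_string as the program above):


/* serial CPU version for comparison: visit each character explicitly */
int i;
for (i = 0; i < N; i++) {
    if ((cpu_string[i] >= 'a') && (cpu_string[i] <= 'z')) {
        cpu_string[i] -= 32;
    }
}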


 

Compiling & Running

CUDA is not installed on the cpsc server because it does not have CUDA-capable graphics cards.

On a machine that does have the CUDA compiler, we can compile and run our hello world program:

$ nvcc hello.cu
$ ./a.out
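
If we want to give the executable a better name, nvcc accepts the usual -o flag:

$ nvcc -o hello hello.cu
$ ./hello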

 

CUDA and C++

CUDA can be used with C++ as easily as it can be used with C.


/* hello.cu */

#include <iostream>
using namespace std;

const int N = 16;
const int CORES = 16;

/* this is the GPU kernel function */
__global__ void hello(char* s) {
    /* blockIdx is a struct containing our block id
       if this is a one-dimensional kernel, then x is the block id
       y and z are also available for 2 or 3 dimensional kernels */

    /* capitalize the string by subtracting 32 from each lowercase letter */
    if ((s[blockIdx.x] >= 'a') && (s[blockIdx.x] <= 'z')) {
        s[blockIdx.x] -= 32;
    }
}

/* the main function begins running on the CPU */
int main() {
    /* this is the string data - it is 'hello world', in lower-case */
    char cpu_string[N] = "hello world!";

    /* allocate space on the GPU for the string */
    char* gpu_string;
    cudaMalloc((void**) &gpu_string, N * sizeof(char));

    /* send the character array to the GPU */
    cudaMemcpy(gpu_string, cpu_string, N * sizeof(char), cudaMemcpyHostToDevice);

    /* invoke the GPU to run the kernel in parallel
       we specify CORES cores which each run once */
    hello<<<CORES, 1>>>(gpu_string);

    /* copy the string back from the GPU to the CPU */
    cudaMemcpy(cpu_string, gpu_string, N * sizeof(char), cudaMemcpyDeviceToHost);

    /* free the memory we allocated on the GPU */
    cudaFree(gpu_string);

    /* print the string we got back from the GPU */
    cout << cpu_string << endl;

    return 0;
}

We can compile and run C++/CUDA code exactly the same as C/CUDA code.

The nvcc program passes all __global__ functions to a special compiler for CUDA, and compiles the rest of the code with a C++ compiler. Because C is nearly a subset of C++, we can use C or C++.


 

__global__ Functions

The "regular" functions in a CUDA program can be any arbitrary C or C++ code.

The __global__ functions, however, are CUDA code, and there are some limitations on what can be used in them. The following program is identical to the Hello World program above, except that it uses the C library functions islower and toupper:


#include <unistd.h>
#include <ctype.h>
#include <stdio.h>

#define N 16
#define CORES 16

/* this is the GPU kernel function */
__global__ void hello(char* s) {
    /* blockIdx is a struct containing our block id
       if this is a one-dimensional kernel, then x is the block id
       y and z are also available for 2 or 3 dimensional kernels */

    /* capitalize the string using the C library character functions */
    if (islower(s[blockIdx.x])) {
        s[blockIdx.x] = toupper(s[blockIdx.x]);
    }
}

/* the main function begins running on the CPU */
int main() {
    /* this is the string data - it is 'hello world', in lower-case */
    char cpu_string[N] = "hello world!";

    /* allocate space on the GPU for the string */
    char* gpu_string;
    cudaMalloc((void**) &gpu_string, N * sizeof(char));

    /* send the character array to the GPU */
    cudaMemcpy(gpu_string, cpu_string, N * sizeof(char), cudaMemcpyHostToDevice);

    /* invoke the GPU to run the kernel in parallel
       we specify CORES cores which each run once */
    hello<<<CORES, 1>>>(gpu_string);

    /* copy the string back from the GPU to the CPU */
    cudaMemcpy(cpu_string, gpu_string, N * sizeof(char), cudaMemcpyDeviceToHost);

    /* free the memory we allocated on the GPU */
    cudaFree(gpu_string);

    /* print the string we got back from the GPU */
    printf("%s\n", cpu_string);

    return 0;
}

This code does not compile however:

hello-fail.cu(17): error: calling a __host__ function("islower") from a __global__ function("hello") is not allowed

hello-fail.cu(18): error: calling a __host__ function("toupper") from a __global__ function("hello") is not allowed

The CUDA code in a __global__ function cannot call functions that run on the CPU. We can, however, call printf on the GPU, because that function has a special CUDA implementation.
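
For example, this small sketch (not one of the programs above) prints a message from each block using the GPU version of printf; note that the blocks may print in any order:


#include <stdio.h>

/* each block prints its id using the GPU implementation of printf */
__global__ void show() {
    printf("Hello from block %d\n", blockIdx.x);
}

int main() {
    /* launch the kernel in 4 blocks of one thread each */
    show<<<4, 1>>>();

    /* wait for the GPU to finish so the output appears before the program exits */
    cudaDeviceSynchronize();

    return 0;
}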

CUDA code can otherwise be nearly any C or C++ code, with a few exceptions: device code cannot call host functions, cannot directly access CPU memory, and cannot use most of the C standard library.


 

CUDA Sum Example

An example of parallel computing that we have seen for Pthreads, OpenMP, and MPI is the partial sum example.

Here, each thread or process is responsible for computing the sum of a range of numbers. Then, each partial sum is added together to make the final answer.

This can be done in CUDA in the following manner:

  1. Allocate space on the GPU to hold the partial sums.
  2. Call a __global__ function on the GPU to calculate partial sums in parallel.
  3. Copy the partial sums from GPU memory to the CPU.
  4. Have the CPU compute the final answer.

The following code demonstrates this:


/* sum.cu */

#include <unistd.h>
#include <stdio.h>

#define CORES 16 

#define START 0
#define END 100

/* this is the GPU kernel function */
__global__ void sum(int* partials) {
    /* find our start and end points */
    int start = ((END - START) / CORES) * blockIdx.x;
    int end = start + ((END - START) / CORES) - 1;

    /* the last core must go all the way to the end */
    if (blockIdx.x == (CORES - 1)) {
        end = END;
    }

    /* calculate our part into the array of partial sums */
    partials[blockIdx.x] = 0;
    int i;
    for (i = start; i <= end; i++) {
        partials[blockIdx.x] += i;
    }
}

/* the main function begins running on the CPU */
int main() {
    /* space to store partials on CPU */
    int cpu_partials[CORES];

    /* allocate space on the GPU for the partial ints */
    int* partials;
    cudaMalloc((void**) &partials, CORES * sizeof(int));

    /* invoke the GPU to run the kernel in parallel
       we specify CORES cores which each run once */
    sum<<<CORES, 1>>>(partials);

    /* copy the partial sums back from the GPU to the CPU */
    cudaMemcpy(cpu_partials, partials, CORES * sizeof(int), cudaMemcpyDeviceToHost);

    /* free the memory we allocated on the GPU */
    cudaFree(partials);

    /* now find the final answer on the CPU */
    int answer = 0, i;
    for (i = 0; i < CORES; i++) {
        answer += cpu_partials[i];
    }

    /* print the final result */
    printf("Final answer = %d.\n", answer); 
    return 0;
}
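
As a quick check of the result, the numbers 0 through 100 sum to 100 * 101 / 2 = 5050, so the program should print "Final answer = 5050."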

 

__device__ Functions

As we have seen, CUDA __global__ functions may not call CPU functions. The __global__ functions we have seen so far are simple enough that keeping all of the code in a single function is fine.

However, as our GPU programs get more sophisticated, we will need to split the code into multiple functions.

We can solve this problem with __device__ functions. These are functions that begin with the __device__ keyword which specifies that the function is CUDA code and should be compiled to run on the GPU.

These functions can be called from the __global__ function. They can take any data as parameters, and return any data type as well.

The following example uses __device__ functions to split the logic of "islower" and "toupper" out of the hello world example seen above:


/* hello-functions.cu */

#include <unistd.h>
#include <stdio.h>

#define N 16
#define CORES 16 

/* a GPU function for islower */
__device__ int islower(char c) {
    return (c >= 'a') && (c <= 'z');
}

/* a GPU function for toupper (assumes its argument is a lowercase letter) */
__device__ char toupper(char c) {
    return c - 32;
}

/* this is the GPU kernel function */
__global__ void hello(char* s) {
    /* blockIdx is a struct containing our block id
       if this is a one-dimensional kernel, then x is the block id
       y and z are also available for 2 or 3 dimensional kernels */
    int i = blockIdx.x;

    /* capitalize the string by capitalizing each lower case letter */
    if (islower(s[i])) {
        s[i] = toupper(s[i]);
    }
}

/* the main function begins running on the CPU */
int main() {
    /* this is the string data - it is 'hello world', in lower-case */
    char cpu_string[N] = "hello world!";

    /* allocate space on the GPU for the string */
    char* gpu_string;
    cudaMalloc((void**) &gpu_string, N * sizeof(char));

    /* send the character array to the GPU */
    cudaMemcpy(gpu_string, cpu_string, N * sizeof(char), cudaMemcpyHostToDevice);

    /* invoke the GPU to run the kernel in parallel
       we specify CORES cores which each run once */
    hello<<<CORES, 1>>>(gpu_string);

    /* copy the string back from the GPU to the CPU */
    cudaMemcpy(cpu_string, gpu_string, N * sizeof(char), cudaMemcpyDeviceToHost);

    /* free the memory we allocated on the GPU */
    cudaFree(gpu_string);

    /* print the string we got back from the GPU */
    printf("%s\n", cpu_string);

    return 0;
}

 

Multi-Dimensional Kernels

In the examples above, we used a one-dimensional kernel. CUDA calls our __global__ kernel function once for each index in the array.

It's also possible to use multi-dimensional kernels. This is done by declaring a dim3 variable, which can describe a 2-dimensional space when we pass two dimensions to its constructor:


dim3 space(width, height);

This declares a variable called "space" which can then be passed to CUDA when launching a kernel:


kernel<<<space, 1>>>

Now, the kernel will be called width*height times. Also, the blockIdx variable will have its x and y values set according to which block we are calculating.

This can also be extended to three-dimensional blocks:


dim3 space(width, height, depth);

This would be useful for looping through three-dimensional arrays.

The grid dimensions are limited: the y and z dimensions may be at most 65,535 blocks, and on older GPUs the x dimension has the same limit.
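
As a small sketch of how this looks in practice (this fill program is made up for illustration; it follows the same pattern as the Julia example below), each kernel invocation combines blockIdx.x and blockIdx.y into an offset into a flat array:


#include <stdio.h>

#define WIDTH 4
#define HEIGHT 3

/* each block fills one element with the sum of its 2D coordinates */
__global__ void fill(int* data) {
    int x = blockIdx.x;
    int y = blockIdx.y;

    /* gridDim holds the dimensions of the grid, used to flatten (x, y) into an offset */
    int offset = x + y * gridDim.x;

    data[offset] = x + y;
}

int main() {
    int cpu_data[WIDTH * HEIGHT];

    /* allocate space for the array on the GPU */
    int* gpu_data;
    cudaMalloc((void**) &gpu_data, WIDTH * HEIGHT * sizeof(int));

    /* launch a WIDTH by HEIGHT grid of blocks, one thread each */
    dim3 space(WIDTH, HEIGHT);
    fill<<<space, 1>>>(gpu_data);

    /* copy the results back to the CPU and free the GPU memory */
    cudaMemcpy(cpu_data, gpu_data, WIDTH * HEIGHT * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(gpu_data);

    /* print the array row by row */
    int i, j;
    for (i = 0; i < HEIGHT; i++) {
        for (j = 0; j < WIDTH; j++) {
            printf("%d ", cpu_data[i * WIDTH + j]);
        }
        printf("\n");
    }

    return 0;
}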


 

Julia Set Example

As an example of multi-dimensional kernels, we will look at a program which computes the Julia set.

This is an application of a certain type of function over complex numbers (numbers with a real and imaginary part).

The function applies an iterative process to each point: starting from the point's value a, it repeatedly computes a = a*a + c for a fixed complex constant c. If the values diverge to infinity, the point is not in the Julia set. If they remain bounded, it is in the set.

The interesting thing about the Julia set is that, for most choices of the constant c, the set forms a chaotic fractal when plotted in two dimensions.

The example we will look at applies this function to the X and Y coordinates of an image, and sets the pixel value to be either red if the function returns a 1, or black if the function returns a 0.

Important points of the implementation:

  1. The code run on the GPU is larger than the other examples, so 4 __device__ functions are used, in addition to the __global__ kernel function.
  2. The chaotic function is the "julia" function. The constants used are mostly arbitrary. Tweaking them will change the generated fractal.
  3. The kernel is in two dimensions. CUDA is responsible for allocating GPU cores to run the kernel. There are many more pixels than cores, so CUDA schedules multiple blocks onto each core over the course of the run.
  4. Note that we do not have code to loop through the X and Y coordinates on the GPU. That is handled automatically by CUDA.
  5. The color data is stored on the GPU as the program runs, then copied back afterwards.
  6. The PPM image format is used to save the image, so we can see the result.

The full code example is below:


/* julia.cu */

#include <stdio.h>
#include <stdlib.h>

/* size of the image width and height in pixels */
#define SIZE 5000

/* an image is a 2D array of color data */
struct Image {
    int width, height;
    unsigned char* data;
};

/* function to save the image to a PPM file */
void save_image(struct Image* image, const char* fname) {
    /* open the file */
    FILE* f = fopen(fname, "w");
    if (!f) {
        fprintf(stderr, "Error, could not open file '%s' for writing.\n", fname);
        return;
    }

    /* write the magic number, size, and max color */
    fprintf(f, "P3\n%d %d\n255\n", image->width, image->height);

    /* write the image data */
    int i, j;
    for (i = 0; i < image->height; i++) {
        for (j = 0; j < image->width; j++)  {
            fprintf(f, "%u ", image->data[i*image->width*3 + j*3 + 0]);
            fprintf(f, "%u ", image->data[i*image->width*3 + j*3 + 1]);
            fprintf(f, "%u ", image->data[i*image->width*3 + j*3 + 2]);
        }
        fprintf(f, "\n");
    }

    /* close the file */
    fclose(f);
}

/* a struct for storing complex numbers */
struct Complex {
    float r;
    float i;
};

/* get the magnitude of a complex number */
__device__ float magnitude(struct Complex c) {
    return c.r * c.r + c.i * c.i;
}

/* multiply two complex numbers */
__device__ struct Complex mult(struct Complex a, struct Complex b) {
    struct Complex result;
    result.r = a.r * b.r - a.i * b.i;
    result.i = a.i * b.r + a.r * b.i;
    return result;
}

/* add two complex numbers */
__device__ struct Complex add(struct Complex a, struct Complex b) {
    struct Complex result;
    result.r = a.r + b.r;
    result.i = a.i + b.i;
    return result;
}

/* the julia function */
__device__ int julia(int x, int y) {
    /* translate the point into complex space */
    const float scale = 1.5;
    float jx = scale * (float)(SIZE / 2 - x) / (SIZE / 2);
    float jy = scale * (float)(SIZE / 2 - y) / (SIZE / 2);

    /* arbitrary starting point */
    struct Complex c;
    c.r = -0.8;
    c.i = 0.156;

    struct Complex a;
    a.r = jx;
    a.i = jy;

    int i;
    /* apply the iterative process */
    for (i = 0; i < 200; i++) {
        a = add(mult(a, a), c);
        if (magnitude(a) > 1000) {
            return 0;
        }
    }

    return 1;
}

/* the main kernel function runs julia on each pixel */
__global__ void kernel(unsigned char* data) {
    /* map from blockIdx to pixel position */
    int x = blockIdx.x;
    int y = blockIdx.y;

    int offset = x + y * gridDim.x;

    /* calculate the value at this position */
    int julia_value = julia(x, y);
    data[offset * 3 + 0] = 255 * julia_value;
    data[offset * 3 + 1] = 0;
    data[offset * 3 + 2] = 0;
}

/* the main function begins running on the CPU */
int main(int argc, char** argv) {
    /* create the image */
    struct Image image;
    image.width = SIZE;
    image.height = SIZE;
    image.data = (unsigned char*) malloc(SIZE * SIZE * 3 * sizeof(unsigned char));

    /* allocate space for the image data */
    unsigned char* gpu_image;
    cudaMalloc((void**) &gpu_image, SIZE * SIZE * 3 * sizeof(unsigned char));

    /* run the kernel */
    dim3 grid(SIZE, SIZE);
    kernel<<<grid, 1>>>(gpu_image);

    /* copy the data back into the image */
    cudaMemcpy(image.data, gpu_image, SIZE * SIZE * 3 * sizeof(unsigned char), cudaMemcpyDeviceToHost);

    /* free the GPU memory */
    cudaFree(gpu_image);

    /* save the image */
    save_image(&image, "julia.ppm");

    /* free the CPU memory holding the image data */
    free(image.data);

    return 0;
}

The output of this program is a very large (5000 by 5000 pixel) image of the fractal, saved as julia.ppm.

The interesting thing about the Julia set is that the relatively simple function gives very interesting output.

Copyright © 2024 Ian Finlayson | Licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.