Introduction to CUDA

Overview

CUDA is a system that allows us to write programs that run on the GPU. CUDA programs are basically C or C++ programs, extended with extra keywords that allow us to specify parallelism.

Parts of a CUDA program execute on the CPU, while other parts execute on the GPU. CUDA allows us to specify which parts execute on the GPU as well as manage GPU memory.

CUDA was created by NVIDIA and is only supported on NVIDIA GPUs. OpenCL provides a way to write parallel code for multiple types of GPUs, but is more complex than CUDA.


GPU Architecture

Graphics processing units were originally created for rendering graphics quickly. This involves a few common operations, such as transforming vertices and computing the colors of pixels.

These operations also must be applied to large numbers of vertices or pixels, opening up the possibility of data parallelism.

These capabilities are great for many other computational tasks.

GPUs are quite different from CPUs: they have many more cores, but each core is simpler and slower than a CPU core, and they are designed to run the same operation across large amounts of data rather than to run a single thread as fast as possible.

solaria has 2 GeForce GTX 690 cards.


CUDA Programming Model

CUDA programs are composed of two different parts: regular code that runs on the CPU (the host), and kernel functions that run on the GPU (the device).

An important part of CUDA programming is managing memory. CPU and GPU memory are distinct: the GPU has its own memory, which is completely separate from the CPU's and is not part of the CPU's cache hierarchy.

For this reason, we must explicitly pass data between the CPU and GPU when writing CUDA programs.


Memory Management

GPU memory is completely separate from CPU memory. In order to allocate memory on the GPU, we call the cudaMalloc function:

cudaError_t cudaMalloc(void** pointer, size_t bytes)

This function takes the address of a pointer and the number of bytes to allocate. It allocates the memory on the GPU and returns a handle to it through the pointer that was passed in.

This pointer cannot be dereferenced directly, as it does not refer to an address in CPU memory. To store anything in GPU memory, we must use the cudaMemcpy function:

cudaError_t cudaMemcpy(void* destination, const void* source, size_t bytes, cudaMemcpyKind kind)

This function copies the specified number of bytes from the source location to the destination. "kind" can be one of:

cudaMemcpyHostToHost (copy within CPU memory)
cudaMemcpyHostToDevice (copy from the CPU to the GPU)
cudaMemcpyDeviceToHost (copy from the GPU to the CPU)
cudaMemcpyDeviceToDevice (copy within GPU memory)

There is not much reason to use "host to host", as the regular memcpy function can be used for that. For transferring memory to or from the device, however, cudaMemcpy is necessary.
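
As a rough sketch, the overall pattern of allocating GPU memory, copying data over, and freeing it looks like the following (the array names and sizes here are made up for illustration; a real program would also launch a kernel to work on the data):

int main() {
    /* some data in CPU memory */
    int cpu_data[8] = {1, 2, 3, 4, 5, 6, 7, 8};

    /* allocate space for 8 ints in GPU memory */
    int* gpu_data;
    cudaMalloc((void**) &gpu_data, 8 * sizeof(int));

    /* copy the data from the CPU to the GPU */
    cudaMemcpy(gpu_data, cpu_data, 8 * sizeof(int), cudaMemcpyHostToDevice);

    /* ... a kernel working on gpu_data would normally be launched here ... */

    /* copy the data back from the GPU to the CPU */
    cudaMemcpy(cpu_data, gpu_data, 8 * sizeof(int), cudaMemcpyDeviceToHost);

    /* free the GPU memory */
    cudaFree(gpu_data);

    return 0;
}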


Blocks

The CUDA programming model involves passing data to the GPU in "blocks", which are arrays of one, two, or three dimensions. CUDA automatically divides these blocks up amongst multiple GPU cores.

Inside a kernel, the blockIdx variable contains the indices that this core is supposed to process. If we are processing a 1D array, blockIdx.x gives us the index. For 2D or 3D arrays, blockIdx.y and blockIdx.z give the second and third dimensions.
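
As a rough sketch, a kernel that uses blockIdx.x to work on a 1D array might look like the following (the kernel name, array names, and size are made up for illustration):

#include <stdio.h>

#define N 8

/* each block doubles the single element at its own index */
__global__ void double_element(int* data) {
    data[blockIdx.x] *= 2;
}

int main() {
    int cpu_data[N] = {1, 2, 3, 4, 5, 6, 7, 8};

    /* allocate GPU memory and copy the data over */
    int* gpu_data;
    cudaMalloc((void**) &gpu_data, N * sizeof(int));
    cudaMemcpy(gpu_data, cpu_data, N * sizeof(int), cudaMemcpyHostToDevice);

    /* launch one instance of the kernel per element */
    double_element<<<N, 1>>>(gpu_data);

    /* copy the results back and free the GPU memory */
    cudaMemcpy(cpu_data, gpu_data, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(gpu_data);

    /* print the doubled values */
    for (int i = 0; i < N; i++) {
        printf("%d ", cpu_data[i]);
    }
    printf("\n");

    return 0;
}

Launching with <<<N, 1>>> runs one copy of the kernel for each of the N blocks, exactly as in the hello world program below.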


CUDA Hello World

The following is a hello world program in CUDA:


#include <stdio.h>

#define N 16
#define CORES 16

/* this is the GPU kernel function */
__global__ void hello(char* s) {
    /* blockIdx is a struct containing our block id
       if this is a one-dimensional kernel, then x is the block id
       y and z are also available for 2 or 3 dimensional kernels */

    /* capitalize the string by subtracting 32 from each lowercase letter */
    if ((s[blockIdx.x] >= 'a') && (s[blockIdx.x] <= 'z')) {
        s[blockIdx.x] -= 32;
    }
}

/* the main function begins running on the CPU */
int main() {
    /* this is the string data - it is 'hello world', in lower-case */
    char cpu_string[N] = "hello world!";

    /* allocate space on the GPU for the string */
    char* gpu_string;
    cudaMalloc((void**) &gpu_string, N * sizeof(char));

    /* send the character array to the GPU */
    cudaMemcpy(gpu_string, cpu_string, N * sizeof(char), cudaMemcpyHostToDevice);

    /* invoke the GPU to run the kernel in parallel
       we specify CORES cores which each run once */
    hello<<<CORES, 1>>>(gpu_string);

    /* copy the string back from the GPU to the CPU */
    cudaMemcpy(cpu_string, gpu_string, N * sizeof(char), cudaMemcpyDeviceToHost);

    /* free the memory we allocated on the GPU */
    cudaFree(gpu_string);

    /* print the string we got back from the GPU */
    printf("%s\n", cpu_string);

    return 0;
}

GPUs typically do not do input or output, but only computation. So this hello world program has the GPU "compute" the string "HELLO WORLD!" by capitalizing a copy of the string sent over from the CPU.

Note that there is no loop in the program. Invoking the hello kernel on 16 cores automatically launches 16 instances of the function on 16 GPU cores. Each instance has a different blockIdx.x, so every element of the array is processed.
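
For comparison, a sequential CPU version of the same computation would need an explicit loop, something like this sketch:

for (int i = 0; i < N; i++) {
    if ((cpu_string[i] >= 'a') && (cpu_string[i] <= 'z')) {
        cpu_string[i] -= 32;
    }
}

In the CUDA version, this loop is replaced by launching N instances of the kernel, each of which handles a single index.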


Compiling & Running

CUDA is not installed on cs because it does not have CUDA-capable graphics cards. solaria is the only CPSC server which can use CUDA.

You should be able to login to solaria from cs by entering the command:

$ ssh solaria

Now we can use the CUDA compiler to compile and run our hello world program:

$ nvcc hello.cu
$ ./a.out
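
By default nvcc names the executable a.out, just like gcc. We can give it a different name with the usual -o option:

$ nvcc -o hello hello.cu
$ ./hello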

CUDA and C++

CUDA can be used with C++ as easily as it can be used with C.


/* hello.cu */

#include <iostream>
using namespace std;

const int N = 16;
const int CORES = 16;

/* this is the GPU kernel function */
__global__ void hello(char* s) {
    /* blockIdx is a struct containing our block id
       if this is a one-dimensional kernel, then x is the block id
       y and z are also available for 2 or 3 dimensional kernels */

    /* capitalize the string by subtracting 32 from each lowercase letter */
    if ((s[blockIdx.x] >= 'a') && (s[blockIdx.x] <= 'z')) {
        s[blockIdx.x] -= 32;
    }
}

/* the main function begins running on the CPU */
int main() {
    /* this is the string data - it is 'hello world', in lower-case */
    char cpu_string[N] = "hello world!";

    /* allocate space on the GPU for the string */
    char* gpu_string;
    cudaMalloc((void**) &gpu_string, N * sizeof(char));

    /* send the character array to the GPU */
    cudaMemcpy(gpu_string, cpu_string, N * sizeof(char), cudaMemcpyHostToDevice);

    /* invoke the GPU to run the kernel in parallel
       we specify CORES cores which each run once */
    hello<<<CORES, 1>>>(gpu_string);

    /* copy the string back from the GPU to the CPU */
    cudaMemcpy(cpu_string, gpu_string, N * sizeof(char), cudaMemcpyDeviceToHost);

    /* free the memory we allocated on the GPU */
    cudaFree(gpu_string);

    /* print the string we got back from the GPU */
    cout << cpu_string << endl;

    return 0;
}

We can compile and run C++/CUDA code in exactly the same way as C/CUDA code.

The nvcc program passes all __global__ functions to a special compiler for CUDA, and compiles the rest of the code with a C++ compiler. Because C is nearly a subset of C++, we can use C or C++.


__global__ Functions

The "regular" functions in a CUDA program can be any arbitrary C or C++ code.

The __global__ functions, however, are CUDA code, and there are some limitations on what can be used inside them. The following program is identical to the hello world program above, except that it attempts to use the C library functions isupper and toupper:


#include <unistd.h>
#include <ctype.h>
#include <stdio.h>

#define N 16
#define CORES 16

/* this is the GPU kernel function */
__global__ void hello(char* s) {
    /* blockIdx is a struct containing our block id
       if this is a one-dimensional kernel, then x is the block id
       y and z are also available for 2 or 3 dimensional kernels */

    /* try to capitalize the string using the C library character functions */
    if (isupper(s[blockIdx.x])) {
        s[blockIdx.x] = toupper(s[blockIdx.x]);
    }
}

/* the main function begins running on the CPU */
int main() {
    /* this is the string data - it is 'hello world', in lower-case */
    char cpu_string[N] = "hello world!";

    /* allocate space on the GPU for the string */
    char* gpu_string;
    cudaMalloc((void**) &gpu_string, N * sizeof(char));

    /* send the character array to the GPU */
    cudaMemcpy(gpu_string, cpu_string, N * sizeof(char), cudaMemcpyHostToDevice);

    /* invoke the GPU to run the kernel in parallel
       we specify CORES cores which each run once */
    hello<<<CORES, 1>>>(gpu_string);

    /* copy the string back from the GPU to the CPU */
    cudaMemcpy(cpu_string, gpu_string, N * sizeof(char), cudaMemcpyDeviceToHost);

    /* free the memory we allocated on the GPU */
    cudaFree(gpu_string);

    /* print the string we got back from the GPU */
    printf("%s\n", cpu_string);

    return 0;
}

This code does not compile, however:

hello-fail.cu(17): error: calling a __host__ function("isupper") from a __global__ function("hello") is not allowed

hello-fail.cu(18): error: calling a __host__ function("toupper") from a __global__ function("hello") is not allowed

The CUDA code in a __global__ function cannot call functions that run on the CPU. We can, however, call printf on the GPU, because that function has a special CUDA implementation.

Otherwise, the CUDA code in a __global__ function can be almost any C or C++ code; the main restriction is that it cannot call host functions, which rules out most of the C and C++ standard libraries (printf being a notable exception).
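
One way around this limitation is to write the helper function ourselves and mark it __device__, which makes it callable from kernel code. The following sketch does this (the my_toupper name is made up for illustration), and also calls printf from inside the kernel:

#include <stdio.h>

#define N 16
#define CORES 16

/* a __device__ function runs on the GPU and can be called from kernels */
__device__ char my_toupper(char c) {
    if ((c >= 'a') && (c <= 'z')) {
        return c - 32;
    }
    return c;
}

/* the GPU kernel capitalizes one character and reports which block ran */
__global__ void hello(char* s) {
    s[blockIdx.x] = my_toupper(s[blockIdx.x]);
    printf("block %d is done\n", (int) blockIdx.x);
}

int main() {
    char cpu_string[N] = "hello world!";

    /* allocate space on the GPU and copy the string over */
    char* gpu_string;
    cudaMalloc((void**) &gpu_string, N * sizeof(char));
    cudaMemcpy(gpu_string, cpu_string, N * sizeof(char), cudaMemcpyHostToDevice);

    /* run the kernel, then copy the result back and free the GPU memory */
    hello<<<CORES, 1>>>(gpu_string);
    cudaMemcpy(cpu_string, gpu_string, N * sizeof(char), cudaMemcpyDeviceToHost);
    cudaFree(gpu_string);

    printf("%s\n", cpu_string);
    return 0;
}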

Copyright © 2018 Ian Finlayson | Licensed under a Creative Commons Attribution 4.0 International License.