# More CUDA Programming

## CUDA Sum Example

An example of parallel computing that we have seen for Pthreads, OpenMP, and MPI is the partial sum example.

Here, each thread or process is responsible for computing the sum of a range of numbers. Then, each partial sum is added together to make the final answer.

This can be done in CUDA in the following manner:

1. Allocate space on the GPU to hold the partial sums.
2. Call a __global__ function on the GPU to calculate partial sums in parallel.
3. Copy the partial sums from GPU memory to the CPU.
4. Have the CPU compute the final answer.

/* sum.cu */

#include <unistd.h>
#include <stdio.h>

#define CORES 16

#define START 0
#define END 100

/* this is the GPU kernel function */
__global__ void sum(int* partials) {
/* find our start and end points */
int start = ((END - START) / CORES) * blockIdx.x;
int end = start + ((END - START) / CORES) - 1;

/* the last core must go all the way to the end */
if (blockIdx.x == (CORES - 1)) {
end = END;
}

/* calculate our part into the array of partial sums */
partials[blockIdx.x] = 0;
int i;
for (i = start; i <= end; i++) {
partials[blockIdx.x] += i;
}
}

/* the main function begins running on the CPU */
int main() {
/* space to store partials on CPU */
int cpu_partials[CORES];

/* allocate space on the GPU for the partial ints */
int* partials;
cudaMalloc((void**) &partials, CORES * sizeof(int));

/* invoke the GPU to run the kernel in parallel
we specify CORES cores which each run once */
sum<<<CORES, 1>>>(partials);

/* copy the partial sums back from the GPU to the CPU */
cudaMemcpy(cpu_partials, partials, CORES * sizeof(int), cudaMemcpyDeviceToHost);

/* free the memory we allocated on the GPU */
cudaFree(partials);

/* now find the final answer on the CPU */
for (i = 0; i < CORES; i++) {
}

/* print the final result */
return 0;
}


## __device__ Functions

As we have seen, CUDA __global__ functions may not call CPU functions. The __global__ functions we have seen are simple enough that having them exist as single entities is OK.

However, as our GPU programs get more sophisticated, we will need to split the code into multiple functions.

We can solve this problem with __device__ functions. These are functions that begin with the __device__ keyword which specifies that the function is CUDA code and should be compiled to run on the GPU.

These functions can be called from the __global__ function. They can take any data as parameters, and return any data type as well.

The following example use __device__ functions to split the logic of "islower" and "toupper" out of the hello world example seen last time:


/* hello-functions.cu */

#include <unistd.h>
#include <stdio.h>

#define N 16
#define CORES 16

/* a GPU function for islower */
__device__ int islower(char c) {
return (c >= 'a') && (c <= 'z');
}

/* a GPU function for toupper */
__device__ char toupper(char c) {
return c - 32;
}

/* this is the GPU kernel function */
__global__ void hello(char* s) {
/* blockIdx is a struct containing our block id
if this this is a one-dimensional kernel, then x is the block id
y and z are also available for 2 or 3 dimensional kernels */
int i = blockIdx.x;

/* capitalize the string by capitalizing each lower case letter */
if (islower(s[i])) {
s[i] = toupper(s[i]);
}
}

/* the main function begins running on the CPU */
int main() {
/* this is the string data - it is 'hello world', in lower-case */
char cpu_string[N] = "hello world!";

/* allocate space on the GPU for the string */
char* gpu_string;
cudaMalloc((void**) &gpu_string, N * sizeof(char));

/* send the character array to the GPU */
cudaMemcpy(gpu_string, cpu_string, N * sizeof(char), cudaMemcpyHostToDevice);

/* invoke the GPU to run the kernel in parallel
we specify CORES cores which each run once */
hello<<<CORES, 1>>>(gpu_string);

/* copy the string back from the GPU to the CPU */
cudaMemcpy(cpu_string, gpu_string, N * sizeof(char), cudaMemcpyDeviceToHost);

/* free the memory we allocated on the GPU */
cudaFree(gpu_string);

/* print the string we got back from the GPU */
printf("%s\n", cpu_string);

return 0;
}


## Multi-Dimensional Kernels

In the examples above, we use a one dimensional kernel. CUDA calls our __global__ kernel function once for each of the single items in the array.

It's also possible to use multi-dimensional kernels. This is done by declaring a dim3 variable. This can be used to declare 2-dimensional spaces by passing the two dimensions to the dim3 constructor:


dim3 space(width, height);

This declares a variable called "space" which can then be passed to CUDA when launching a kernel:


kernel<<<space, 1>>>

Now, the kernel will be called width*height times. Also, the blockIdx variable will have its x and y values set according to which block we are calculating.

This can also be extended to three-dimensional blocks:


dim3 space(width, height, depth);

This would be useful for looping through three-dimensional arrays.

The limit on the size of any one dimension is 65,535.

## Julia Set Example

As an example of multi-dimensional kernels, we will look at a program which computes the Julia set.

This is an application of a certain type of function over complex numbers (numbers with a real and imaginary part).

The function applies an iterative process to the values. If they diverge to infinity, they are not in the Julia set. If they remain bounded, they are in the set.

The interesting thing about the Julia set is that, for most values of constants in the function, the set produces chaotic fractals when applied in two dimensions.

The example we will look at applies this function to the X and Y coordinates of an image, and sets the pixel value to be either red if the function returns a 1, or black if the function returns a 0.

Important points of the implementation:

1. The code run on the GPU is larger than the other examples, so 4 __device__ functions are used, in addition to the __global__ kernel function.
2. The chaotic function is the "julia" function. The constants used are mostly arbitrary. Tweaking them will change the generated fractal.
3. The kernel is in two dimensions. CUDA is responsible for allocating GPU cores to run the kernel on. There are many more pixels than cores, so CUDA must give the core multiple points.
4. Note that we do not have code to loop through the X and Y coordinates on the GPU. That is handled automatically by CUDA.
5. The color data is stored on the GPU as the program runs, then copied back afterwards.
6. The PPM image format is used to save the image, so we can see the result.

/* julia.cu */

#include <stdio.h>

/* size of the image width and height in pixels */
#define SIZE 5000

/* an image is a 2D array of color data */
struct Image {
int width, height;
unsigned char* data;
};

/* function to save the image to a PPM file */
void save_image(struct Image* image, const char* fname) {
/* open the file */
FILE* f = fopen(fname, "w");
if (!f) {
fprintf(stderr, "Error, could not open file '%s' for writing.\n", fname);
return;
}

/* write the magic number, size, and max color */
fprintf(f, "P3\n%d %d\n255\n", image->width, image->height);

/* write the image data */
int i, j;
for (i = 0; i < image->height; i++) {
for (j = 0; j < image->width; j++)  {
fprintf(f, "%u ", image->data[i*image->width*3 + j*3 + 0]);
fprintf(f, "%u ", image->data[i*image->width*3 + j*3 + 1]);
fprintf(f, "%u ", image->data[i*image->width*3 + j*3 + 2]);
}
fprintf(f, "\n");
}

/* close the file */
fclose(f);
}

/* a struct for storing complex numbers */
struct Complex {
float r;
float i;
};

/* get the magnitude of a complex number */
__device__ float magnitude(struct Complex c) {
return c.r * c.r + c.i * c.i;
}

/* multiply two complex numbers */
__device__ struct Complex mult(struct Complex a, struct Complex b) {
struct Complex result;
result.r = a.r * b.r - a.i * b.i;
result.i = a.i * b.r + a.r * b.i;
return result;
}

/* add two complex numbers */
__device__ struct Complex add(struct Complex a, struct Complex b) {
struct Complex result;
result.r = a.r + b.r;
result.i = a.i + b.i;
return result;
}

/* the julia function */
__device__ int julia(int x, int y) {
/* translate the point into complex space */
const float scale = 1.5;
float jx = scale * (float)(SIZE / 2 - x) / (SIZE / 2);
float jy = scale * (float)(SIZE / 2 - y) / (SIZE / 2);

/* arbitrary starting point */
struct Complex c;
c.r = -0.8;
c.i = 0.156;

struct Complex a;
a.r = jx;
a.i = jy;

int i;
/* apply the iterative process */
for (i = 0; i < 200; i++) {
if (magnitude(a) > 1000) {
return 0;
}
}

return 1;
}

/* the main kernel function runs julia on each pixel */
__global__ void kernel(unsigned char* data) {
/* map from blockIdx to pixel position */
int x = blockIdx.x;
int y = blockIdx.y;

int offset = x + y * gridDim.x;

/* calculate the value at this position */
int julia_value = julia(x, y);
data[offset * 3 + 0] = 255 * julia_value;
data[offset * 3 + 1] = 0;
data[offset * 3 + 2] = 0;
}

/* the main function begins running on the CPU */
int main(int argc, char** argv) {
/* create the image */
struct Image image;
image.width = SIZE;
image.height = SIZE;
image.data = (unsigned char*) malloc(SIZE * SIZE * 3 * sizeof(unsigned char));

/* allocate space for the image data */
unsigned char* gpu_image;
cudaMalloc((void**) &gpu_image, SIZE * SIZE * 3 * sizeof(unsigned char));

/* run the kernel */
dim3 grid(SIZE, SIZE);
kernel<<<grid, 1>>>(gpu_image);

/* copy the data back into the image */
cudaMemcpy(image.data, gpu_image, SIZE * SIZE * 3 * sizeof(unsigned char), cudaMemcpyDeviceToHost);

/* free the GPU memory */
cudaFree(gpu_image);

/* save the image */
save_image(&image, "julia.ppm");

return 0;
}


The output of this program is this (very large) image.

The interesting thing about the Julia set is that the relatively simple function gives very interesting output.