
Parallelism

 

Overview

Nearly all computers these days are parallel machines. Even phones and tablets are multi-core computers, and include multi-core GPUs as well. Only the smallest micro-controllers are still single-core machines.

Today, we will look at some issues involved with parallel computer hardware.


 

Moore's Law and Parallelism

Recall that Moore's law says that the number of transistors in a fixed-size integrated circuit doubles every two years. Below is a graph demonstrating Moore's law:


Moore's law has continued up to the present day, but it has slowed down. In 2022, Nvidia CEO Jensen Huang stated that Moore's law was dead, while others in the industry disagree. There is no consensus on when it will end, but it is likely to be soon.

For most of the duration of Moore's law, hardware designers used the ever-smaller transistors to increase the processor clock speed.

Below is a graph of clock speeds of Intel processors:

Why did the clock speeds level off in 2006? Why have they still not increased past this point?



The answer is power: the faster a processor is clocked, the more power it draws and the more heat it gives off, and around 2006 processors hit this "power wall". Because clock speeds can no longer increase, we use the increased number of transistors Moore's law gives us primarily to do one of two things:

  1. Include more and bigger caches.
  2. Include more processor cores on chip.

 

Types of Parallelism

Parallelism appears in hardware in several forms. In this lecture we will look at three of them: hardware multi-threading, multi-core processors (and their caches), and GPUs.


 

Hardware Multi-threading

In multi-threading, the processor alternates between multiple threads, which may or may not be part of the same process. There are two major approaches to switching between multiple threads:

  1. Context switching, where the operating system saves the full state of the running thread and restores the state of another. This can handle any number of threads, but takes many clock cycles.
  2. Fine-grained multi-threading, where the processor keeps the state of a small number of threads in hardware, and can switch between them very quickly, as often as every clock cycle.

All processors must be able to do context switching, because the number of threads is unbounded, but supporting fine-grained multi-threading in addition offers increased performance.
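
To see how many hardware threads a machine exposes, we can ask the operating system. Below is a minimal sketch, assuming a POSIX-style system (such as Linux or macOS) where the widely supported, though non-standard, _SC_NPROCESSORS_ONLN query is available:

#include <stdio.h>
#include <unistd.h>

int main() {
    /* the number of logical processors currently online; on a chip with
     * fine-grained multi-threading this counts hardware threads, which
     * may be higher than the number of physical cores */
    long processors = sysconf(_SC_NPROCESSORS_ONLN);
    printf("Logical processors: %ld\n", processors);
    return 0;
}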


 

Multi-core and Caching

Different computer systems use different strategies for splitting caches up amongst multiple cores. The i7, as shown below, gives each core its own L1 and L2 cache, but the cores all share one L3 cache:
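
On Linux with glibc, we can query the sizes of these caches directly. This sketch uses sysconf flags that are glibc extensions rather than standard POSIX, so it may not work on other systems:

#include <stdio.h>
#include <unistd.h>

int main() {
    /* the _SC_LEVEL* flags are glibc extensions; on systems that
     * lack them, these calls may return -1 or 0 */
    printf("L1 data cache:   %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L2 cache:        %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3 cache:        %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    printf("Cache line size: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    return 0;
}

On a multi-core chip like the i7 described above, the L1 and L2 numbers are per-core, while the L3 cache is shared by all cores.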


 

Cache Writing Policy

There is another complication of caching that comes into play when we have multiple cores sharing one cache: the question of what should happen when values in the cache are written.

Recall that cached values also exist in higher-level caches, and also in main memory. When the processor modifies a value, what should be done with the corresponding entries in main memory (or higher cache levels)? There are two basic policies:

  1. Write-through: every write to the cache is immediately performed on the next level down as well. This keeps all levels consistent, but makes every write as slow as the level below.
  2. Write-back: writes go only to the cache, and the block is marked as "dirty". The modified value is written to the next level only when the block is evicted. This is faster, but means lower levels can temporarily hold stale values.
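
Below is a toy sketch of the write-back idea, for illustration only (not how any real hardware is implemented). Writes update only the cached copy and set a dirty bit; the value reaches memory only on eviction:

#include <stdbool.h>
#include <stdio.h>

int main_memory = 0;  /* stand-in for the next level of the hierarchy */

/* a toy model of a single cache line under a write-back policy */
struct cache_line {
    int value;
    bool dirty;  /* true if the cached copy differs from memory */
};

/* a write-back write: update the cache only, and mark it dirty */
void cache_write(struct cache_line* line, int value) {
    line->value = value;
    line->dirty = true;  /* main memory is now stale */
}

/* on eviction, a dirty line must be written back to memory */
void cache_evict(struct cache_line* line) {
    if (line->dirty) {
        main_memory = line->value;
        line->dirty = false;
    }
}

int main() {
    struct cache_line line = {0, false};

    cache_write(&line, 42);  /* write-through would update memory here */
    printf("memory before eviction: %d\n", main_memory);  /* prints 0 */

    cache_evict(&line);
    printf("memory after eviction:  %d\n", main_memory);  /* prints 42 */
    return 0;
}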


 

Cache Coherence

In a multi-core setup, caching can cause additional issues. What happens if two processors both have a copy of the same value, call it x, in their private caches?

What if both processors write to their copy of x?

Which one should be written back to main memory?

A cache coherence policy attempts to manage these conflicts.

Approaches to cache coherence include:

  1. Snooping, where each cache monitors (snoops on) the shared memory bus. When one core writes to a block, the other caches see the write and invalidate or update their copies.
  2. Directory-based protocols, where a directory keeps track of which cores hold a copy of each block, and messages are sent only to those cores.

What are the trade-offs between these approaches?

Both of these incur a performance penalty. Even without synchronization, sharing data between threads can result in performance degradation.


 

Cache Coherence Example

Cache coherence has a huge impact on performance and is something programmers must be aware of to write efficient parallel code. Below is part of a program that estimates Pi using multiple threads:


#include <pthread.h>
#include <stdlib.h>

#define THREADS 4
#define TOSSES 50000000

/* global array of random seeds (rand_r requires unsigned int) */
unsigned int seeds[THREADS];

/* the function called for each thread */
void* estimate_pi(void* idp) {
    int number_in_circle = 0, toss;

    /* point seed at this thread's slot in the global
     * seeds array, using the thread id as the index */
    unsigned int* seed = &seeds[*(int*)idp];

    /* loop through each toss */
    for (toss = 0; toss < TOSSES; toss++) {
        double x = ((double) rand_r(seed) / (double) RAND_MAX) * 2.0 - 1.0;
        double y = ((double) rand_r(seed) / (double) RAND_MAX) * 2.0 - 1.0;
        double dist_squared = x*x + y*y;

        if (dist_squared <= 1) {
            number_in_circle++;
        }
    }

    /* calculate this thread's pi estimate */
    double* pi_estimate = malloc(sizeof(double));
    *pi_estimate = (4 * number_in_circle) / (double) TOSSES;

    pthread_exit(pi_estimate);
}

The program, which uses pthreads, uses random numbers to estimate Pi: it tosses random points into a square and counts how many land inside the inscribed circle. For the random seeds, which rand_r reads and updates on every call, it uses a global array with one slot for each thread.
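
The thread function above is only part of the program; the driver code is not shown in these notes. As a hypothetical sketch (the variable names and seeding choice here are our own, and it additionally needs #include <stdio.h>), a main function might create the threads and average their estimates like this:

int main() {
    pthread_t threads[THREADS];
    int ids[THREADS];

    /* launch one thread per slot, giving each a distinct id and seed */
    for (int i = 0; i < THREADS; i++) {
        ids[i] = i;
        seeds[i] = i;
        pthread_create(&threads[i], NULL, estimate_pi, &ids[i]);
    }

    /* wait for each thread, collecting and averaging the estimates */
    double total = 0.0;
    for (int i = 0; i < THREADS; i++) {
        void* result;
        pthread_join(threads[i], &result);
        total += *(double*) result;
        free(result);  /* estimate_pi allocated this with malloc */
    }

    printf("Pi is approximately %f\n", total / THREADS);
    return 0;
}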

The problem is that this small array fits into a single cache line. Because each call to rand_r writes its thread's seed, every core repeatedly invalidates the other cores' copies of that line, and the hardware must synchronize the line across all cores, hurting performance. This is a classic case of false sharing: the threads never touch each other's seeds, yet they still pay the cost of sharing the cache line.

How long does the program take to execute?

In order to greatly speed up the program, we need to have each thread store its seed in its own cache line. This can be done by replacing the pointer with a local variable in the function:


/* copy our thread's seed into a LOCAL variable
 * (the rand_r calls must now pass &seed instead of seed) */
unsigned int seed = seeds[*(int*)idp];

Now there is no cache coherence issue. Each thread's seed is a local variable on its own stack, so the seeds end up in separate cache lines and do not need to be synchronized. How much faster does the program run now?
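
Another common fix, if the seeds really must stay in a shared global array, is to pad each element out to a full cache line so that no two threads share one. A sketch, replacing the global declaration and assuming 64-byte cache lines (typical for x86, but not guaranteed on every processor):

/* one seed per 64-byte cache line; the padding wastes space,
 * but prevents false sharing between the threads */
struct padded_seed {
    unsigned int seed;
    char padding[64 - sizeof(unsigned int)];
};

struct padded_seed seeds[THREADS];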


 

GPUs

Graphics processing units (GPUs) are another type of processor that most computer systems contain. GPUs are primarily used to render graphics onto the screen, but can now also run arbitrary programs.

GPUs are quite different from CPUs:

                     CPU                                        GPU
Goal                 Latency - finish each individual           Throughput - finish a large number
                     task quickly.                              of tasks quickly.
Typical Cores        2 to 8                                     Thousands
Typical Speed        Around 2-4 GHz                             Up to 1 GHz
Core Independence    Each core runs independently.              Many cores run the same programs.

GPU cores do not run thousands of independent programs. Instead, they are grouped, and each group runs the same program, but on different data. This works well for graphics, where the same transformations are applied to every pixel on the screen.

GPUs are also good for other tasks where the same computations have to be applied across a large amount of data.
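
As a rough illustration in plain C (a real GPU program would use something like CUDA or OpenCL, which these notes do not cover), consider applying one computation across a large array:

/* the same computation applied to every element; on a CPU this is a
 * sequential loop, but on a GPU the loop disappears: each iteration
 * becomes its own thread, with thousands of threads all running this
 * same code on different pieces of the data */
void scale_and_shift(float* data, int n) {
    for (int i = 0; i < n; i++) {
        data[i] = data[i] * 2.0f + 1.0f;
    }
}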

Copyright © 2024 Ian Finlayson | Licensed under a Creative Commons BY-NC-SA 4.0 License.