CPSC 425

Parallel Hardware


The von Neumann Model

Instruction cycle:

  1. Read instruction from memory.
  2. Decode instruction.
  3. Execute instruction.
  4. Write results to memory.

The "von Neumann bottleneck" is the interconnect between the CPU and main memory.


Instruction Level Parallelism

Instruction level parallelism attempts to improve processor performance by having multiple processor components or functional units executing instructions simultaneously.

Instruction level parallelism happens automatically in hardware, so it is not what we will be concerned with in this class.


Processes & Threads

A process is an instance of a program being executed. It contains the program's executable code, its memory (stack, heap, and global data), and operating system state such as open files.

The main job of the operating system is to run multiple processes concurrently.

A program can also launch extra processes for itself, but each process has its own distinct memory, so any sharing of data must be done manually.

Threads are contained within processes. All threads of a process share one address space and can access the same data. All mutual exclusion must be done manually.



Caches

Caches help alleviate the von Neumann bottleneck. A cache is one or more levels of smaller, faster memory placed closer to the CPU.

[Image: the cache layout of an Intel i7 chip.]

The following table gives typical sizes and access speeds of cache levels:

Memory Type    Typical Size   Typical Speed
L1 Cache       32 KB          4 cycles
L2 Cache       256 KB         10 cycles
L3 Cache       8 MB           50 cycles
Main Memory    8 GB           800 cycles

When doing a memory access, the following process happens:

  1. The first level cache is searched.
  2. If it's not there, the second level cache is searched.
  3. If it's not there, the third level cache is searched.
  4. If it's not there, the location is accessed in main memory.


Flynn's Taxonomy

Flynn's taxonomy classifies computer architectures by the number of instruction streams and data streams they support:

  - SISD (single instruction, single data): a classical serial computer.
  - SIMD (single instruction, multiple data): one instruction stream operates on many data items at once.
  - MISD (multiple instruction, single data): rarely built in practice.
  - MIMD (multiple instruction, multiple data): independent cores each run their own instructions on their own data.
Early SIMD machines were called "vector processors".

The idea lives on in GPUs and media processors.



GPUs

Graphics processing units were originally created for rendering graphics quickly. This involves a few common operations:

  - Transforming vertex positions with matrix multiplications.
  - Computing the lighting and color of each pixel.
  - Looking up and applying texture data.

These operations must also be applied to large numbers of vertices or pixels, opening up the possibility of data parallelism.

These capabilities are great for many other computational tasks.

GPUs are very different from CPUs:

  - They have many more cores, but each core is simpler and slower.
  - Cores execute in a SIMD style, with groups of cores running the same instruction on different data.
  - They are designed for high throughput rather than low latency.



MIMD

MIMD is more general than SIMD, as each core can execute different instructions.

MIMD parallel machines are broken down in terms of how memory is accessed:

  - Shared memory systems, where all cores access one common memory.
  - Distributed memory systems, where each node has its own private memory and nodes communicate over a network.

The memory system has a huge impact on how to program the system effectively.



Supercomputers

Supercomputers are large clusters of powerful computer systems: many shared memory nodes networked together. The most powerful supercomputer in the world is currently the Frontier system at Oak Ridge National Laboratory in Tennessee, which has 8,730,112 total CPU cores.


Programming Parallel Machines

We will look at two major ways of doing parallel programming:

  - Shared memory programming, where threads communicate through common variables.
  - Distributed memory programming, where processes communicate by passing messages.

Copyright © 2024 Ian Finlayson | Licensed under an Attribution-NonCommercial 4.0 International License.