The "von Neumann bottleneck" is the interconnect between the CPU and main memory, which limits how quickly instructions and data can reach the processor.
Instruction level parallelism (ILP) attempts to improve processor performance by having multiple processor components or functional units execute instructions simultaneously. It takes two main forms:
Pipelining: overlapping the execution of multiple instructions in different stages.
Multiple issue: starting several instructions at the same time.
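To see why overlapping instructions in stages (commonly called pipelining) helps, here is a rough back-of-the-envelope sketch. It assumes an idealized pipeline in which every stage takes exactly one cycle and nothing ever stalls; the function names and the example numbers are made up for illustration.

```python
def unpipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Each instruction goes through every stage before the next one starts."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """The first instruction fills the pipeline; each later one finishes one cycle after it."""
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


if __name__ == "__main__":
    n, k = 1000, 5  # hypothetical instruction count and pipeline depth
    print(unpipelined_cycles(n, k))  # 5000
    print(pipelined_cycles(n, k))    # 1004
```

Under these assumptions the speedup approaches the number of stages as the instruction count grows. Real pipelines fall short of this because of stalls and dependencies between instructions.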
Instruction level parallelism happens automatically in hardware, so it is not what we will be concerned with in this class.
A process is an instance of a program being executed. It contains:
The main job of the operating system is to run multiple processes concurrently.
A program can also launch extra processes for itself, but each has distinct memory, so sharing of data must be done manually.
Threads are contained within processes. All threads of a process share one address space and can access the same data. All mutual exclusion must be done manually.
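As a small illustration of threads sharing data and doing mutual exclusion by hand, here is a sketch using Python's standard `threading` module. The function names and the counts are made up for this example; the point is only that every thread updates the same `shared` dictionary, and the lock makes each update happen one thread at a time.

```python
import threading


def add_to_counter(shared, lock, increments):
    """Increment the shared counter, holding the lock for each update."""
    for _ in range(increments):
        with lock:  # only one thread at a time may run this block
            shared["count"] += 1


def run_threads(num_threads=4, increments=10_000):
    """Start several threads that all update the same counter, then wait for them."""
    shared = {"count": 0}  # lives in the one address space all threads see
    lock = threading.Lock()
    threads = [
        threading.Thread(target=add_to_counter, args=(shared, lock, increments))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["count"]


if __name__ == "__main__":
    print(run_threads())  # 40000
```

Without the lock, two threads could read the same old value and both write back the same new one, losing an update.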
Caches help alleviate the von Neumann bottleneck. They involve one or more levels of memory closer to the CPU.
Below is an image of the cache layout of an Intel i7 chip:
The following table gives typical sizes and access speeds of cache levels:
| Memory Type | Typical Size | Typical Speed |
|-------------|--------------|---------------|
| L1 Cache    | 32 KB        | 4 cycles      |
| L2 Cache    | 256 KB       | 10 cycles     |
| L3 Cache    | 8 MB         | 50 cycles     |
| Main Memory | 8 GB         | 800 cycles    |
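To get a feel for what these numbers mean, here is a rough sketch that estimates the average cost of a memory access. The cycle counts come from the table above. The hit rates are invented for illustration, and the model assumes each cycle count is the total cost of getting data from that level.

```python
# (level name, access cost in cycles) taken from the table above
LEVELS = [
    ("L1 Cache", 4),
    ("L2 Cache", 10),
    ("L3 Cache", 50),
    ("Main Memory", 800),
]

# Hypothetical chance of finding the data at each cache level, given that
# every faster level missed. Main memory is assumed to always have the data.
HIT_RATES = [0.90, 0.80, 0.70]


def average_access_cycles(levels, hit_rates):
    """Expected cycles per access when each level is tried in order."""
    expected = 0.0
    reach_probability = 1.0  # chance that an access gets as far as this level
    for (_name, cycles), hit_rate in zip(levels, list(hit_rates) + [1.0]):
        expected += reach_probability * hit_rate * cycles
        reach_probability *= 1.0 - hit_rate
    return expected


if __name__ == "__main__":
    print(f"{average_access_cycles(LEVELS, HIT_RATES):.1f} cycles")  # 9.9 cycles
```

Even with these made-up rates, where only 0.6% of accesses reach main memory, those accesses make up nearly half of the average cost.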
When doing a memory access, the following process happens:
Early SIMD machines were called "vector processors".
The idea lives on in GPUs and media processors.
Graphics processing units were originally created for rendering graphics quickly. This involves a few common operations:
These operations also must be applied to large numbers of vertices or pixels, opening up the possibility of data parallelism.
These capabilities are great for many other computational tasks.
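Here is a small sketch of what data parallelism looks like in code: the same operation is applied independently to every vertex, so the work can be divided up freely. The `translate` operation and the offset are made up for this example. The thread pool is only there to show the structure; in standard CPython it will not make arithmetic like this faster, whereas a GPU would really apply the operation to many elements at once.

```python
from concurrent.futures import ThreadPoolExecutor


def translate(vertex, offset=(1.0, 2.0, 3.0)):
    """Move one (x, y, z) vertex by a fixed offset, independently of all others."""
    x, y, z = vertex
    dx, dy, dz = offset
    return (x + dx, y + dy, z + dz)


def translate_all(vertices, workers=4):
    """Apply the same operation to every vertex; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(translate, vertices))


if __name__ == "__main__":
    vertices = [(float(i), 0.0, 0.0) for i in range(8)]
    print(translate_all(vertices)[:2])  # [(1.0, 2.0, 3.0), (2.0, 2.0, 3.0)]
```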
GPUs are very different from CPUs:
MIMD is more general than SIMD as each core can execute different instructions.
MIMD parallel machines are broken down in terms of how memory is accessed:
The memory system has a huge impact on how to program the system effectively.
Supercomputers are large clusters that combine many powerful shared memory systems, networked together. At the time of writing, the most powerful supercomputer in the world is the "Sunway TaihuLight" in China, which has 10,649,600 total CPU cores.
We will look at two major ways of doing parallel programming:
Multi-threading involves creating multiple threads in one process. It can be used to write programs for shared memory systems, and makes sharing data easy.
Multi-processing involves creating multiple processes. It can be used to write programs for shared memory or distributed memory systems, and sharing data must be done more explicitly.
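To show what "sharing data explicitly" means for the process-based approach, here is a minimal sketch using only Python's standard `subprocess` module. It is a stand-in for illustration, not a particular parallel programming library. The child program and the `sum_in_child` helper are made up for this example. The child process cannot see the parent's variables, so the numbers are sent to it as text and the result is read back the same way.

```python
import subprocess
import sys

# Program run by the child process. It has its own memory, so it only
# knows about the numbers it receives on standard input.
CHILD_PROGRAM = """
import sys
numbers = [int(line) for line in sys.stdin]
print(sum(numbers))
"""


def sum_in_child(numbers):
    """Send the numbers to a separate process and read back their sum."""
    message = "\n".join(str(n) for n in numbers)
    result = subprocess.run(
        [sys.executable, "-c", CHILD_PROGRAM],
        input=message,        # data is passed to the child explicitly
        capture_output=True,  # the answer comes back explicitly
        text=True,
        check=True,
    )
    return int(result.stdout)


if __name__ == "__main__":
    print(sum_in_child([1, 2, 3, 4]))  # 10
```

Because nothing is shared implicitly, the same pattern also works when the processes are on different machines, provided there is some channel to send the messages over.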
Copyright © 2018 Ian Finlayson | Licensed under a Creative Commons Attribution 4.0 International License.