CPSC 425

Distributed Computing



Multi-threading systems like Pthreads and OpenMP can run on shared memory systems, such as server and desktop computers with multiple CPU cores. In these systems, multiple threads work from within one process.

Multi-processing systems like MPI can also run on multi-core computer systems. Here, multiple processes work together, sharing data when needed by passing messages.

However, MPI can also run on distributed systems, which consist of multiple independent computer systems connected together over a network. This can be either a cluster of computers connected together on a LAN, or computers working together over the wider internet.


Running MPI on a Cluster

Clusters can be created out of computers connected together on a LAN. In order to run parallel programs on a cluster with MPI, one must have a user account on each node of the cluster, and be able to SSH to each node.

Also, the MPI program and all data files must be present on each node. This can be achieved by using a network file system like NFS.

To run a distributed program, we can pass mpirun the --hostfile parameter along with a file containing the hosts which can run processes:

$ mpirun -np 4 --hostfile hosts ./program

Here "hosts" is just a text file containing the hostnames or IP addresses of the machines which can launch processes, one per line. For instance, to run this program on 4 nodes with hostnames node0 through node3, we could use the following hostfile:

node0
node1
node2
node3

or we could use IP addresses instead (these are just example addresses on a private LAN):

192.168.1.10
192.168.1.11
192.168.1.12
192.168.1.13

We can also have multiple processes running on each node of the cluster. For instance, if there are 4 machines which each have 2 processor cores, we could use the following hostfile:

node0 slots=2
node1 slots=2
node2 slots=2
node3 slots=2

Now MPI will know that there are 8 total processor cores it can utilize.

It's possible that we will want more processes running than we have resources for. What happens if we try to launch 16 processes on this cluster with only 8 actual CPUs? MPI will launch 4 processes on each node of the cluster; they will just have to share execution time. (Note that recent versions of Open MPI may require passing the --oversubscribe flag to mpirun before launching more processes than there are slots.)


Performance Considerations

On our shared memory system, the cs server, messages are sent relatively quickly, since they are sent through the memory hierarchy of a single computer system. For instance, the following program computes whether a large number of integers are prime, and then sends that information to all processes:

    /* start timing computation */
    struct timeval before, after;
    gettimeofday(&before, NULL);

    /* calculate the start and end points by evenly dividing the range */
    int start = (N / size) * rank;
    int end = start + (N / size) - 1;

    /* the last process needs to do all remaining ones */
    if (rank == (size - 1)) {
        end = N - 1;
    }

    /* compute if a number is prime or not */
    for (int i = start; i <= end; i++) {
        if (is_prime(i)) {
            primes[i - start] = 1;
        } else {
            primes[i - start] = 0;
        }
    }

    /* we are done with the computation */
    gettimeofday(&after, NULL);
    if (rank == 0) {
        printf("Computation done in %lf seconds\n", diff_time(before, after));
    }

    /* gather the data back up to all processes */
    gettimeofday(&before, NULL);
    int* data = malloc(N * sizeof(int));

    MPI_Allgather(primes, N / size, MPI_INT, data, N / size, MPI_INT, MPI_COMM_WORLD);

    /* we are done with the communication */
    gettimeofday(&after, NULL);
    if (rank == 0) {
        printf("Communication done in %lf seconds\n", diff_time(before, after));
    }
When run with N = 100,000,000 and 32 processors, how much time is spent doing computation vs. communication? How much time is spent on our Raspberry Pi cluster?

Copyright © 2022 Ian Finlayson | Licensed under an Attribution-NonCommercial 4.0 International License.