CPSC 425

MPI Odds & Ends

 

MPI Implementations

"MPI" is not a particular piece of code, but a standard. Like programming language standards, there are multiple possible implementations of this standard.

Vendors of high-performance computing systems have proprietary implementations. There are, however, two popular free implementations: Open MPI and MPICH.
 

MPI Barriers

Recall that a barrier is a point in a parallel program that all threads or processes must reach before any of them is allowed to continue past it.

A barrier is easy to achieve in an MPI program with the MPI_Barrier function:


int MPI_Barrier(MPI_Comm comm)

The only parameter is the communicator group. A calling process blocks in the barrier until every process in the communicator group has also called it.

The following program demonstrates MPI_Barrier:


#include <unistd.h>
#include <stdlib.h>
#include <mpi.h>
#include <stdio.h>


int main(int argc, char** argv) {
    int rank, size;

    /* initialize MPI */
    MPI_Init(&argc, &argv);

    /* get the rank (process rank) and size (number of processes) */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* we start to work */
    printf("Process %d starting!\n", rank);

    /* simulate the processes taking slightly different amounts of time by sleeping
     * for our process rank seconds */
    sleep(rank);
    printf("Process %d has finished its work!\n", rank);

    /* a barrier */
    MPI_Barrier(MPI_COMM_WORLD);

    printf("Process %d is past the barrier!\n", rank); 

    /* quit MPI */
    MPI_Finalize();
    return 0;
}
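To try the program, compile it with the MPI compiler wrapper and launch it with mpirun (a sketch assuming the source is saved as barrier.c; the wrapper is called mpicc in both Open MPI and MPICH):

```shell
$ mpicc barrier.c -o barrier
$ mpirun -np 4 ./barrier
```

Because of the barrier, no process prints its "past the barrier" line until the slowest process (the one that sleeps longest) reaches the barrier, though the exact interleaving of output lines is up to the MPI runtime.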

 

Creating Communicator Groups

We have seen that most MPI functions deal with a communicator group. Often, just using the MPI_COMM_WORLD communicator group is sufficient. Other times, we may want to have a subset of processes work on something.

To do this, we first need to create an MPI_Group structure to hold a subset of our processes. The MPI_Group_incl function is used to specify a group:


int MPI_Group_incl(MPI_Group group, int n, const int ranks[], MPI_Group *newgroup)

An MPI_Group refers to a set of processes, while an MPI_Comm is what actually lets a group communicate. So, in order to communicate amongst our group, we also need to create a communicator with MPI_Comm_create:


int MPI_Comm_create(MPI_Comm comm, MPI_Group group, MPI_Comm *newcomm)

The following program uses these functions to create two new communicator groups: one for the first half of processes, and one for the second:


#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    /* our rank and size */
    int rank, size;

    /* we need to create groups of processes */
    MPI_Group orig_group, new_group;

    /* find our rank and size in MPI_COMM_WORLD */
    MPI_Init(&argc,&argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* make arrays storing the ranks of the processes in group1 and group2
     * (when size is odd, the second group gets the extra process) */
    int ranks1[size / 2], ranks2[size - size / 2], i;
    for (i = 0; i < size/2; i++) {
        ranks1[i] = i;
    }
    for (i = size/2; i < size; i++) {
        ranks2[i - size/2] = i;
    }

    /* find the original group */
    MPI_Comm_group(MPI_COMM_WORLD, &orig_group);

    /* Divide tasks into two distinct groups based upon rank */
    if (rank < size/2) {
        MPI_Group_incl(orig_group, size/2, ranks1, &new_group);
    } else {
        MPI_Group_incl(orig_group, size - size/2, ranks2, &new_group);
    }

    /* Create new communicator for our group */
    MPI_Comm new_comm;
    MPI_Comm_create(MPI_COMM_WORLD, new_group, &new_comm);

    /* have the processes sum the ranks of each group */
    int send = rank, recv;
    MPI_Allreduce(&send, &recv, 1, MPI_INT, MPI_SUM, new_comm);

    /* get our rank within the new group */
    int grank;
    MPI_Comm_rank(new_comm, &grank);

    /* print the results */
    printf("Process %d (%d in sub-group) has %d!\n", rank, grank, recv);

    /* quit */
    MPI_Finalize();
    return 0;
}

 

Distributed MPI

Multi-threading systems like Pthreads and OpenMP can run on shared memory systems, such as server and desktop computers with multiple CPU cores. In these systems, multiple threads work from within one process.

Multi-processing systems like MPI can also run on multi-core computer systems. Here, multiple processes work together, sharing data when needed by passing messages.

However, MPI can also run on distributed systems, which consist of multiple independent computers connected over a network. This can be either a cluster of computers connected together on a LAN, or computers working together over the wider internet.


 

Running MPI on a Cluster

Clusters can be created out of computers connected together on a LAN. In order to run parallel programs on a cluster with MPI, one must have a user account on each node of the cluster and be able to SSH to each node (typically with password-less, key-based authentication, so mpirun can launch processes without prompting).

Also, the MPI program and all data files must be present on each node. This can be achieved by using a network file system like NFS.

To run a distributed program, we can pass mpirun the --hostfile parameter along with a file containing the hosts which can run processes:

$ mpirun -np 4 --hostfile hosts ./program

Here "hosts" is just a text file listing the hostnames or IP addresses of machines that can launch processes, one per line. For instance, to run this program on 4 nodes, we could use the following hostfile:


node0
node1
node2
node3

or we could use IP addresses instead:

192.168.1.10
192.168.1.11
192.168.1.12
192.168.1.13

We can also run processes across a cluster, and also have multiple processes running on each node. For instance, if there are 4 machines which each have 2 processor cores, we could use the following hostfile:


node0 slots=2
node1 slots=2
node2 slots=2
node3 slots=2

Now MPI will know that there are 8 total slots (one per core) it can utilize. For example, mpirun -np 8 --hostfile hosts ./program would place two processes on each node.

It's possible that we will want more processes running than we have resources for. What happens if we try to launch 16 processes on this cluster with only 8 actual CPUs? MPI will launch 4 on each node of the cluster, and they will simply have to share execution time. (Note that recent versions of Open MPI refuse to oversubscribe nodes by default; passing the --oversubscribe flag to mpirun allows it.)

Copyright © 2022 Ian Finlayson | Licensed under a Attribution-NonCommercial 4.0 International License.