Hybrid MPI-OpenMP Jobs
For performance improvements, some codes hybridize MPI with OpenMP. For further performance improvements, core binding may be used. Core binding is the explicit assignment of a processor core to an OpenMP thread. Still more improvements may be made by taking advantage of Non-Uniform Memory Access (NUMA), i.e. ensuring that the memory used by each processor core is the memory that is physically closest to it.
On Picotte, integration of all the above is directly supported by Slurm + OpenMPI + OpenMP.
Basics
Some of the information here is already in the MPI article, but is reproduced here for convenience.
Terminology
- An MPI job runs some number of processes, which are called ranks. In Slurm terminology, each rank is a task. (N.B. Slurm also uses the term “task” to refer to the individual “sub-jobs” of an array job.) Each rank is a separate process: if you run the ps command, or the top command, you will see an individual entry for each rank, all with the same name, which is the name of the program you run. (MPI allows these ranks to communicate with each other.) Each rank will use at most 100% CPU, i.e. complete utilization of a single CPU core; this is what will be shown by top.
- An OpenMP job runs a single process, but uses multiple threads, i.e. there is only a single entry for the program in the output of ps or top. However, because this process is multithreaded, the utilization shown by top will be greater than 100%: in fact, it should be about the number of threads multiplied by 100%.
- A hybrid job will run fewer ranks, but each rank will run multithreaded. In principle, this reduces inter-process communication and replaces it with "shared memory" communication that requires less overhead. In practice, the performance will depend greatly on your specific application.
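A quick way to tell these cases apart on a compute node is to look at the per-process thread count. Here is a rough sketch (prog_name is a placeholder for your program's name):
# One line per process you own, with its thread count (NLWP) and CPU utilization.
# An MPI-only job shows many prog_name entries, each with NLWP near 1; a hybrid
# job shows fewer entries, each with NLWP roughly equal to OMP_NUM_THREADS.
ps -u $USER -o pid,nlwp,pcpu,comm | grep prog_name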
Basic MPI Job Configuration
As a start, MPI jobs should assign:
- 1 rank per socket
- 1 thread per CPU core
This can be changed manually, by passing appropriate values as arguments to $MPI_RUN. One may want to do this to find the best rank/thread distribution.
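For example, here is a rough sketch using OpenMPI's --map-by and --bind-to options (./prog is a placeholder; option syntax can vary between OpenMPI versions):
# 1 rank per socket, each rank bound to its socket so its OpenMP threads
# can use that socket's cores (here, 12 threads per rank)
export OMP_NUM_THREADS=12
${MPI_RUN} --map-by ppr:1:socket --bind-to socket ./prog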
Example Program
Here is a simple hybrid program[1]:
#include <stdio.h>
#include "mpi.h"
#include <omp.h>

int main(int argc, char *argv[]) {
    int numprocs, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int iam = 0, np = 1;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);

    /* Each OpenMP thread in each MPI rank reports its thread number,
     * the rank it belongs to, and the node it is running on. */
    #pragma omp parallel default(shared) private(iam, np)
    {
        np = omp_get_num_threads();
        iam = omp_get_thread_num();
        printf("Hello from thread %d out of %d from rank %d out of %d on %s\n",
               iam, np, rank, numprocs, processor_name);
    }

    MPI_Finalize();
    return 0;
}
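To build the example, something like the following should work once an MPI module is loaded (a sketch; it assumes a GCC-based OpenMPI toolchain, where the mpicc wrapper accepts -fopenmp):
# compile with both MPI (via the mpicc wrapper) and OpenMP (-fopenmp) support
mpicc -fopenmp -O2 -o hello-hybrid hello-hybrid.c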
Hardware Layout on Picotte
See: Message Passing Interface#Picotte hardware notes
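You can also inspect a compute node's socket and NUMA layout directly with lscpu (a sketch; run it on a compute node, e.g. in an interactive job):
# show sockets, cores per socket, threads per core, and NUMA node layout
lscpu | grep -E 'Socket|Core\(s\) per socket|Thread\(s\) per core|NUMA'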
Slurm on Picotte
Since Picotte uses Slurm, you can use the environment variable SLURM_CPUS_PER_TASK to set OMP_NUM_THREADS.
For example, say you would like to run on all 96 cores of two nodes (each node has 48 cores), with 12 threads per MPI rank (each thread assigned to its own CPU core). Your job script would look something like:
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=12
#SBATCH --mem-per-cpu=3G
...
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
${MPI_RUN} prog_name
If you add the --display-map option to ${MPI_RUN}, it will output the mapping of ranks to CPU cores so that you can confirm it is the configuration you expect.
OpenMPI - Basic Usage on a Single Node
N.B. You must experiment to figure out the distribution of MPI ranks and OMP threads which gives best performance.
You can specify the number of threads per rank by setting the environment variable OMP_NUM_THREADS. You can write some bash to compute one value from the other, if you wish.
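For instance, a small sketch that falls back to a single thread if --cpus-per-task was not set:
# use Slurm's per-task CPU count if set; otherwise default to 1 thread per rank
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}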
Example: 1 node, 4 ranks, 12 threads per rank
First, note the “S:C:T” (Socket:Core:Thread) value reported by sinfo_detail:
[juser@picotte001 ~]$ module load slurm_util
[juser@picotte001 ~]$ sinfo_detail
NODELIST NODES PART STATE CPUS S:C:T GRES MEMORY FREE_MEM TMP_DISK CPU_LOAD REASON
node001 1 def* mixed 48 4:12:1 (null) 192000 140095 864000 4.02 none
node002 1 def* mixed 48 4:12:1 (null) 192000 168169 864000 4.00 none
node003 1 def* mixed 48 4:12:1 (null) 192000 166673 864000 4.00 none
...
We assign:
- 1 rank (task) per socket
- 1 thread per core (no Hyper-Threading)
We want 4 ranks per node, 12 OMP threads per rank; since this job uses a single node, the whole job will have 4 ranks, each rank running 12 OMP threads.
#!/bin/bash
#SBATCH --partition=def
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=12
...
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
${MPI_RUN} --display-map ./hello-hybrid
The “--display-map” option to ${MPI_RUN} will print out a list of all nodes with their assigned ranks.
The output from this program will look like:
Data for JOB [61963,1] offset 0 Total slots allocated 4
======================== JOB MAP ========================
Data for node: node020 Num slots: 4 Max slots: 0 Num procs: 4
Process OMPI jobid: [61963,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0]]:[B/././././././././././././././././././././././.][./././././././././././././././././././././././.]
Process OMPI jobid: [61963,1] App: 0 Process rank: 1 Bound: socket 0[core 1[hwt 0]]:[./B/./././././././././././././././././././././.][./././././././././././././././././././././././.]
Process OMPI jobid: [61963,1] App: 0 Process rank: 2 Bound: socket 0[core 2[hwt 0]]:[././B/././././././././././././././././././././.][./././././././././././././././././././././././.]
Process OMPI jobid: [61963,1] App: 0 Process rank: 3 Bound: socket 0[core 3[hwt 0]]:[./././B/./././././././././././././././././././.][./././././././././././././././././././././././.]
=============================================================
Hello from thread 0 out of 12 from rank 2 out of 4 on node020
Hello from thread 8 out of 12 from rank 2 out of 4 on node020
Hello from thread 9 out of 12 from rank 2 out of 4 on node020
Hello from thread 0 out of 12 from rank 3 out of 4 on node020
Hello from thread 8 out of 12 from rank 3 out of 4 on node020
Hello from thread 9 out of 12 from rank 3 out of 4 on node020
Hello from thread 0 out of 12 from rank 0 out of 4 on node020
Hello from thread 8 out of 12 from rank 0 out of 4 on node020
Hello from thread 9 out of 12 from rank 0 out of 4 on node020
Hello from thread 0 out of 12 from rank 1 out of 4 on node020
Hello from thread 8 out of 12 from rank 1 out of 4 on node020
Hello from thread 9 out of 12 from rank 1 out of 4 on node020
Hello from thread 10 out of 12 from rank 0 out of 4 on node020
Hello from thread 10 out of 12 from rank 2 out of 4 on node020
Hello from thread 10 out of 12 from rank 1 out of 4 on node020
Hello from thread 10 out of 12 from rank 3 out of 4 on node020
Hello from thread 11 out of 12 from rank 2 out of 4 on node020
Hello from thread 11 out of 12 from rank 0 out of 4 on node020
Hello from thread 11 out of 12 from rank 3 out of 4 on node020
Hello from thread 11 out of 12 from rank 1 out of 4 on node020
Hello from thread 4 out of 12 from rank 2 out of 4 on node020
Hello from thread 7 out of 12 from rank 2 out of 4 on node020
Hello from thread 6 out of 12 from rank 2 out of 4 on node020
Hello from thread 4 out of 12 from rank 1 out of 4 on node020
Hello from thread 7 out of 12 from rank 1 out of 4 on node020
Hello from thread 6 out of 12 from rank 1 out of 4 on node020
Hello from thread 4 out of 12 from rank 3 out of 4 on node020
Hello from thread 7 out of 12 from rank 3 out of 4 on node020
Hello from thread 6 out of 12 from rank 3 out of 4 on node020
Hello from thread 4 out of 12 from rank 0 out of 4 on node020
Hello from thread 7 out of 12 from rank 0 out of 4 on node020
Hello from thread 6 out of 12 from rank 0 out of 4 on node020
Hello from thread 5 out of 12 from rank 0 out of 4 on node020
Hello from thread 5 out of 12 from rank 3 out of 4 on node020
Hello from thread 5 out of 12 from rank 2 out of 4 on node020
Hello from thread 5 out of 12 from rank 1 out of 4 on node020
Hello from thread 3 out of 12 from rank 3 out of 4 on node020
Hello from thread 3 out of 12 from rank 1 out of 4 on node020
Hello from thread 3 out of 12 from rank 2 out of 4 on node020
Hello from thread 3 out of 12 from rank 0 out of 4 on node020
Hello from thread 2 out of 12 from rank 0 out of 4 on node020
Hello from thread 2 out of 12 from rank 2 out of 4 on node020
Hello from thread 2 out of 12 from rank 3 out of 4 on node020
Hello from thread 2 out of 12 from rank 1 out of 4 on node020
Hello from thread 1 out of 12 from rank 2 out of 4 on node020
Hello from thread 1 out of 12 from rank 1 out of 4 on node020
Hello from thread 1 out of 12 from rank 3 out of 4 on node020
Hello from thread 1 out of 12 from rank 0 out of 4 on node020
OpenMPI - Basic Usage on Multiple Nodes
As in the single-node example above, you specify the number of threads per rank in the environment variable OMP_NUM_THREADS.
You will also decide the number of ranks per node to run, by specifying this mpirun (or $MPI_RUN) option, where N is the number of ranks per node:
--map-by ppr:N:node
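For example, to place 4 ranks on each node explicitly, the invocation would look something like this sketch:
# 4 ranks per node; --display-map prints the resulting rank placement
${MPI_RUN} --map-by ppr:4:node --display-map ./hello-hybrid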
Example: 2 nodes, 8 ranks, 12 threads per rank
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=12
...
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
${MPI_RUN} --display-map ./hello-hybrid
The output, including the rank map, should look like the following. Note that each node (node020 and node021) has 4 slots (sockets) and 4 procs (i.e. ranks) assigned to it, i.e. each socket runs one rank, as desired.
Data for JOB [64363,1] offset 0 Total slots allocated 8
======================== JOB MAP ========================
Data for node: node020 Num slots: 4 Max slots: 0 Num procs: 4
Process OMPI jobid: [64363,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0]]:[B/././././././././././././././././././././././.][./././././././././././././././././././././././.]
Process OMPI jobid: [64363,1] App: 0 Process rank: 1 Bound: socket 0[core 1[hwt 0]]:[./B/./././././././././././././././././././././.][./././././././././././././././././././././././.]
Process OMPI jobid: [64363,1] App: 0 Process rank: 2 Bound: socket 0[core 2[hwt 0]]:[././B/././././././././././././././././././././.][./././././././././././././././././././././././.]
Process OMPI jobid: [64363,1] App: 0 Process rank: 3 Bound: socket 0[core 3[hwt 0]]:[./././B/./././././././././././././././././././.][./././././././././././././././././././././././.]
Data for node: node021 Num slots: 4 Max slots: 0 Num procs: 4
Process OMPI jobid: [64363,1] App: 0 Process rank: 4 Bound: N/A
Process OMPI jobid: [64363,1] App: 0 Process rank: 5 Bound: N/A
Process OMPI jobid: [64363,1] App: 0 Process rank: 6 Bound: N/A
Process OMPI jobid: [64363,1] App: 0 Process rank: 7 Bound: N/A
=============================================================
Hello from thread 0 out of 12 from rank 2 out of 8 on node020
Hello from thread 8 out of 12 from rank 2 out of 8 on node020
Hello from thread 9 out of 12 from rank 2 out of 8 on node020
Hello from thread 0 out of 12 from rank 3 out of 8 on node020
...
Hello from thread 8 out of 12 from rank 7 out of 8 on node021
Hello from thread 9 out of 12 from rank 7 out of 8 on node021
Hello from thread 11 out of 12 from rank 0 out of 8 on node020
Hello from thread 9 out of 12 from rank 4 out of 8 on node021
Hello from thread 11 out of 12 from rank 1 out of 8 on node020
Hello from thread 11 out of 12 from rank 2 out of 8 on node020
...
Hello from thread 1 out of 12 from rank 2 out of 8 on node020
Hello from thread 1 out of 12 from rank 4 out of 8 on node021
Hello from thread 1 out of 12 from rank 5 out of 8 on node021
Hello from thread 1 out of 12 from rank 7 out of 8 on node021
Hello from thread 1 out of 12 from rank 6 out of 8 on node021
If you wish, you can compute the various arguments based on OMP_NUM_THREADS:
export OMP_NUM_THREADS=4
# number of MPI ranks = total process slots requested for the job / threads per rank
nranks=$( expr $SLURM_NPROCS / $OMP_NUM_THREADS )
# ranks per node = total ranks / number of nodes in the job
nrankspernode=$( expr $nranks / $SLURM_NNODES )
${MPI_RUN} -np ${nranks} --map-by ppr:${nrankspernode}:node --display-map ./hello-hybrid
You should not have to do this unless you would like to experiment to figure out what distribution of MPI ranks and threads gives the best performance.
Sample Benchmarks
- Here are LAMMPS benchmarks with varying numbers of threads per MPI rank: LAMMPS#Benchmark Results with Different Slot Distributions
See Also
- MPI
- Supercomputing '13 Tutorial: Hybrid MPI and OpenMP Parallel Programming
- NASA High-End Computing Capability Knowledge Base: Using Intel OpenMP Thread Affinity for Pinning
References
[1] Stanford Linear Accelerator Computing Facility documentation - Mixing MPI and OpenMP article