Hybrid MPI-OpenMP Jobs
For performance improvements, some codes hybridize MPI with OpenMP. For further performance improvements, core binding may be used. Core binding is the explicit assignment of a processor core to an OpenMP thread. Still more improvements may be made by taking advantage of Non-Uniform Memory Access (NUMA), i.e. ensuring that the memory used by each processor core is the memory that is physically closest to it.
On Picotte, integration of all the above is directly supported by Slurm + OpenMPI + OpenMP.
Basics
Some of the information here is already in the MPI article, but is reproduced here for convenience.
Terminology
- An MPI job runs some number of processes, which are called ranks. In Slurm terminology, each rank is a task. (N.B. Slurm also uses the term “task” to refer to the individual “sub-jobs” of an array job.) Each rank is a separate process: if you run the ps command, or the top command, you will see an individual entry for each rank, all with the same name, which is the name of the program you run. (MPI allows these ranks to communicate with each other.) Each rank will use at most 100% CPU, i.e. complete utilization of a single CPU core; this is what will be shown by top.
- An OpenMP job runs a single process, but uses multiple threads, i.e. there is only a single entry for the program in the output of ps or top. However, because this process is multithreaded, the utilization shown by top will be greater than 100%: in fact, it should be about the number of threads multiplied by 100%.
- A hybrid job will run fewer ranks, but each rank will run multithreaded. In principle, this reduces inter-process communication and replaces it with "shared memory" communication that requires less overhead. In practice, the performance will depend greatly on your specific application.
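A quick way to tell these cases apart on a compute node is to look at the per-process thread count. Here is a rough sketch (prog_name is a placeholder for your program's name):
# One line per process you own, with its thread count (NLWP) and CPU utilization.
# An MPI-only job shows many prog_name entries, each with NLWP near 1; a hybrid
# job shows fewer entries, each with NLWP roughly equal to OMP_NUM_THREADS.
ps -u $USER -o pid,nlwp,pcpu,comm | grep prog_name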
Basic MPI Job Configuration
As a start, MPI jobs should assign:
- 1 rank per socket
- 1 thread per CPU core
This can be changed manually, by passing appropriate values as arguments to $MPI_RUN. One may want to do this to find the best rank/thread distribution.
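For example, here is a rough sketch using OpenMPI's --map-by and --bind-to options (./prog is a placeholder; option syntax can vary between OpenMPI versions):
# 1 rank per socket, each rank bound to its socket so its OpenMP threads
# can use that socket's cores (here, 12 threads per rank)
export OMP_NUM_THREADS=12
${MPI_RUN} --map-by ppr:1:socket --bind-to socket ./prog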
Example Program
Here is a simple hybrid program[1]:
#include <stdio.h>
#include "mpi.h"
#include <omp.h>

int main(int argc, char *argv[]) {
    int numprocs, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int iam = 0, np = 1;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);

    /* Each OpenMP thread in each MPI rank reports its thread number,
     * the rank it belongs to, and the node it is running on. */
    #pragma omp parallel default(shared) private(iam, np)
    {
        np = omp_get_num_threads();
        iam = omp_get_thread_num();
        printf("Hello from thread %d out of %d from rank %d out of %d on %s\n",
               iam, np, rank, numprocs, processor_name);
    }

    MPI_Finalize();
    return 0;
}
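To build the example, something like the following should work once an MPI module is loaded (a sketch; it assumes a GCC-based OpenMPI toolchain, where the mpicc wrapper accepts -fopenmp):
# compile with both MPI (via the mpicc wrapper) and OpenMP (-fopenmp) support
mpicc -fopenmp -O2 -o hello-hybrid hello-hybrid.c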
Hardware Layout on Picotte
See: Message Passing Interface#Picotte hardware notes
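You can also inspect a compute node's socket and NUMA layout directly with lscpu (a sketch; run it on a compute node, e.g. in an interactive job):
# show sockets, cores per socket, threads per core, and NUMA node layout
lscpu | grep -E 'Socket|Core\(s\) per socket|Thread\(s\) per core|NUMA'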
Slurm on Picotte
Since Picotte uses Slurm, you can use the environment variable SLURM_CPUS_PER_TASK to set OMP_NUM_THREADS.
For example, say you would like to run on all 96 cores of two nodes (each node has 48 cores), with 12 threads per MPI rank (each thread assigned to its own CPU core). Your job script would look something like:
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=12
#SBATCH --mem-per-cpu=3G
...
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
${MPI_RUN} prog_name
If you add the --display-map option to ${MPI_RUN}, it will output the mapping of ranks to CPU cores so that you can confirm it is the configuration you expect.
OpenMPI - Basic Usage on a Single Node
N.B. You must experiment to figure out the distribution of MPI ranks and OMP threads which gives best performance.
You can specify the number of threads per rank by setting the environment variable OMP_NUM_THREADS. You can write some bash to compute one value from the other, if you wish.
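For instance, a small sketch that falls back to a single thread if --cpus-per-task was not set:
# use Slurm's per-task CPU count if set; otherwise default to 1 thread per rank
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}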
Example: 1 node, 4 ranks, 12 threads per rank
First, note the “S:C:T” (Socket:Core:Thread) value reported by sinfo_detail:
[juser@picotte001 ~]$ module load slurm_util
[juser@picotte001 ~]$ sinfo_detail
NODELIST NODES PART STATE CPUS S:C:T GRES MEMORY FREE_MEM TMP_DISK CPU_LOAD REASON
node001 1 def* mixed 48 4:12:1 (null) 192000 140095 864000 4.02 none
node002 1 def* mixed 48 4:12:1 (null) 192000 168169 864000 4.00 none
node003 1 def* mixed 48 4:12:1 (null) 192000 166673 864000 4.00 none
...
We assign:
- 1 rank (task) per socket
- 1 thread per core (no Hyper-Threading)
We want 4 ranks per node, 12 OMP threads per rank; since this job uses a single node, the whole job will have 4 ranks, each rank running 12 OMP threads.
#!/bin/bash
#SBATCH --partition=def
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=12
...
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
${MPI_RUN} --display-map ./hello-hybrid
The “--display-map” option to ${MPI_RUN} will print out a list of all nodes with their assigned ranks.
The output from this program will look like:
Data for JOB [61963,1] offset 0 Total slots allocated 4
======================== JOB MAP ========================
Data for node: node020 Num slots: 4 Max slots: 0 Num procs: 4
Process OMPI jobid: [61963,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0]]:[B/././././././././././././././././././././././.][./././././././././././././././././././././././.]
Process OMPI jobid: [61963,1] App: 0 Process rank: 1 Bound: socket 0[core 1[hwt 0]]:[./B/./././././././././././././././././././././.][./././././././././././././././././././././././.]
Process OMPI jobid: [61963,1] App: 0 Process rank: 2 Bound: socket 0[core 2[hwt 0]]:[././B/././././././././././././././././././././.][./././././././././././././././././././././././.]
Process OMPI jobid: [61963,1] App: 0 Process rank: 3 Bound: socket 0[core 3[hwt 0]]:[./././B/./././././././././././././././././././.][./././././././././././././././././././././././.]
=============================================================
Hello from thread 0 out of 12 from rank 2 out of 4 on node020
Hello from thread 8 out of 12 from rank 2 out of 4 on node020
Hello from thread 9 out of 12 from rank 2 out of 4 on node020
Hello from thread 0 out of 12 from rank 3 out of 4 on node020
Hello from thread 8 out of 12 from rank 3 out of 4 on node020
Hello from thread 9 out of 12 from rank 3 out of 4 on node020
Hello from thread 0 out of 12 from rank 0 out of 4 on node020
Hello from thread 8 out of 12 from rank 0 out of 4 on node020
Hello from thread 9 out of 12 from rank 0 out of 4 on node020
Hello from thread 0 out of 12 from rank 1 out of 4 on node020
Hello from thread 8 out of 12 from rank 1 out of 4 on node020
Hello from thread 9 out of 12 from rank 1 out of 4 on node020
Hello from thread 10 out of 12 from rank 0 out of 4 on node020
Hello from thread 10 out of 12 from rank 2 out of 4 on node020
Hello from thread 10 out of 12 from rank 1 out of 4 on node020
Hello from thread 10 out of 12 from rank 3 out of 4 on node020
Hello from thread 11 out of 12 from rank 2 out of 4 on node020
Hello from thread 11 out of 12 from rank 0 out of 4 on node020
Hello from thread 11 out of 12 from rank 3 out of 4 on node020
Hello from thread 11 out of 12 from rank 1 out of 4 on node020
Hello from thread 4 out of 12 from rank 2 out of 4 on node020
Hello from thread 7 out of 12 from rank 2 out of 4 on node020
Hello from thread 6 out of 12 from rank 2 out of 4 on node020
Hello from thread 4 out of 12 from rank 1 out of 4 on node020
Hello from thread 7 out of 12 from rank 1 out of 4 on node020
Hello from thread 6 out of 12 from rank 1 out of 4 on node020
Hello from thread 4 out of 12 from rank 3 out of 4 on node020
Hello from thread 7 out of 12 from rank 3 out of 4 on node020
Hello from thread 6 out of 12 from rank 3 out of 4 on node020
Hello from thread 4 out of 12 from rank 0 out of 4 on node020
Hello from thread 7 out of 12 from rank 0 out of 4 on node020
Hello from thread 6 out of 12 from rank 0 out of 4 on node020
Hello from thread 5 out of 12 from rank 0 out of 4 on node020
Hello from thread 5 out of 12 from rank 3 out of 4 on node020
Hello from thread 5 out of 12 from rank 2 out of 4 on node020
Hello from thread 5 out of 12 from rank 1 out of 4 on node020
Hello from thread 3 out of 12 from rank 3 out of 4 on node020
Hello from thread 3 out of 12 from rank 1 out of 4 on node020
Hello from thread 3 out of 12 from rank 2 out of 4 on node020
Hello from thread 3 out of 12 from rank 0 out of 4 on node020
Hello from thread 2 out of 12 from rank 0 out of 4 on node020
Hello from thread 2 out of 12 from rank 2 out of 4 on node020
Hello from thread 2 out of 12 from rank 3 out of 4 on node020
Hello from thread 2 out of 12 from rank 1 out of 4 on node020
Hello from thread 1 out of 12 from rank 2 out of 4 on node020
Hello from thread 1 out of 12 from rank 1 out of 4 on node020
Hello from thread 1 out of 12 from rank 3 out of 4 on node020
Hello from thread 1 out of 12 from rank 0 out of 4 on node020
OpenMPI - Basic Usage on Multiple Nodes
As in the single-node example above, you specify the number of threads per rank in the environment variable OMP_NUM_THREADS.
You will also decide the number of ranks per node to run, by specifying this mpirun (or $MPI_RUN) option, where N is the number of ranks per node:
--map-by ppr:N:node
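For example, to place 4 ranks on each node explicitly, the invocation would look something like this sketch:
# 4 ranks per node; --display-map prints the resulting rank placement
${MPI_RUN} --map-by ppr:4:node --display-map ./hello-hybrid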
Example: 2 nodes, 8 ranks, 12 threads per rank
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=12
...
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
${MPI_RUN} --display-map ./hello-hybrid
The output, including the rank map, should look like the following. Note that each node (node020 and node021) has 4 slots (sockets) and 4 procs (i.e. ranks) assigned to it, i.e. each socket runs one rank, as desired.
Data for JOB [64363,1] offset 0 Total slots allocated 8
======================== JOB MAP ========================
Data for node: node020 Num slots: 4 Max slots: 0 Num procs: 4
Process OMPI jobid: [64363,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0]]:[B/././././././././././././././././././././././.][./././././././././././././././././././././././.]
Process OMPI jobid: [64363,1] App: 0 Process rank: 1 Bound: socket 0[core 1[hwt 0]]:[./B/./././././././././././././././././././././.][./././././././././././././././././././././././.]
Process OMPI jobid: [64363,1] App: 0 Process rank: 2 Bound: socket 0[core 2[hwt 0]]:[././B/././././././././././././././././././././.][./././././././././././././././././././././././.]
Process OMPI jobid: [64363,1] App: 0 Process rank: 3 Bound: socket 0[core 3[hwt 0]]:[./././B/./././././././././././././././././././.][./././././././././././././././././././././././.]
Data for node: node021 Num slots: 4 Max slots: 0 Num procs: 4
Process OMPI jobid: [64363,1] App: 0 Process rank: 4 Bound: N/A
Process OMPI jobid: [64363,1] App: 0 Process rank: 5 Bound: N/A
Process OMPI jobid: [64363,1] App: 0 Process rank: 6 Bound: N/A
Process OMPI jobid: [64363,1] App: 0 Process rank: 7 Bound: N/A
=============================================================
Hello from thread 0 out of 12 from rank 2 out of 8 on node020
Hello from thread 8 out of 12 from rank 2 out of 8 on node020
Hello from thread 9 out of 12 from rank 2 out of 8 on node020
Hello from thread 0 out of 12 from rank 3 out of 8 on node020
...
Hello from thread 8 out of 12 from rank 7 out of 8 on node021
Hello from thread 9 out of 12 from rank 7 out of 8 on node021
Hello from thread 11 out of 12 from rank 0 out of 8 on node020
Hello from thread 9 out of 12 from rank 4 out of 8 on node021
Hello from thread 11 out of 12 from rank 1 out of 8 on node020
Hello from thread 11 out of 12 from rank 2 out of 8 on node020
...
Hello from thread 1 out of 12 from rank 2 out of 8 on node020
Hello from thread 1 out of 12 from rank 4 out of 8 on node021
Hello from thread 1 out of 12 from rank 5 out of 8 on node021
Hello from thread 1 out of 12 from rank 7 out of 8 on node021
Hello from thread 1 out of 12 from rank 6 out of 8 on node021
If you wish, you can compute the various arguments based on OMP_NUM_THREADS:
export OMP_NUM_THREADS=4
# number of MPI ranks = total process slots requested for the job / threads per rank
nranks=$( expr $SLURM_NPROCS / $OMP_NUM_THREADS )
# ranks per node = total ranks / number of nodes in the job
nrankspernode=$( expr $nranks / $SLURM_NNODES )
${MPI_RUN} -np ${nranks} --map-by ppr:${nrankspernode}:node --display-map ./hello-hybrid
You should not have to do this unless you would like to experiment to figure out what distribution of MPI ranks and threads gives the best performance.
Sample Benchmarks
- Here are LAMMPS benchmarks with varying numbers of threads per MPI rank: LAMMPS#Benchmark Results with Different Slot Distributions
See Also
- MPI
- Supercomputing '13 Tutorial: Hybrid MPI and OpenMP Parallel Programming
- NASA High-End Computing Capability Knowledge Base: Using Intel OpenMP Thread Affinity for Pinning
References
[1] Stanford Linear Accelerator Computing Facility documentation - Mixing MPI and OpenMP article