
Message Passing Interface

Overview

The Message Passing Interface is a standardized and portable message-passing system for parallel computation using multiple physical computers (nodes).[1][2]

There are multiple implementations of the standard, some of which are available on Proteus and Picotte.

Available Implementations

Picotte

These packages are compiler-specific:

  • Open MPI
    • GCC
      • picotte-openmpi/gcc/4.0.5
      • picotte-openmpi/gcc/4.1.0
      • picotte-openmpi/gcc/4.1.4
    • Intel ICC
      • picotte-openmpi/intel/2020/4.0.5
      • picotte-openmpi/intel/2020/4.1.0
      • picotte-openmpi/intel/2020/4.1.2
      • picotte-openmpi/intel/2020/4.1.4
    • CUDA-enabled -- first, do "module use /ifs/opt_cuda/modulefiles"
      • picotte-openmpi/cuda11.0/4.0.5 (GCC)
      • picotte-openmpi/cuda11.0/4.1.0 (GCC)
      • picotte-openmpi/cuda11.2/4.1.0 (Intel ICC)
      • picotte-openmpi/cuda11.2/4.1.4 (Intel ICC)
      • picotte-openmpi/cuda11.4/4.1.4 (GCC)

For the CUDA-enabled implementations, you can see the compiler version by doing:

[juser@gpu001 ~]$ module load picotte-openmpi/cuda11.2/4.1.4
[juser@gpu001 ~]$ mpicc --version
icc (ICC) 19.1.3.304 20200925
Copyright (C) 1985-2020 Intel Corporation.  All rights reserved.

The CUDA-enabled implementations also require loading the matching CUDA module (made available by the "module use" command above).

Picotte hardware notes

Details of all hardware are at: Picotte Hardware and Software.

All Picotte compute nodes use one of two CPU models; see Picotte Hardware and Software for specifics.

You should not need to specify NUMA layouts manually, since Open MPI uses hwloc to determine the node topology automatically.
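To see how Open MPI actually binds ranks to cores, mpirun's standard --report-bindings option prints one binding line per rank at launch. A sketch, where "hello" is a placeholder for your own MPI executable:

```shell
# Print each rank's core/socket binding to stderr at launch time.
# "hello" is a placeholder for your MPI program.
${MPI_RUN} --report-bindings -n 4 ./hello
```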

You can view the "socket, core, thread" configuration using the sinfo_detail alias from the slurm_util modulefile:

[juser@picotte001 ~]$ module load slurm_util
[juser@picotte001 ~]$ sinfo_detail -p def
NODELIST      NODES PART       STATE CPUS    S:C:T     GRES   MEMORY FREE_MEM TMP_DISK CPU_LOAD REASON
node001           1 def*       mixed   48   4:12:1   (null)   192000   140440   864000     4.08 none
node002           1 def*       mixed   48   4:12:1   (null)   192000   171040   864000     3.00 none
node003           1 def*       mixed   48   4:12:1   (null)   192000   171621   864000     3.03 none
node004           1 def*       mixed   48   4:12:1   (null)   192000   147536   864000     3.00 none
node005           1 def*       mixed   48   4:12:1   (null)   192000   162570   864000     3.00 none
node006           1 def*       mixed   48   4:12:1   (null)   192000   169135   864000     2.99 none
...
node072           1 def*       mixed   48   4:12:1   (null)   192000   157423   864000     3.00 none
node073           1 def*       mixed   48   4:12:1   (null)   192000   157114   864000     3.00 none
node074           1 def*       mixed   48   4:12:1   (null)   192000   152783   864000     3.00 none
[juser@picotte001 ~]$ sinfo_detail -p gpu
NODELIST      NODES PART       STATE CPUS    S:C:T     GRES   MEMORY FREE_MEM TMP_DISK CPU_LOAD REASON
gpu001            1 gpu        mixed   48   2:24:1 gpu:v100   192000    18191  1637000     4.69 none
gpu002            1 gpu         idle   48   2:24:1 gpu:v100   192000    94592  1637000     0.00 none
...
gpu011            1 gpu         idle   48   2:24:1 gpu:v100   192000    39289  1637000     0.01 none
gpu012            1 gpu         idle   48   2:24:1 gpu:v100   192000   142535  1637000     0.15 none
[juser@picotte001 ~]$ sinfo_detail -p bm
NODELIST      NODES PART       STATE CPUS    S:C:T     GRES   MEMORY FREE_MEM TMP_DISK CPU_LOAD REASON
bigmem001         1 bm          idle   48   2:24:1   (null)  1546000  1368526  1724000     0.00 none
bigmem002         1 bm          idle   48   2:24:1   (null)  1546000  1541778  1724000     0.00 none

The column “S:C:T” shows “Sockets”, “Cores” (per socket), and “Threads” (per core). Here, “Thread” refers to Intel’s Hyper-Threading,[3] where a single physical core is presented by the hardware as two virtual cores. This feature may increase performance in consumer applications (office software, web browsing, etc.) but tends to decrease performance in compute-intensive applications. In an HPC context, Hyper-Threading is typically disabled, so T=1.
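The CPUS column is simply the product of the three S:C:T values. For example, for the def-partition nodes above:

```shell
# S:C:T = 4:12:1 on the def nodes:
# 4 sockets x 12 cores/socket x 1 thread/core
echo $((4 * 12 * 1))   # prints 48, matching the CPUS column
```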

Open MPI

Note that Open MPI is not OpenMP.[4] OpenMP is an API for multi-platform shared-memory parallel programming in C/C++ and Fortran, i.e. single-host multithreaded programming on our compute nodes. Open MPI is an implementation of the MPI standard, which provides multi-host parallel execution. Open MPI does not use OpenMP; ranks running on the same host communicate through Open MPI's own shared-memory transport.

Common Environment Variables

Open MPI may be controlled by environment variables named OMPI_*. Please note that some of these should not be changed, because they define necessary compile-time flags and library locations.

For convenience, these environment variables are set -- actual values will vary by version loaded:

MPICC=/ifs/opt/openmpi/intel/2020/4.1.4/bin/mpicc
MPI_CPPFLAGS=-I/ifs/opt/openmpi/intel/2020/4.1.4/include
MPICXX=/ifs/opt/openmpi/intel/2020/4.1.4/bin/mpic++
MPIF77=/ifs/opt/openmpi/intel/2020/4.1.4/bin/mpif77
MPIF90=/ifs/opt/openmpi/intel/2020/4.1.4/bin/mpif90
MPIFC=/ifs/opt/openmpi/intel/2020/4.1.4/bin/mpifort
MPI_HOME=/ifs/opt/openmpi/intel/2020/4.1.4
MPI_INCDIR=/ifs/opt/openmpi/intel/2020/4.1.4/include
MPI_LIBDIR=/ifs/opt/openmpi/intel/2020/4.1.4/lib
MPI_RUN=/ifs/opt/openmpi/intel/2020/4.1.4/bin/mpirun -x LD_LIBRARY_PATH -x BASH_ENV
MPIRUN=/ifs/opt/openmpi/intel/2020/4.1.4/bin/mpirun -x LD_LIBRARY_PATH -x BASH_ENV
OMPI_CFLAGS=-fopenmp
OMPI_LDFLAGS=-L/ifs/opt/openmpi/intel/2020/4.1.4/lib -Wl,-rpath -Wl,/ifs/opt/openmpi/intel/2020/4.1.4/lib
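As a sketch, these convenience variables can be used in a compile-and-link step so that a build script does not hard-code the Open MPI installation path (hello.c is a placeholder for your own source file):

```shell
# Compile and link an MPI program using the variables set by the module,
# rather than hard-coding /ifs/opt/openmpi/... paths.
${MPICC} ${MPI_CPPFLAGS} ${OMPI_LDFLAGS} -o hello hello.c
```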

Running

Invocation of mpirun is the same as for other MPI implementations. Using the full path to the mpirun command, given by the MPI_RUN environment variable, is recommended:

     ${MPI_RUN} myprogram --opt optval

The MPI_RUN environment variable already includes the options "-x LD_LIBRARY_PATH -x BASH_ENV", which export those environment variables to all MPI processes, so you do not need to set them manually.

Example: a job requesting 2 nodes, with 48 MPI ranks per node:

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=48
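Putting the pieces together, a minimal job script might look like the following sketch; the module version, walltime, and program name are placeholders to adapt:

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=48
#SBATCH --time=01:00:00

# Load the same MPI module that was used to build the program
# (version shown here is a placeholder).
module load picotte-openmpi/gcc/4.1.4

# MPI_RUN already exports LD_LIBRARY_PATH and BASH_ENV, and Open MPI reads
# the Slurm allocation, so no explicit -n is needed.
${MPI_RUN} ./myprogram --opt optval
```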

See sbatch documentation[5] (or man page on picotte001) for more detailed information.

Performance Differences

Performance differences tend to be very application-specific. However, experience indicates that Intel MPI tends to perform better on Intel CPUs.[6]

Hybrid MPI-OpenMP Jobs

Please see Hybrid MPI-OpenMP Jobs.

MPI for Python

MPI for Python (a.k.a. mpi4py) is a Python module providing Python bindings for MPI.[7]

Please see: MPI for Python
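Launching an mpi4py script is the same as launching any other MPI program; each Python process becomes one MPI rank. A sketch, where my_script.py is a placeholder:

```shell
# Each python process started by mpirun becomes one MPI rank.
${MPI_RUN} python my_script.py
```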

OBSOLETE

Parallel Environment (PE)

OBSOLETE: Slurm does not use parallel environments. The following applies only to the old Grid Engine scheduler.

Use one of:

  • openmpi_shm -- single node; this replaces "shm" for OpenMPI jobs
  • openmpi_ib -- multiple nodes, CPU cores are assigned in "fill_up" order
  • openmpi_ib_rr -- multiple nodes, CPU cores are assigned in "round_robin" order

Other PEs may be used for a specific distribution of slots, e.g. for Hybrid MPI-OpenMP Jobs. The Open MPI versions installed locally are integrated with Grid Engine, and read the environment to determine how job slots have been distributed across nodes.

Common Environment Variables

Open MPI may be controlled by environment variables named OMPI_*. Please note that some of these should not be changed, because they define necessary compile-time flags and library locations.

For convenience, these environment variables are set:

MPICC    = mpicc
MPICXX   = mpic++
MPIF77   = mpif77
MPIF90   = mpif90
MPIFC    = mpif90
MPI_HOME = base directory of the Open MPI installation

Running under Grid Engine

Invocation of mpirun is the same as for other MPI implementations. Using the full path to the mpirun command, given by the MPI_RUN environment variable, is recommended:

${MPI_RUN} myprogram --opt optval

Note that Open MPI on Proteus is integrated with Grid Engine; when using the proteus-openmpi modules, the "-n $NSLOTS" option may be left out.

Open MPI will automatically select the fastest network fabric available, which on Proteus is InfiniBand. You may, if you wish, specify the Byte Transfer Layers (BTLs, i.e. point-to-point bytewise communication methods) in order of decreasing preference:

${MPI_RUN} -mca btl openib,tcp,self myprogram --opt optval

Please see the OpenMPI FAQ[8] for more details.

You may see all available MCA parameters, including a listing of all available BTLs:

ompi_info -mca btl list

If your program generates errors referring to LIBC or LIBCXX, you will need to make sure the LD_LIBRARY_PATH environment variable is exported to all child processes:

mpirun -x LD_LIBRARY_PATH myprogram

The MPI_RUN environment variable is defined to be "mpirun -x LD_LIBRARY_PATH" so it should just work.

Please see the man page for mpirun for many more options.

For running jobs using > 128 slots, an OpenMPI system parameter may need to be modified. Add the following to the ${MPI_RUN} options:

-mca plm_rsh_num_concurrent 256

MXM warnings about ulimit

The low-level Mellanox MXM communications library may warn about the stacksize limit:

mxm.c:196  MXM  WARN  The 'ulimit -s' on the system is set to 'unlimited'. This may have negative performance implications. Please set the stack size to the default value (10240)

If that is the case, modify the MPI_RUN line:

${MPI_RUN} -x LD_LIBRARY_PATH bash -c "ulimit -s 10240 && myprogram"
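The "bash -c" wrapper changes the stack limit only for the child shell that runs your program; the calling shell's limit is untouched. You can check the pattern locally, without MPI:

```shell
# The soft stack limit becomes 10240 kB only inside the child shell.
bash -c 'ulimit -s 10240 && ulimit -s'
# Compare with the current shell's limit, which is unchanged:
ulimit -s
```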

Core Binding

Core binding should be left to Grid Engine, which is aware of the usage by other jobs. This is done with the "-binding" option for qsub. See the binding options for qsub in the man page. There is brief information in the UGE User Guide.

When the "-binding" option is given to qsub, the PE_HOSTFILE will contain the core binding topology, which is then read by mpirun. WARNING: do not use "explicit" binding: your job should rely on Grid Engine to select free processor cores, rather than picking a specific core in a specific socket.

Example:

#$ -pe openmpi_ib 32
#$ -binding pe striding_automatic:32:1

MVAPICH2

Common Environment Variables

MPI_HOME     = base directory of the MVAPICH2 installation
MPI_INCDIR   = $MPI_HOME/include
MPI_CPPFLAGS = -I$MPI_INCDIR
MPI_LIBDIR   = $MPI_HOME/lib
MPI_LDFLAGS  = -L$MPI_LIBDIR
MPI_LIBS     = -lmpich -lmpl -libmad -libumad -libverbs -ldl -lrt -lnuma -lrdmacm -lm -lpthread
MPICC        = mpicc
MPICXX       = mpic++
MPIFC        = mpif90
MPIF90       = mpif90
MPIF77       = mpif77

GCC

Module to use:

proteus-mvapich2/gcc/64/1.9-mlnx-ofed

Intel

Module to use:

proteus-mvapich2/intel/64/1.9-mlnx-ofed

AMD

Module to use:

proteus-mvapich2/open64

Compiling and Linking

Both dynamic and static libraries are available. Example makefile snippet:

CPPFLAGS = $(MPI_CPPFLAGS) -I/other/incdir
LDFLAGS  = $(MPI_LDFLAGS) -L/other/libdir
LIBS     = -lotherlib $(MPI_LIBS) -lm

foo: foo.o
	$(MPICC) $(CPPFLAGS) -o $@ $^ $(LDFLAGS) $(LIBS)

Use of dynamic libraries may mean adding an rpath.

Running

Within a job script:

mpirun myprogram.exe

Normally, mpirun will detect what scheduler and resource manager it is running under, so "-rmk sge" need not be used. The option "-rmk sge" tells mpirun to read the Grid Engine environment for the hosts file. There are many other options: use the "--help" option, and see http://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager

NB mpirun is actually an alias to mpiexec. You can also use the MPI_RUN environment variable to be sure to get the right version:

$MPI_RUN myprogram.exe

mpi4py

If you are using the Python package, mpi4py, you may need to add an option to mpirun, e.g.

$MPI_RUN -genv LD_PRELOAD ${MPI_HOME}/lib/libmpi.so python my_python_mpi_script.py

Local Documentation

$MPI_HOME/share/doc/mvapich2

Also available via "man" command.

Intel MPI

NOTE: We do not have a current license for this as of 2015-01-01.

This is for Intel CPUs only.

Running

Use one of these PEs:

  • intelmpi
  • intelmpi_rr

These PEs have a pre-processing step which informs Intel MPI about the distribution of slots and nodes.

Number of processes need not be specified for mpirun, but use "-rmk sge" to specify Grid Engine integration:

mpirun -rmk sge my_program arg1 arg2

Parallel Environments for Jobs

For most parallel jobs, it is probably more efficient to use multiples of whole nodes. To do that use either the "fixed16" or "fixed64" parallel environments. While this may mean waiting for full nodes to become free, the jobs would probably run faster than one where the slot allocations are spread widely. Please see the article on Writing Job Scripts for details on appropriate parallel environments.

References



[1] The Message Passing Interface (MPI) official website

[2] Message Passing Interface Wikipedia article

[3] Intel® Hyper-Threading Technology

[4] OpenMP official website

[5] Slurm 21.08.8 Documentation - sbatch

[6] StackOverflow: Will mvapich be substantially better than openmpi? And How?

[7] MPI for Python Bitbucket project page

[8] OpenMPI FAQ - Tuning: Selecting Components