Message Passing Interface
Overview
The Message Passing Interface is a standardized and portable message-passing system for parallel computation using multiple physical computers (nodes).[1][2]
There are multiple implementations of the standard, some of which are available on Picotte.
Available Implementations
Picotte
These packages are compiler-specific:
- Open MPI
  - GCC
    - picotte-openmpi/gcc/4.0.5
    - picotte-openmpi/gcc/4.1.0
    - picotte-openmpi/gcc/4.1.4
  - Intel ICC
    - picotte-openmpi/intel/2020/4.0.5
    - picotte-openmpi/intel/2020/4.1.0
    - picotte-openmpi/intel/2020/4.1.2
    - picotte-openmpi/intel/2020/4.1.4
  - CUDA-enabled -- first, do "module use /ifs/opt_cuda/modulefiles"
    - picotte-openmpi/cuda11.0/4.0.5 (uses GCC)
    - picotte-openmpi/cuda11.0/4.1.0 (uses GCC)
    - picotte-openmpi/cuda11.2/4.1.0 (uses Intel ICC)
    - picotte-openmpi/cuda11.2/4.1.4 (uses Intel ICC)
    - picotte-openmpi/cuda11.4/4.1.4 (uses GCC)
For the CUDA-enabled implementations, you can see the compiler version by doing:
[juser@gpu001 ~]$ module load picotte-openmpi/cuda11.4/4.1.4
[juser@gpu001 ~]$ mpicc --version
icc (ICC) 19.1.3.304 20200925
Copyright (C) 1985-2020 Intel Corporation. All rights reserved.
The CUDA-enabled implementations also require loading a matching CUDA modulefile, which is made available by the "module use" command above.
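For example, to use the CUDA 11.4 build listed above (a minimal sketch; pick the version you actually need):
[juser@gpu001 ~]$ module use /ifs/opt_cuda/modulefiles
[juser@gpu001 ~]$ module load picotte-openmpi/cuda11.4/4.1.4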
Picotte hardware notes
Details of all hardware are at: Picotte Hardware and Software.
All Picotte compute nodes use one of these two CPUs:
- Intel Xeon Platinum 8268 (def and bm partitions)
- Intel Xeon Platinum 8260 (gpu partition)
You should not need to specify NUMA layouts manually, since Open MPI uses hwloc to determine them.
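If you want to verify how ranks end up bound to cores, mpirun can report the bindings it applies. A quick sketch, run from inside a job allocation (myprogram is a placeholder):
${MPI_RUN} --report-bindings -n 4 ./myprogram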
You can view the "socket, core, thread" configuration using the sinfo_detail alias from the slurm_util modulefile:
[juser@picotte001 ~]$ module load slurm_util
[juser@picotte001 ~]$ sinfo_detail -p def
NODELIST NODES PART STATE CPUS S:C:T GRES MEMORY FREE_MEM TMP_DISK CPU_LOAD REASON
node001 1 def* mixed 48 4:12:1 (null) 192000 140440 864000 4.08 none
node002 1 def* mixed 48 4:12:1 (null) 192000 171040 864000 3.00 none
node003 1 def* mixed 48 4:12:1 (null) 192000 171621 864000 3.03 none
node004 1 def* mixed 48 4:12:1 (null) 192000 147536 864000 3.00 none
node005 1 def* mixed 48 4:12:1 (null) 192000 162570 864000 3.00 none
node006 1 def* mixed 48 4:12:1 (null) 192000 169135 864000 2.99 none
...
node072 1 def* mixed 48 4:12:1 (null) 192000 157423 864000 3.00 none
node073 1 def* mixed 48 4:12:1 (null) 192000 157114 864000 3.00 none
node074 1 def* mixed 48 4:12:1 (null) 192000 152783 864000 3.00 none
[juser@picotte001 ~]$ sinfo_detail -p gpu
NODELIST NODES PART STATE CPUS S:C:T GRES MEMORY FREE_MEM TMP_DISK CPU_LOAD REASON
gpu001 1 gpu mixed 48 2:24:1 gpu:v100 192000 18191 1637000 4.69 none
gpu002 1 gpu idle 48 2:24:1 gpu:v100 192000 94592 1637000 0.00 none
...
gpu011 1 gpu idle 48 2:24:1 gpu:v100 192000 39289 1637000 0.01 none
gpu012 1 gpu idle 48 2:24:1 gpu:v100 192000 142535 1637000 0.15 none
[juser@picotte001 ~]$ sinfo_detail -p bm
NODELIST NODES PART STATE CPUS S:C:T GRES MEMORY FREE_MEM TMP_DISK CPU_LOAD REASON
bigmem001 1 bm idle 48 2:24:1 (null) 1546000 1368526 1724000 0.00 none
bigmem002 1 bm idle 48 2:24:1 (null) 1546000 1541778 1724000 0.00 none
The column “S:C:T” shows “Socket”, “Core”, and “Thread”. Here, “Thread” refers to Intel’s Hyper-Threading,[3] where a single physical core is presented by the hardware as two virtual cores. This feature may increase performance in consumer applications (office software, web browsing, etc.) but will decrease performance in compute-intensive applications. In an HPC context, Hyper-Threading is typically turned off, and it is off on Picotte, so T=1.
Open MPI
Note that Open MPI is not OpenMP.[4] OpenMP is an API for multi-platform shared-memory parallel programming in C/C++ and Fortran, i.e. single-host multithreaded programming on our compute nodes. Open MPI is an implementation of the MPI standard, which provides multi-host parallel execution; within a single host, it communicates between ranks over shared memory rather than via OpenMP. The two can be combined in hybrid MPI-OpenMP applications (see below).
Common Environment Variables
Open MPI may be controlled by environment variables named OMPI_*. Please note that some of these should not be changed, because they define necessary compile-time flags and library locations.
For convenience, these environment variables are set -- actual values will vary by version loaded:
MPICC=/ifs/opt/openmpi/intel/2020/4.1.4/bin/mpicc
MPI_CPPFLAGS=-I/ifs/opt/openmpi/intel/2020/4.1.4/include
MPICXX=/ifs/opt/openmpi/intel/2020/4.1.4/bin/mpic++
MPIF77=/ifs/opt/openmpi/intel/2020/4.1.4/bin/mpif77
MPIF90=/ifs/opt/openmpi/intel/2020/4.1.4/bin/mpif90
MPIFC=/ifs/opt/openmpi/intel/2020/4.1.4/bin/mpifort
MPI_HOME=/ifs/opt/openmpi/intel/2020/4.1.4
MPI_INCDIR=/ifs/opt/openmpi/intel/2020/4.1.4/include
MPI_LIBDIR=/ifs/opt/openmpi/intel/2020/4.1.4/lib
MPI_RUN=/ifs/opt/openmpi/intel/2020/4.1.4/bin/mpirun -x LD_LIBRARY_PATH -x BASH_ENV
MPIRUN=/ifs/opt/openmpi/intel/2020/4.1.4/bin/mpirun -x LD_LIBRARY_PATH -x BASH_ENV
OMPI_CFLAGS=-fopenmp
OMPI_LDFLAGS=-L/ifs/opt/openmpi/intel/2020/4.1.4/lib -Wl,-rpath -Wl,/ifs/opt/openmpi/intel/2020/4.1.4/lib
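These variables can be used directly when compiling and linking. A minimal sketch, assuming a source file named hello_mpi.c (a placeholder):
[juser@picotte001 ~]$ module load picotte-openmpi/intel/2020/4.1.4
[juser@picotte001 ~]$ ${MPICC} -o hello_mpi hello_mpi.c
The mpicc wrapper already adds the MPI include and library flags, so MPI_CPPFLAGS and OMPI_LDFLAGS are mainly useful when a build system calls the underlying compiler directly.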
Running
Invocation of mpirun is the same as with the other MPI implementations. Using the full path to the "mpirun" command is recommended; it is given by the MPI_RUN environment variable:
${MPI_RUN} myprogram --opt optval
The MPI_RUN environment variable also sets some common command-line options to export environment variables (-x LD_LIBRARY_PATH -x BASH_ENV), so you do not need to set them manually.
Example: job requests 2 nodes, 48 MPI ranks per node:
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=48
See sbatch
documentation[5] (or man page on picotte001) for more
detailed information.
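Putting it together, a minimal multi-node job script might look like the following sketch (the partition, time limit, module version, and program name are examples only; adjust them for your job):
#!/bin/bash
#SBATCH --partition=def
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=48
#SBATCH --time=01:00:00

module load picotte-openmpi/intel/2020/4.1.4

${MPI_RUN} ./myprogram --opt optval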
Performance Differences
Performance differences tend to be very application-specific. However, there is some experience which indicates that on Intel CPUs, Intel MPI tends to perform better.[6]
Hybrid MPI-OpenMP Jobs
Please see Hybrid MPI-OpenMP Jobs.
MPI for Python
MPI for Python (a.k.a. mpi4py) is a Python module which takes advantage of MPI.[7]
Please see: MPI for Python
OBSOLETE
Parallel Environment (PE)
OBSOLETE: Slurm does not use parallel environments.
Use one of:
- openmpi_shm -- single node; this replaces "shm" for OpenMPI jobs
- openmpi_ib -- multiple nodes, CPU cores are assigned in "fill_up" order
- openmpi_ib_rr -- multiple nodes, CPU cores are assigned in "round_robin" order
Other PEs may be used for a specific distribution of slots, e.g. if you want to run Hybrid MPI-OpenMP Jobs. The Open MPI versions installed locally are integrated with Grid Engine, and can read the environment to determine how job slots have been distributed across nodes.
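For example, a job script might request the multi-node PE and launch with the full-path mpirun (64 is just an illustrative slot count; myprogram is a placeholder):
#$ -pe openmpi_ib 64
${MPI_RUN} myprogram --opt optval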
Common Environment Variables
Open MPI may be controlled by environment variables named OMPI_*. Please note that some of these should not be changed, because they define necessary compile-time flags and library locations.
For convenience, these environment variables are set:
MPICC = mpicc
MPICXX = mpic++
MPIF77 = mpif77
MPIF90 = mpif90
MPIFC = mpif90
MPI_HOME = base directory of OpenMPI installation
Running under Grid Engine
Invocation of mpirun is the same as with the other MPI implementations. Using the full path to the "mpirun" command is recommended; it is given by the MPI_RUN environment variable:
${MPI_RUN} myprogram --opt optval
Note that Open MPI on Proteus is integrated with Grid Engine; when using the proteus-openmpi modules, the "-n $NSLOTS" option may be left out.
OpenMPI will automatically select the fastest network fabric available, which will be InfiniBand in the case of Proteus. You may, if you wish, specify the specific Byte Transfer Layer (BTL, i.e. point-to-point bytewise communication method) in order of decreasing preference:
${MPI_RUN} -mca btl openib,tcp,self myprogram --opt optval
Please see the OpenMPI FAQ[8] for more details.
You may see all available MCA parameters, including a listing of all available BTLs:
ompi_info -mca btl list
If your program generates errors referring to LIBC or LIBCXX, you will need to make sure the LD_LIBRARY_PATH environment variable is exported to all child processes:
mpirun -x LD_LIBRARY_PATH myprogram
The MPI_RUN environment variable is defined to be "mpirun -x LD_LIBRARY_PATH", so it should just work.
Please see the man page for mpirun for many more options.
For running jobs using > 128 slots, an OpenMPI system parameter may need to be modified. Add the following to the ${MPI_RUN} options:
-mca plm_rsh_num_concurrent 256
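In context, the full invocation would look something like this (myprogram is a placeholder):
${MPI_RUN} -mca plm_rsh_num_concurrent 256 myprogram --opt optval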
MXM warnings about ulimit
The low-level Mellanox MXM communications library may warn about the stacksize limit:
mxm.c:196 MXM WARN The 'ulimit -s' on the system is set to 'unlimited'. This may have negative performance implications. Please set the stack size to the default value (10240)
If that is the case, modify the MPI_RUN line:
${MPI_RUN} -x LD_LIBRARY_PATH bash -c "ulimit -s 10240 && myprogram"
Core Binding
Core binding should be left to Grid Engine, which is aware of the usage
by other jobs. This is done with the "-binding" option for qsub. See the
binding options for qsub
in the man page. There is brief information
in the UGE User Guide.
When the "-binding" option is given to qsub, the PE_HOSTFILE will contain the core binding topology, which is then read by mpirun. WARNING: do not use "explicit" binding: your job should rely on Grid Engine to select free processor cores, rather than picking a specific core in a specific socket.
Example:
#$ -pe openmpi_ib 32
#$ -binding pe striding_automatic:32:1
MVAPICH2
Common Environment Variables
MPI_HOME
MPI_INCDIR = $MPI_HOME/include
MPI_CPPFLAGS = -I$MPI_INCDIR
MPI_LIBDIR = $MPI_HOME/lib
MPI_LDFLAGS = -L$MPI_LIBDIR
MPI_LIBS = -lmpich -lmpl -libmad -libumad -libverbs -ldl -lrt -lnuma -lrdmacm -lm -lpthread
MPICC = mpicc
MPICXX = mpic++
MPIFC = mpif90
MPIF90 = mpif90
MPIF77 = mpif77
GCC
Module to use:
proteus-mvapich2/gcc/64/1.9-mlnx-ofed
Intel
Module to use:
proteus-mvapich2/intel/64/1.9-mlnx-ofed
AMD
Module to use:
proteus-mvapich2/open64
Compiling and Linking
Both dynamic and static libraries are available. Example makefile snippet:
CPPFLAGS = $(MPI_CPPFLAGS) -I/other/incdir
LDFLAGS = $(MPI_LDFLAGS) -L/other/libdir
LIBS = -lotherlib $(MPI_LIBS) -lm
foo: foo.o
	$(MPICC) $(CPPFLAGS) -o $@ $^ $(LDFLAGS) $(LIBS)
Use of dynamic libraries may mean adding an rpath.
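For example, an rpath can be added at link time roughly like this (a sketch using the environment variables listed above; mpicc already supplies the MPI flags, so the explicit ones here are illustrative):
${MPICC} -o foo foo.o ${MPI_LDFLAGS} -Wl,-rpath,${MPI_LIBDIR} ${MPI_LIBS}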
Running
Within a job script:
mpirun myprogram.exe
Normally, mpirun will detect what scheduler and resource manager it is running under, so "-rmk sge" need not be used. The "-rmk sge" option tells mpirun to read the Grid Engine environment for the hosts file.
There are many other options: use the "--help" option, and see
http://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager
NB: mpirun is actually an alias for mpiexec. You can also use the MPI_RUN environment variable to be sure to get the right version:
$MPI_RUN myprogram.exe
mpi4py
If you are using the Python package, mpi4py, you may need to add an option to mpirun, e.g.
$MPI_RUN -genv LD_PRELOAD ${MPI_HOME}/lib/libmpi.so python my_python_mpi_script.py
Local Documentation
$MPI_HOME/share/doc/mvapich2
Also available via "man" command.
Intel MPI
NOTE: We do not have a current license for this as of 2015-01-01.
This is for Intel CPUs only.
Running
Use one of these PEs:
- intelmpi
- intelmpi_rr
These PEs have a pre-processing step which informs Intel MPI about the distribution of slots and nodes.
The number of processes need not be specified for mpirun, but use "-rmk sge" to enable Grid Engine integration:
mpirun -rmk sge my_program arg1 arg2
Parallel Environments for Jobs
For most parallel jobs, it is probably more efficient to use multiples of whole nodes. To do that, use either the "fixed16" or "fixed64" parallel environments. While this may mean waiting for full nodes to become free, such jobs would probably run faster than ones where the slot allocations are spread widely. Please see the article on Writing Job Scripts for details on appropriate parallel environments.
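For example, a sketch assuming fixed16 allocates exactly 16 slots per node, so requesting 64 slots yields four whole nodes:
#$ -pe fixed16 64
${MPI_RUN} myprogram --opt optval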
References
[1] The Message Passing Interface (MPI) official website
[2] Message Passing Interface Wikipedia article
[3] Intel® Hyper-Threading Technology
[5] Slurm 21.08.8 Documentation - sbatch
[6] StackOverflow: Will mvapich be substantially better than openmpi? And How?