Message Passing Interface
Overview
The Message Passing Interface is a standardized and portable message-passing system for parallel computation using multiple physical computers (nodes).[1][2]
There are multiple implementations of the standard, some of which are available on Picotte.
Available Implementations
Picotte
These packages are compiler-specific:
- Open MPI
  - GCC
    - picotte-openmpi/gcc/4.0.5
    - picotte-openmpi/gcc/4.1.0
    - picotte-openmpi/gcc/4.1.4
  - Intel ICC
    - picotte-openmpi/intel/2020/4.0.5
    - picotte-openmpi/intel/2020/4.1.0
    - picotte-openmpi/intel/2020/4.1.2
    - picotte-openmpi/intel/2020/4.1.4
  - CUDA-enabled -- first, do "module use /ifs/opt_cuda/modulefiles" (see the example below)
    - picotte-openmpi/cuda11.0/4.0.5 (uses GCC)
    - picotte-openmpi/cuda11.0/4.1.0 (uses GCC)
    - picotte-openmpi/cuda11.2/4.1.0 (uses Intel ICC)
    - picotte-openmpi/cuda11.2/4.1.4 (uses Intel ICC)
    - picotte-openmpi/cuda11.4/4.1.4 (uses GCC)
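To make the CUDA-enabled builds visible to the module command and list them, do, for example:
[juser@picotte001 ~]$ module use /ifs/opt_cuda/modulefiles
[juser@picotte001 ~]$ module avail picotte-openmpi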
For the CUDA-enabled implementations, you can see the compiler version by doing:
[juser@gpu001 ~]$ module load picotte-openmpi/cuda11.4/4.1.4
[juser@gpu001 ~]$ mpicc --version
icc (ICC) 19.1.3.304 20200925
Copyright (C) 1985-2020 Intel Corporation. All rights reserved.
The CUDA-enabled implementations also require loading the corresponding CUDA modulefile.
Picotte Hardware Notes
Details of all hardware are at: Picotte Hardware and Software.
All Picotte compute nodes use one of these two CPUs:
- Intel Xeon Platinum 8268 (def, bm partitions)
- Intel Xeon Platinum 8260 (gpu partition)
You should not need to manually specify NUMA layouts, since Open MPI uses hwloc to determine them.
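If you want to check how Open MPI actually places and binds ranks, its mpirun accepts a --report-bindings option, which prints one binding report per rank. A minimal sketch (myprogram is a placeholder):
mpirun --report-bindings ./myprogram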
You can view the "socket, core, thread" configuration using the sinfo_detail alias from the slurm_util modulefile:
[juser@picotte001 ~]$ module load slurm_util
[juser@picotte001 ~]$ sinfo_detail -p def
NODELIST NODES PART STATE CPUS S:C:T GRES MEMORY FREE_MEM TMP_DISK CPU_LOAD REASON
node001 1 def* mixed 48 4:12:1 (null) 192000 140440 864000 4.08 none
node002 1 def* mixed 48 4:12:1 (null) 192000 171040 864000 3.00 none
node003 1 def* mixed 48 4:12:1 (null) 192000 171621 864000 3.03 none
node004 1 def* mixed 48 4:12:1 (null) 192000 147536 864000 3.00 none
node005 1 def* mixed 48 4:12:1 (null) 192000 162570 864000 3.00 none
node006 1 def* mixed 48 4:12:1 (null) 192000 169135 864000 2.99 none
...
node072 1 def* mixed 48 4:12:1 (null) 192000 157423 864000 3.00 none
node073 1 def* mixed 48 4:12:1 (null) 192000 157114 864000 3.00 none
node074 1 def* mixed 48 4:12:1 (null) 192000 152783 864000 3.00 none
[juser@picotte001 ~]$ sinfo_detail -p gpu
NODELIST NODES PART STATE CPUS S:C:T GRES MEMORY FREE_MEM TMP_DISK CPU_LOAD REASON
gpu001 1 gpu mixed 48 2:24:1 gpu:v100 192000 18191 1637000 4.69 none
gpu002 1 gpu idle 48 2:24:1 gpu:v100 192000 94592 1637000 0.00 none
...
gpu011 1 gpu idle 48 2:24:1 gpu:v100 192000 39289 1637000 0.01 none
gpu012 1 gpu idle 48 2:24:1 gpu:v100 192000 142535 1637000 0.15 none
[juser@picotte001 ~]$ sinfo_detail -p bm
NODELIST NODES PART STATE CPUS S:C:T GRES MEMORY FREE_MEM TMP_DISK CPU_LOAD REASON
bigmem001 1 bm idle 48 2:24:1 (null) 1546000 1368526 1724000 0.00 none
bigmem002 1 bm idle 48 2:24:1 (null) 1546000 1541778 1724000 0.00 none
The “S:C:T” column shows “Socket”, “Core”, and “Thread” counts. Here, “Thread” means Intel’s Hyper-Threading,[3] where a single physical core is presented by the hardware as two virtual cores. This feature may increase performance in consumer applications (Office, web browsing, etc.), but it generally decreases performance in compute-intensive applications. In an HPC context, Hyper-Threading is typically turned off; it is disabled on Picotte, so T=1.
Open MPI
Note that Open MPI is not OpenMP[4]. OpenMP is an API for multi-platform shared-memory parallel programming in C/C++ and Fortran, i.e. single-host multithreaded programming on our compute nodes. Open MPI is an implementation of the MPI standard, which provides multi-host parallel execution. The two can be combined: Open MPI distributes ranks across hosts, while OpenMP provides single-host shared-memory parallelism within each rank (see Hybrid MPI-OpenMP Jobs below).
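For example, an MPI program is compiled with the mpicc wrapper, while OpenMP is enabled by a compiler flag; a sketch, with hello_mpi.c and hybrid.c as hypothetical source files:
[juser@picotte001 ~]$ module load picotte-openmpi/gcc/4.1.4
[juser@picotte001 ~]$ mpicc -o hello_mpi hello_mpi.c
[juser@picotte001 ~]$ mpicc -fopenmp -o hybrid hybrid.c
The first command builds a pure MPI program; the second adds -fopenmp so the same build can also use OpenMP threads within each rank.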
Common Environment Variables
Open MPI may be controlled by environment variables named OMPI_*. Please note that some of these should not be changed, because they define necessary compile-time flags and library locations.
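One class of variables you may legitimately want to set are runtime MCA parameters, which Open MPI reads from variables of the form OMPI_MCA_<parameter>. For example (the specific parameter here is only an illustration):
export OMPI_MCA_btl_base_verbose=100
This increases the verbosity of the byte transfer layer, which can help when debugging communication problems.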
For convenience, these environment variables are set -- actual values will vary by version loaded:
MPICC=/ifs/opt/openmpi/intel/2020/4.1.4/bin/mpicc
MPI_CPPFLAGS=-I/ifs/opt/openmpi/intel/2020/4.1.4/include
MPICXX=/ifs/opt/openmpi/intel/2020/4.1.4/bin/mpic++
MPIF77=/ifs/opt/openmpi/intel/2020/4.1.4/bin/mpif77
MPIF90=/ifs/opt/openmpi/intel/2020/4.1.4/bin/mpif90
MPIFC=/ifs/opt/openmpi/intel/2020/4.1.4/bin/mpifort
MPI_HOME=/ifs/opt/openmpi/intel/2020/4.1.4
MPI_INCDIR=/ifs/opt/openmpi/intel/2020/4.1.4/include
MPI_LIBDIR=/ifs/opt/openmpi/intel/2020/4.1.4/lib
MPI_RUN=/ifs/opt/openmpi/intel/2020/4.1.4/bin/mpirun -x LD_LIBRARY_PATH -x BASH_ENV
MPIRUN=/ifs/opt/openmpi/intel/2020/4.1.4/bin/mpirun -x LD_LIBRARY_PATH -x BASH_ENV
OMPI_CFLAGS=-fopenmp
OMPI_LDFLAGS=-L/ifs/opt/openmpi/intel/2020/4.1.4/lib -Wl,-rpath -Wl,/ifs/opt/openmpi/intel/2020/4.1.4/lib
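These variables can be used directly in build commands or makefiles. A sketch, with mycode.c as a placeholder source file:
[juser@picotte001 ~]$ module load picotte-openmpi/intel/2020/4.1.4
[juser@picotte001 ~]$ ${MPICC} -o mycode mycode.c
If you compile with something other than the wrapper, ${MPI_CPPFLAGS} and -L${MPI_LIBDIR} supply the header and library search paths.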
Running
Programs are launched with mpirun. Using the full path to the mpirun command is recommended; it is given by the MPI_RUN environment variable:
${MPI_RUN} myprogram --opt optval
The MPI_RUN environment variable also includes common command-line options to export environment variables (-x LD_LIBRARY_PATH -x BASH_ENV), so you do not need to set them manually.
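Additional -x options may be appended to export other environment variables to all ranks, e.g. (OMP_NUM_THREADS is only an illustration):
${MPI_RUN} -x OMP_NUM_THREADS myprogram --opt optval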
Example: job requests 2 nodes, 48 MPI ranks per node:
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=48
See the sbatch documentation[5] (or the man page on picotte001) for more detailed information.
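Putting the pieces together, a minimal job script might look like the following sketch; the partition, time limit, module version, and program name are placeholders, and your group may require additional options such as an account:
#!/bin/bash
#SBATCH --partition=def
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=48
#SBATCH --time=01:00:00

module load picotte-openmpi/intel/2020/4.1.4

${MPI_RUN} myprogram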
Performance Differences
Performance differences tend to be very application-specific. However, experience suggests that on Intel CPUs, Intel MPI tends to perform better.[6]
Hybrid MPI-OpenMP Jobs
Please see Hybrid MPI-OpenMP Jobs.
MPI for Python
MPI for Python (a.k.a. mpi4py) is a Python module that provides Python bindings for MPI.[7]
Please see: MPI for Python
References
[1] The Message Passing Interface (MPI) official website
[2] Message Passing Interface Wikipedia article
[3] Intel® Hyper-Threading Technology
[5] Slurm 21.08.8 Documentation - sbatch
[6] StackOverflow: Will mvapich be substantially better than openmpi? And How?