Compiling Quick Start Guide
Compiling code can be a tricky business. One needs to be aware of the target hardware, the libraries to be used, and possibly the network fabric to be used. Since URCF staff are unlikely to have experience with the specific research codes used by the various groups in their fields of study, staff can offer only general advice on compiling.
Proteus Hardware
Proteus compute nodes are of two types:[1]
Intel Xeon CPUs
- CPU: Intel(R) Xeon(R) CPU E5-2670
- architecture name for GCC compilers: core-avx-i
AMD Opteron CPUs
- CPU: AMD Opteron™ Processor 6378
- architecture name for GCC and Open64 compilers: bdver2
See the GCC 4.8.2 Manual: i386 and x86_64 Options
Each of those types of nodes has a corresponding login node with matching architecture:
proteusi01 -- login node with Intel Xeon CPUs
proteusa01 -- login node with AMD Opteron CPUs
Compilers
The default compiler is gcc 4.8.1, which is provided by the module[2] gcc/4.8.1. This is different from the version of gcc (4.4.7) which ships with Red Hat Enterprise Linux. However, gcc produces executables which may not be the most optimized for either the Intel or the AMD architecture. Use it only if you wish your executables to run on both the Intel and the AMD nodes.
The CPU vendors produce their own highly-tuned compilers for their own chips. They also provide highly-tuned math libraries: Math Kernel Library (MKL) for Intel, and AMD Core Math Library (ACML) for AMD. Do not use the generic BLAS and LAPACK modules. For GCC, you may try one of the OpenBLAS modules (forked from GotoBLAS, a hand-tuned implementation of BLAS). Refer to Proteus Hardware and Software for up-to-date information on the installed CPUs.
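As a rough sketch (the exact OpenBLAS module name varies -- do "module avail" to check -- and the example assumes the module puts the library on the link path), linking a GCC-built program against OpenBLAS might look like:
[juser@proteusi01 ~]$ module load gcc/4.8.1 openblas
[juser@proteusi01 ~]$ gcc -O3 -o myprog myprog.c -lopenblas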
- Intel Composer XE[3]
- full development suite including profiler, debugger, optimized math and linear algebra libraries, and an implementation of MPI-2
- to use this compiler: "module load intel/compiler intel/mkl"; there are other subproducts in the Composer suite -- do "module avail" to see. As of 2014-01-31, the Intel compute nodes are Xeon E5-2670 with Sandy Bridge microarchitecture.
- AMD Open64[4]
- compiler suite
- to use this compiler: "module load open64"; you will also want an appropriate AMD Core Math Library (ACML),[5] provided by one of the acml modules. As of 2014-01-31, the AMD compute nodes are Opteron 6378 with Piledriver microarchitecture: use the "bdver2" architecture name.
WARNING: Code compiled with specialized optimization flags for one architecture may not run on another architecture, even on other CPUs from the same manufacturer. Cases in point: SSE optimizations, and FMA4 optimizations.
Please refer to the vendors' documentation for details on using these products.
Architecture-Specific Optimization Options
Each compiler product has its own way of specifying architecture-specific optimizations to use. To get information about the hardware, do "less /proc/cpuinfo" on the compute nodes you wish to target (write a trivial job script to do so).
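A minimal sketch of such a job script (the run-time request is illustrative, and you may need to add a queue or resource selection to land on the node type you intend to target):
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -j y
#$ -l h_rt=00:05:00
# Print one line per distinct CPU model on the node this job lands on
grep 'model name' /proc/cpuinfo | sort -u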
Use the "-march=
cputype
" option to select the specific instruction
set. See the section on #Proteus Hardware above for the appropriate
cputype to use.
You may also select specific instruction set support, such as:
-msse4.2 -mavx
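For example (source file names are placeholders), building the same code on each login node for its matching node type:
[juser@proteusi01 ~]$ gcc -O3 -march=core-avx-i -o myprog.intel myprog.c
[juser@proteusa01 ~]$ gcc -O3 -march=bdver2 -o myprog.amd myprog.c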
N.B.
- FMA and FMA4 (fused multiply add) are supported only on the AMD CPUs.
- getting the wrong architecture type, or even just a generic/base architecture type, can have a large influence on speed. E.g. the time taken for the FFTW3 test suite varied from 36.74 seconds to 403.08 seconds on Intel, depending on which architecture options were given.
Environment Setup for Compilation
Frequently, the compilation process requires the user to set environment variables corresponding to the compiler commands. E.g. for Open64:
[juser@proteusa01 ~]$ export CC=opencc
[juser@proteusa01 ~]$ export CXX=openCC
[juser@proteusa01 ~]$ export FC=openf90
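These variables are typically picked up by autoconf-style configure scripts and makefiles; a minimal sketch of a build using them (package contents and install prefix are placeholders):
[juser@proteusa01 ~]$ ./configure --prefix=$HOME/sw/mypackage
[juser@proteusa01 ~]$ make && make install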
For MPI, the appropriate modules should have environment variables like MPICC and MPICXX set.
Intel Composer XE
- Modules:
- intel/compiler/64
- intel/mkl/64
- intel-mpi/64
- Other Intel Composer XE components are also available. See the output of "module avail"
- Compiler commands
- C: icc
- C++: icpc
- Fortran (77/90/95): ifort
- Help flag: "-help"
An example compiling the High Performance Linpack suite for TOP500 runs: http://software.intel.com/en-us/articles/performance-tools-for-software-developers-hpl-application-note
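As a rough sketch (the optimization and MKL-link flags shown are common choices for the Sandy Bridge nodes, not requirements), a build on the Intel login node might look like:
[juser@proteusi01 ~]$ module load intel/compiler/64 intel/mkl/64
[juser@proteusi01 ~]$ icc -O3 -xAVX -o myprog myprog.c -mkl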
Open64
- Modules
- open64
- acml/open64/* -- there are various versions, with and without FMA4 optimization, and with and without OpenMP symmetric multiprocessing support
- Compiler commands
- C: opencc
- C++: openCC
- Fortran (77/90/95): openf90, openf95
- Help flag: "--help" (two -'s)
GCC 4.8.1
- Module
- gcc/4.8.1
- Compiler commands:
- C: gcc
- C++: g++
- Fortran: gfortran
- Help flag: "--help" (two -'s)
Architecture-specific options: http://gcc.gnu.org/onlinedocs/gcc-4.8.2/gcc/i386-and-x86-64-Options.html#i386-and-x86-64-Options
LLVM and clang
LLVM 3.6.2, 3.7.1, and 3.8.1 with corresponding clang versions are installed. Use one of the following modulefiles:
llvm/3.6.2
llvm/3.7.1
llvm/3.8.1
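For example, a minimal sketch; -march=native targets the CPU of the machine you compile on, so build on the login node that matches your target compute nodes:
[juser@proteusi01 ~]$ module load llvm/3.8.1
[juser@proteusi01 ~]$ clang -O3 -march=native -o myprog myprog.c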
MPI-2 Implementations
MPI-2 is a standard specification for message-passing parallel programming.[6] There are several implementations of the standard:
- MPICH2 -- only supports IP over InfiniBand: avoid using if possible
- MVAPICH2[7][8] -- supports InfiniBand IB verbs
- OpenMPI[9] -- supports InfiniBand IB verbs
- Intel MPI[10] (part of the Intel Composer XE package) -- supports InfiniBand IB verbs
The source code of each of the open-source implementations (MPICH2, MVAPICH2, OpenMPI) may be compiled with any of the three compiler suites mentioned above, so there are nine possible compiled implementations of MPI-2. You would want to pick one that matches the target hardware for your job.
When selecting a particular MPI-2 implementation to use for compiling, be sure to get one which matches the hardware you are targeting. Do "module avail" to see what's available. You will see packages like:
- mvapich2/gcc/64/1.9
- mvapich2/intel/64/1.9
- openmpi/intel/64/1.6.5
- openmpi/open64/64/1.6.5
- proteus-mvapich2/intel/64/1.9-mlnx-ofed
- proteus-openmpi/intel/64/1.6.5-mlnx-ofed
- proteus-openmpi/open64/64/1.6.5-mlnx-ofed
The modules prefixed with "proteus-" are packages compiled by URCF staff.
For OpenMPI in particular, the proteus-* packages have also been compiled with Grid Engine integration:[11]
- It simplifies the "mpirun" command line: the number of slots/processes will be read from the SGE environment.
- It supports the -notify flag to qsub, i.e. the job script will receive a SIGUSR1 signal before the SIGTSTP signal.
- It is possible to use SIGTSTP and SIGCONT signals to pause and resume a job.
If you wish to use certain parallel libraries, e.g. FFTW3, you must make sure the MPI-2 implementation and the compiler match. Say, if you want to compile for AMD using OpenMPI, linking with FFTW3, you would select the module fftw3/openmpi/open64/64/3.3.3. Unfortunately, for FFTW2 and FFTW3, only OpenMPI-compiled versions are offered. For Intel, MKL-linked versions of FFTW are available: see Compiling for Intel with Intel Composer XE, MKL, and Intel MPI.
URCF staff have not evaluated the relative speed/efficiency of the various MPI-2 implementations. You may wish to run your own benchmarks before deciding which implementation to use. However, you should select either MVAPICH2 or OpenMPI for proper InfiniBand support.
Details on compiling and linking with the available MPI-2 implementations can be seen in the article on the Message Passing Interface.
Writing an MPI-2 Program
Writing an MPI-2 program, or in fact any parallel code, is complex. It is beyond the scope of this documentation to provide guidance on MPI-2 coding. Please see external resources.[12][13]
Compiling an MPI-2 Program
All MPI-2 implementations provide the following compilers:
- mpicc - for C code
- mpiCC - for C++ code
- mpif77 - for Fortran 77 code
- mpif90 - for Fortran 90 code
Frequently, makefiles for MPI-2 code will expect environment variables such as MPICC. The modules for the MPI implementations should have these set, but it pays to check: env | grep MPI
N.B. In the next version of OpenMPI, mpif77 and mpif90 will be deprecated in favor of mpifort.
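For example, a brief sketch of compiling a C source file with the GCC-built MVAPICH2 module (source and output file names are placeholders):
[juser@proteusi01 ~]$ module load gcc/4.8.1
[juser@proteusi01 ~]$ module load mvapich2/gcc/64/1.9
[juser@proteusi01 ~]$ mpicc -O3 -o myprog_mpi myprog_mpi.c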
Running an MPI-2 Program
In all cases, the specific MPI module must be loaded after the module for the underlying compiler. E.g. if you want to use MVAPICH2 compiled with Intel compilers:
module load intel/compiler
...
module load proteus-mvapich2/intel/64/1.9-mlnx-ofed
All MPI-2 implementations provide the mpirun command. For MVAPICH2, mpirun will figure out the Grid Engine environment:
# Job script snippet - MVAPICH2
mpirun ./myprogram
Or, you may specify it explicitly:
# Job script snippet - MVAPICH2
mpirun -rmk sge ./myprogram
For OpenMPI, using the proteus-openmpi/* modules, integration with SGE means you can just do:
# Job script snippet - OpenMPI
$MPIRUN ./myprogram
OpenMPI's mpirun is aware of the Grid Engine environment, and knows about $NSLOTS.
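Putting it together, a skeletal job script for a proteus-openmpi build might look like the following. The parallel environment name "openmpi_ib" and the slot count are assumptions -- check "qconf -spl" for the parallel environments actually defined on Proteus:
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -j y
#$ -pe openmpi_ib 32
module load intel/compiler/64
module load proteus-openmpi/intel/64/1.6.5-mlnx-ofed
# $MPIRUN is assumed to be set by the loaded MPI module, as in the snippets above
$MPIRUN ./myprogram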
Intel MPI integration with Grid Engine is done via the appropriate Parallel Environment. N.B. this may or may not work well -- the integration method is not well-documented by either Intel or Univa:
#$ -pe intelmpi 24
...
$MPIRUN ./myprog
References
[1] Proteus Hardware and Software
[2] Environment Modules Quick Start Guide
[4] AMD x86 Open64 Compiler Suite website
[5] AMD Developer Central: Building with ACML
[9] OpenMPI online documentation
[10] Intel MPI official website
[11] OpenMPI FAQ -- Running jobs under SGE
[12] Online MPI Tutorial (also available as a Kindle eBook), Kendall
[13] Using MPI: Portable Parallel Programming with the Message-Passing Interface, Gropp et al.