Compiling Quick Start Guide
Compiling code can be a tricky business. One needs to be aware of the target hardware, the libraries to be used, and possibly the network fabric to be used. Since URCF staff are unlikely to have experience with the specific research codes used by the various groups in their fields of study, staff can offer only general advice on compiling.
Proteus Hardware
Proteus compute nodes are of two types:[1]
Intel Xeon CPUs
- CPU: Intel(R) Xeon(R) CPU E5-2670
- architecture name for GCC compilers: core-avx-i
AMD Opteron CPUs
- CPU: AMD Opteron™ Processor 6378
- architecture name for GCC and Open64 compilers: bdver2
See the GCC 4.8.2 Manual: i386 and x86_64 Options
Each of those types of nodes has a corresponding login node with matching architecture:
proteusi01 -- login node with Intel Xeon CPUs
proteusa01 -- login node with AMD Opteron CPUs
Compilers
The default compiler is gcc 4.8.1, which is provided by the module[2] gcc/4.8.1. This is different from the version of gcc (4.4.7) which ships with Red Hat Enterprise Linux. However, gcc produces executables which may not be the most optimized for either the Intel or the AMD architecture. Use it only if you wish your executables to run on both the Intel and the AMD nodes.
The CPU vendors produce their own highly-tuned compilers for their own chips. They also provide highly-tuned math libraries: Math Kernel Library (MKL) for Intel, and AMD Core Math Library (ACML) for AMD. Do not use the generic BLAS and LAPACK modules. For GCC, you may try one of the OpenBLAS modules (forked from GotoBLAS, a hand-tuned implementation of BLAS). Refer to Proteus Hardware and Software for up-to-date information on the installed CPUs.
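As a rough sketch (the exact OpenBLAS module name varies -- do "module avail" to check -- and the example assumes the module puts the library on the link path), linking a GCC-built program against OpenBLAS might look like:
[juser@proteusi01 ~]$ module load gcc/4.8.1 openblas
[juser@proteusi01 ~]$ gcc -O3 -o myprog myprog.c -lopenblas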
- Intel Composer XE[3]
- full development suite including profiler, debugger, optimized math and linear algebra libraries, and an implementation of MPI-2
- to use this compiler: "module load intel/compiler intel/mkl"; there are other subproducts in the Composer suite -- do "module avail" to see. As of 2014-01-31, the Intel compute nodes are Xeon E5-2670 with Sandy Bridge microarchitecture.
- AMD Open64[4]
- compiler suite
- to use this compiler: "module load open64"; you will also want an appropriate AMD Core Math Library (ACML),[5] provided by one of the acml modules. As of 2014-01-31, the AMD compute nodes are Opteron 6378 with Piledriver microarchitecture: use the "bdver2" architecture name.
WARNING: Code compiled with specialized optimization flags for one architecture may not run on another architecture, even on other CPUs from the same manufacturer. Cases in point: SSE optimizations, and FMA4 optimizations.
Please refer to the vendors' documentation for details on using these products.
Architecture-Specific Optimization Options
Each compiler product has its own way of specifying architecture-specific optimizations to use. To get information about the hardware, do "less /proc/cpuinfo" on the compute nodes you wish to target (write a trivial job script to do so).
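A minimal sketch of such a job script (the run-time request is illustrative, and you may need to add a queue or resource selection to land on the node type you intend to target):
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -j y
#$ -l h_rt=00:05:00
# Print one line per distinct CPU model on the node this job lands on
grep 'model name' /proc/cpuinfo | sort -u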
Use the "-march=
cputype
" option to select the specific instruction
set. See the section on #Proteus Hardware above for the appropriate
cputype to use.
You may also select specific instruction set support, such as:
-msse4.2 -mavx
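For example (source file names are placeholders), building the same code on each login node for its matching node type:
[juser@proteusi01 ~]$ gcc -O3 -march=core-avx-i -o myprog.intel myprog.c
[juser@proteusa01 ~]$ gcc -O3 -march=bdver2 -o myprog.amd myprog.c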
N.B.
- FMA and FMA4 (fused multiply add) are supported only on the AMD CPUs.
- getting the wrong architecture type, or even just a generic/base architecture type, can have a large influence on speed. E.g. the time taken for the FFTW3 test suite varied from 36.74 seconds to 403.08 seconds on Intel, depending on which architecture options were given.
Environment Setup for Compilation
Frequently, the compilation process requires the user to set environment variables corresponding to the compiler commands. E.g. for Open64:
[juser@proteusa01 ~]$ export CC=opencc
[juser@proteusa01 ~]$ export CXX=openCC
[juser@proteusa01 ~]$ export FC=openf90
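These variables are typically picked up by autoconf-style configure scripts and makefiles; a minimal sketch of a build using them (package contents and install prefix are placeholders):
[juser@proteusa01 ~]$ ./configure --prefix=$HOME/sw/mypackage
[juser@proteusa01 ~]$ make && make install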
For MPI, the appropriate modules should have environment variables like MPICC and MPICXX set.
Intel Composer XE
- Modules:
- intel/compiler/64
- intel/mkl/64
- intel-mpi/64
- Other Intel Composer XE components are also available. See the output of "module avail"
- Compiler commands
- C: icc
- C++: icpc
- Fortran (77/90/95): ifort
- Help flag: "-help"
An example compiling the High Performance Linpack suite for TOP500 runs: http://software.intel.com/en-us/articles/performance-tools-for-software-developers-hpl-application-note
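As a rough sketch (the optimization and MKL-link flags shown are common choices for the Sandy Bridge nodes, not requirements), a build on the Intel login node might look like:
[juser@proteusi01 ~]$ module load intel/compiler/64 intel/mkl/64
[juser@proteusi01 ~]$ icc -O3 -xAVX -o myprog myprog.c -mkl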
Open64
- Modules
- open64
- acml/open64/* -- there are various versions, with and without FMA4 optimization, and with and without OpenMP symmetric multiprocessing support
- Compiler commands
- C: opencc
- C++: openCC
- Fortran (77/90/95): openf90, openf95
- Help flag: "--help" (two -'s)
GCC 4.8.1
- Module
- gcc/4.8.1
- Compiler commands:
- C: gcc
- C++: g++
- Fortran: gfortran
- Help flag: "--help" (two -'s)
Architecture-specific options: http://gcc.gnu.org/onlinedocs/gcc-4.8.2/gcc/i386-and-x86-64-Options.html#i386-and-x86-64-Options
LLVM and clang
LLVM 3.6.2, 3.7.1, and 3.8.1 with corresponding clang versions are installed. Use one of the following modulefiles:
llvm/3.6.2
llvm/3.7.1
llvm/3.8.1
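For example, a minimal sketch; -march=native targets the CPU of the machine you compile on, so build on the login node that matches your target compute nodes:
[juser@proteusi01 ~]$ module load llvm/3.8.1
[juser@proteusi01 ~]$ clang -O3 -march=native -o myprog myprog.c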
MPI-2 Implementations
MPI-2 is a standard specification for message-passing parallel programming.[6] There are several implementations of the standard:
- MPICH2 -- only supports IP over InfiniBand: avoid using if possible
- MVAPICH2[7][8] -- supports InfiniBand IB verbs
- OpenMPI[9] -- supports InfiniBand IB verbs
- Intel MPI[10] (part of the Intel Composer XE package) -- supports InfiniBand IB verbs
The source code of each of the open-source implementations (MPICH2, MVAPICH2, OpenMPI) may be compiled with any of the three compiler suites mentioned above, so there are nine possible compiled implementations of MPI-2. You would want to pick one that matches the target hardware for your job.
When selecting a particular MPI-2 implementation to use for compiling, be sure to get one which matches the hardware you are targeting. Do "module avail" to see what's available. You will see packages like:
- mvapich2/gcc/64/1.9
- mvapich2/intel/64/1.9
- openmpi/intel/64/1.6.5
- openmpi/open64/64/1.6.5
- proteus-mvapich2/intel/64/1.9-mlnx-ofed
- proteus-openmpi/intel/64/1.6.5-mlnx-ofed
- proteus-openmpi/open64/64/1.6.5-mlnx-ofed
The modules prefixed with "proteus-" are packages compiled by URCF staff.
For OpenMPI in particular, the proteus-* packages have also been compiled with Grid Engine integration:[11]
- It simplifies the "mpirun" command line: the number of slots/processes will be read from the SGE environment.
- It supports the -notify flag to qsub, i.e. the job script will receive a SIGUSR1 signal before the SIGTSTP signal.
- It is possible to use SIGTSTP and SIGCONT signals to pause and resume a job.
If you wish to use certain parallel libraries, e.g. FFTW3, you must make sure the MPI-2 implementation and the compiler match. Say, if you want to compile for AMD using OpenMPI, linking with FFTW3, you would select the module fftw3/openmpi/open64/64/3.3.3. Unfortunately, for FFTW2 and FFTW3, only OpenMPI-compiled versions are offered. For Intel, MKL-linked versions of FFTW are available: see Compiling for Intel with Intel Composer XE, MKL, and Intel MPI.
URCF staff have not evaluated the relative speed/efficiency of the various MPI-2 implementations. You may wish to run your own benchmarks before deciding which implementation to use. However, you should select either MVAPICH2 or OpenMPI for proper InfiniBand support.
Details on compiling and linking with the available MPI-2 implementations can be seen in the article on the Message Passing Interface.
Writing an MPI-2 Program
Writing an MPI-2 program, or in fact any parallel code, is complex. It is beyond the scope of this documentation to provide guidance on MPI-2 coding. Please see external resources.[12][13]
Compiling an MPI-2 Program
All MPI-2 implementations provide the following compilers:
- mpicc - for C code
- mpiCC - for C++ code
- mpif77 - for Fortran 77 code
- mpif90 - for Fortran 90 code
Frequently, makefiles for MPI-2 code will expect environment variables such as MPICC. The modules for the MPI implementations should have these set, but it pays to check: env | grep MPI
N.B. In the next version of OpenMPI, mpif77 and mpif90 will be deprecated in favor of mpifort.
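For example, a brief sketch of compiling a C source file with the GCC-built MVAPICH2 module (source and output file names are placeholders):
[juser@proteusi01 ~]$ module load gcc/4.8.1
[juser@proteusi01 ~]$ module load mvapich2/gcc/64/1.9
[juser@proteusi01 ~]$ mpicc -O3 -o myprog_mpi myprog_mpi.c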
Running an MPI-2 Program
In all cases, the specific MPI module must be loaded after the module for the underlying compiler. E.g. if you want to use MVAPICH2 compiled with Intel compilers:
module load intel/compiler
...
module load proteus-mvapich2/intel/64/1.9-mlnx-ofed
All MPI-2 implementations provide the mpirun command. For MVAPICH2, mpirun will figure out the Grid Engine environment:
# Job script snippet - MVAPICH2
mpirun ./myprogram
Or, you may specify it explicitly:
# Job script snippet - MVAPICH2
mpirun -rmk sge ./myprogram
For OpenMPI, using the proteus-openmpi/* modules, integration with SGE means you can just do:
# Job script snippet - OpenMPI
$MPIRUN ./myprogram
OpenMPI's mpirun is aware of the Grid Engine environment, and knows about $NSLOTS.
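Putting it together, a skeletal job script for a proteus-openmpi build might look like the following. The parallel environment name "openmpi_ib" and the slot count are assumptions -- check "qconf -spl" for the parallel environments actually defined on Proteus:
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -j y
#$ -pe openmpi_ib 32
module load intel/compiler/64
module load proteus-openmpi/intel/64/1.6.5-mlnx-ofed
# $MPIRUN is assumed to be set by the loaded MPI module, as in the snippets above
$MPIRUN ./myprogram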
Intel MPI integration with Grid Engine is done via the appropriate Parallel Environment. N.B. this may or may not work well -- the integration method is not well-documented by either Intel or Univa:
#$ -pe intelmpi 24
...
$MPIRUN ./myprog
References
[1] Proteus Hardware and Software
[2] Environment Modules Quick Start Guide
[4] AMD x86 Open64 Compiler Suite website
[5] AMD Developer Central: Building with ACML
[9] OpenMPI online documentation
[10] Intel MPI official website
[11] OpenMPI FAQ -- Running jobs under SGE
[12] Online MPI Tutorial (also available as a Kindle eBook), Kendall
[13] Using MPI: Portable Parallel Programming with the Message-Passing Interface, Gropp et al.