
Compiling High Performance Linpack (HPL)

The nodes referred to in this article are Proteus nodes, and Proteus has been decommissioned.

Overview

High Performance Linpack (HPL)[1] is a portable implementation of the standard Linpack benchmark, used to determine the TOP500[2] rankings. The benchmark solves "a (random) dense linear system in double precision (64 bits) arithmetic on distributed-memory computers." As it is well-tested and kept up to date, it also makes a good example case for compiling HPC software.

Here, we link to makefiles used to compile HPL on Proteus using various combinations of the available compilers, math libraries, and MPI implementations. The resulting xhpl executables were used to benchmark the performance of Proteus.

In the CPU-specific sections below, we list the module names for the compilers, math libraries, and MPI implementations used.

HPL reads the input file HPL.dat for parameters, which need to be tuned for best performance. See: https://www.netlib.org/benchmark/hpl/faqs.html

Theoretical Peak Performance and Efficiency

Theoretical peak performance is based only on CPU specifications.[3] Efficiency is the ratio of actual performance to the theoretical peak. Note that the theoretical peak depends on which clock speed is used: the AMD figures below use the base clock, while the Intel figures are calculated with the "turbo boost" clock, as noted in each section.

The formula is given by:

\(\mathrm{FLOP/s} = \mathrm{sockets} \times \frac{\mathrm{cores}}{\mathrm{socket}} \times \frac{\mathrm{cycles}}{\mathrm{second}} \times \frac{\mathrm{FLOPs}}{\mathrm{cycle}}\)

FLOP-per-cycle figures depend on the CPU microarchitecture; per-model numbers are given in the AMD and Intel sections below.

The formula for the theoretical maximum performance of Skylake CPUs is G = C × TPV × L × W × H × U × F,[4] where:

  • G is the expected performance in GFLOP/s
  • C is the number of cores per socket
  • TPV is the maximum Turbo Boost frequency with all cores loaded with AVX-512 instructions
  • L is the number of sockets
  • W is the vector width (W=16 for single precision, W=8 for double precision)
  • H is the throughput of the instruction (H=1 instruction per cycle for addition, multiplication, subtraction, and FMA)
  • U is the number of FMA units (U=2 for Platinum 81xx and Gold 61xx processors, U=1 for all others)
  • F is the number of FLOPs per instruction (F=2 for FMA, F=1 for all other instructions)

See also the Microway report on Intel Skylake CPUs.[5]

Clock speeds may also depend on hardware features being used, e.g. AVX-512. See WikiChip[6] for details for each CPU.
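As an illustration (not part of HPL or of the makefiles linked below), the following Python sketch implements both the generic peak formula and the Skylake-specific formula above; the function and parameter names are illustrative only, and the example values are taken from the CPU sections below.

# Minimal sketch: theoretical peak in GFLOP/s from the formulas above.

def peak_gflops(sockets, cores_per_socket, ghz, flop_per_cycle):
    """Generic formula: sockets x cores/socket x cycles/second x FLOP/cycle."""
    return sockets * cores_per_socket * ghz * flop_per_cycle

def skylake_peak_gflops(C, TPV, L, W=8, H=1, U=2, F=2):
    """Skylake formula G = C x TPV x L x W x H x U x F; defaults are double
    precision (W=8), FMA throughput H=1, and two FMA units (U=2) as on
    Gold 61xx / Platinum 81xx parts."""
    return C * TPV * L * W * H * U * F

# Both formulas agree for a 2-socket Xeon Gold 6148 node at the 3.70 GHz
# turbo clock (20 cores per socket, 32 DP FLOP per cycle per core):
print(peak_gflops(2, 20, 3.70, 32))      # 4736.0
print(skylake_peak_gflops(20, 3.70, 2))  # 4736.0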

AMD

Opteron 6378 Piledriver - 2.4 GHz clock speed - 16 cores per socket, 4 sockets per node

  • 64 cores, single node: 614.4 GFLOPS
  • 256 cores, 4 nodes: 2457.6 GFLOPS
  • 1024 cores, 16 nodes: 9830 GFLOPS
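Working backwards from these figures, the Opteron 6378 delivers 4 DP FLOP per cycle per core (\(614.4 / (64 \times 2.4) = 4\)), so the single-node peak follows from the formula above as \(4\ \mathrm{sockets} \times 16\ \mathrm{cores} \times 2.4\,\mathrm{GHz} \times 4\ \mathrm{FLOP/cycle} = 614.4\ \mathrm{GFLOP/s}\).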

Intel

Xeon E5-2670 Sandy Bridge - 2.6 GHz base clock speed, 3.3 GHz "turbo" clock speed (the peak numbers below are calculated with the turbo clock) - 8 cores per socket, 2 sockets per node, 8 double precision FLOP per cycle (16 FLOP per cycle in single precision).

  • 16 cores, single node: 422.4 GFLOPS DP
  • 64 cores, 4 nodes: 1690 GFLOPS DP
  • 1280 cores, 80 nodes: 33.8 TFLOPS DP
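As a check against the formula above, a single node at the 3.3 GHz turbo clock gives \(2\ \mathrm{sockets} \times 8\ \mathrm{cores} \times 3.3\,\mathrm{GHz} \times 8\ \mathrm{FLOP/cycle} = 422.4\ \mathrm{GFLOP/s}\); the multi-node figures scale linearly with the node count.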

Advice from Intel for setting parameters based on CPU: https://software.intel.com/en-us/mkl-windows-developer-guide-configuring-parameters

  • Ivy Bridge - 8 DP FLOP per cycle
  • Haswell & Broadwell - 16 DP FLOP per cycle
  • Skylake (server) - 32 DP FLOP per cycle (16 DP FLOP per cycle per AVX-512 unit; the Gold 6148 has 2 AVX-512 units)

New Intel Skylake Nodes

The new ic23 nodes have 2x Intel Xeon Gold 6148 CPUs - 2.40 GHz base clock speed, 3.70 GHz "turbo" clock speed, 20 cores per socket. Each core has two AVX-512 units, together performing 32 double precision FLOP per cycle (64 single precision FLOP per cycle):

  • 40 cores, single node: 4,736 GFLOPS DP
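Applying the Skylake formula above, and taking TPV to be the 3.70 GHz turbo clock quoted here (rather than the lower all-core AVX-512 frequency), gives \(G = 20 \times 3.70 \times 2 \times 8 \times 1 \times 2 \times 2 = 4736\ \mathrm{GFLOP/s}\), which matches the figure above.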

Performance of Your Specific Application

There are almost no general recommendations that can be made for the optimal performance of your specific computation with your specific application. Even with the same software package, different problems can behave very differently.

You must benchmark your own computation to figure out an optimal configuration (number of slots, number of slots per node, number of threads per rank, etc).

AMD CPUs

Compiler | Math Library     | MPI implementation      | Makefile                     | Speed (1-node, 64-core)   | Speed (4-node, 256-core)  | Speed (16-node, 1024-core)
open64   | acml/open64/fma4 | proteus-mvapich2/open64 | Make.open64_acml_mvapich.txt | 446.9 GFLOPS - 72.7% eff. | 1708 GFLOPS - 69.5% eff.  | 7084 GFLOPS - 72.1% eff.
open64   | acml/open64/fma4 | proteus-openmpi/open64  | Make.open64_acml_openmpi.txt | 426.4 GFLOPS - 69.4% eff. | 1003 GFLOPS† - 40.8% eff. | TBA

Intel CPUs

Compiler       | Math Library | MPI implementation     | Makefile                      | Speed (1-node, 16-core) | Speed (4-node, 64-core) | Speed (16-node, 256-core) | Speed (80-node, 1280-core)
intel/compiler | intel/mkl    | proteus-mvapich2/intel | Make.intel64_mkl_mvapich.txt  | 281.8 GFLOPS            | 954.2 GFLOPS            | TBA                       | TBA
intel/compiler | intel/mkl    | proteus-openmpi/intel  | Make.intel64_mkl_openmpi.txt  | 281.0 GFLOPS            | 992.2 GFLOPS            | 959.8 GFLOPS†             | TBA
intel/compiler | intel/mkl    | intel-mpi              | Make.intel64_mkl_intelmpi.txt | 288.6 GFLOPS            | 934.5 GFLOPS            | TBA                       | TBA

† cluster under other load
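As defined earlier, efficiency is the ratio of measured performance to theoretical peak; for example, the first AMD result above gives \(446.9 / 614.4 \approx 0.727\), i.e. 72.7% efficiency.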

Tuning HPL.dat
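The HPL FAQ linked above gives the usual starting points: choose the problem size N so that the \(8 N^2\)-byte matrix fills roughly 80% of total memory, pick a block size NB in the 32-256 range (and make N a multiple of it), and use a process grid with \(P \times Q\) equal to the number of MPI ranks, P \(\le\) Q, and the grid as close to square as possible. The Python sketch below (not an official tool; all names are illustrative) computes such starting values; the best settings still have to be found by experiment.

import math

def suggest_n(total_mem_gib, mem_fraction=0.8, nb=192):
    """Largest N (a multiple of nb) whose 8*N^2-byte matrix fits in
    mem_fraction of total node memory."""
    bytes_avail = total_mem_gib * (1024 ** 3) * mem_fraction
    n = int(math.sqrt(bytes_avail / 8))   # 8 bytes per double
    return (n // nb) * nb

def suggest_pq(nranks):
    """Process grid P x Q = nranks with P <= Q, as square as possible."""
    p = int(math.sqrt(nranks))
    while nranks % p:
        p -= 1
    return p, nranks // p

# e.g. a 64 GiB node running 64 MPI ranks:
print(suggest_n(64))    # 82752
print(suggest_pq(64))   # (8, 8)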

Running the Job

MVAPICH

$MPI_RUN -np $NSLOTS ./xhpl

Intel MPI

$MPI_RUN -np $NSLOTS ./xhpl

OpenMPI

$MPI_RUN ./xhpl

Other Benchmarks

See Also

References

[1] HPL - A Portable Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers

[2] TOP500 Supercomputer Sites

[3] How to calculate peak theoretical performance of a CPU-based HPC system

[4] Colfax Research: A Survey and Benchmarks of Intel® Xeon® Gold and Platinum Processors

[5] Microway Knowledge Center: Detailed Specifications of the “Skylake-SP” Intel Xeon Processor Scalable Family CPUs

[6] WikiChip: Intel CPU frequency behavior