Compiling High Performance Linpack (HPL)
The nodes referred to in this article are Proteus nodes, and Proteus has been decommissioned.
Overview♯
High Performance Linpack (HPL)[1] is a portable implementation of the standard Linpack benchmark, used in TOP500[2] ranking determination. The benchmark solves "a (random) dense linear system in double precision (64 bits) arithmetic on distributed-memory computers." As it is well tested and kept up to date, it also serves as a good example of compiling HPC software.
Here, we link to makefiles used to compile HPL on Proteus using various combinations of the available compilers, math libraries, and MPI implementations. The resulting xhpl executables were used to benchmark the performance of Proteus.
The CPU-specific sections below list the module names for the compilers, math libraries, and MPI implementations.
HPL reads the input file HPL.dat for parameters, which need to be tuned for best performance. See: https://www.netlib.org/benchmark/hpl/faqs.html
Theoretical Peak Performance and Efficiency♯
Theoretical peak performance is based only on CPU specifications.[3] Efficiency is the ratio of actual performance to the theoretical peak. The peak figures quoted in this article use the clock speed stated in each section; for the Intel CPUs this is the "turbo boost" clock speed rather than the base clock.
The formula is given by:
\(\mathrm{FLOP/s} = \mathrm{sockets} \times \frac{\mathrm{cores}}{\mathrm{socket}} \times \frac{\mathrm{cycles}}{\mathrm{second}} \times \frac{\mathrm{FLOPs}}{\mathrm{cycle}}\)
FLOPs/cycle numbers:
- from Wikipedia: https://en.wikipedia.org/wiki/FLOPS#FLOPs_per_cycle_for_various_processors
- from StackOverflow: https://stackoverflow.com/a/15657772
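As a sanity check on the formula, here is a minimal Python sketch that plugs in the Xeon E5-2670 numbers from the Intel section below (2 sockets × 8 cores, 3.3 GHz turbo clock, 8 double precision FLOP per cycle) and reproduces the peak figures quoted in this article.

```python
def peak_gflops(sockets, cores_per_socket, ghz, flops_per_cycle):
    """Theoretical peak in GFLOPS: sockets x (cores/socket) x (cycles/s) x (FLOPs/cycle)."""
    return sockets * cores_per_socket * ghz * flops_per_cycle

# Xeon E5-2670 (Sandy Bridge): values from the Intel section below.
node_peak = peak_gflops(sockets=2, cores_per_socket=8, ghz=3.3, flops_per_cycle=8)
print(f"{node_peak:.1f} GFLOPS per node")                 # 422.4 GFLOPS DP
print(f"{4 * node_peak:.1f} GFLOPS on 4 nodes")           # ~1690 GFLOPS DP
print(f"{80 * node_peak / 1000:.1f} TFLOPS on 80 nodes")  # 33.8 TFLOPS DP
```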
The formula for theoretical peak performance of Skylake CPUs is G = C × TPV × L × W × H × U × F, where:[4]
- G is the expected performance in GFLOP/s
- C is the number of cores per socket
- TPV is the maximum Turbo Boost frequency with all cores loaded with AVX-512 instructions
- L is the number of sockets
- W is the vector width (W = 16 for single precision, W = 8 for double precision)
- H is the throughput of the instruction (H = 1 instruction per cycle for addition, multiplication, subtraction, and FMA)
- U is the number of FMA units (U = 2 for Platinum 81xx and Gold 61xx processors, U = 1 for all others)
- F is the number of FLOPs per instruction (F = 2 for FMA, F = 1 for all other instructions)
See also the Microway report on Intel Skylake CPUs.[5]
Clock speeds may also depend on which hardware features are in use, e.g. AVX-512. See WikiChip[6] for details on each CPU.
AMD♯
Opteron 6378 Piledriver - 2.4 GHz clock speed - 16 cores per socket, 4 sockets per node
- 64 cores, single node: 614.4 GFLOPS
- 256 cores, 4 nodes: 2457.6 GFLOPS
- 1024 cores, 16 nodes: 9830 GFLOPS
Intel♯
Xeon E5-2670 Sandy Bridge - 2.6 GHz base clock speed, 3.3 GHz "turbo" clock speed (peak numbers below are calculated with the turbo clock) - 8 cores per socket, 2 sockets per node - 8 double precision FLOP per cycle (16 FLOP per cycle in single precision)
- 16 cores, single node: 422.4 GFLOPS DP
- 64 cores, 4 nodes: 1690 GFLOPS DP
- 1280 cores, 80 nodes: 33.8 TFLOPS DP
Advice from Intel for setting parameters based on CPU: https://software.intel.com/en-us/mkl-windows-developer-guide-configuring-parameters
Ivy Bridge - 8 DP FLOP per cycle
Haswell & Broadwell - 16 DP FLOP per cycle
Skylake (server) - 32 DP FLOP per cycle (16 DP FLOP per cycle per AVX-512 unit; the Gold 6148 has 2 AVX-512 units)
New Intel Skylake Nodes♯
The new ic23 nodes have 2× Intel Xeon Gold 6148 CPUs - 2.40 GHz base clock speed, 3.70 GHz "turbo" clock speed. Each core has two AVX-512 units, together performing 32 double precision FLOP per cycle (64 single precision FLOP per cycle):
- 40 cores, single node: 4,736 GFLOPS DP
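As an illustration, the following Python sketch applies the Skylake formula from the section above to the ic23 nodes. Note that it uses the quoted 3.70 GHz turbo clock for TPV, matching the peak figure above; the sustained all-core AVX-512 frequency is lower in practice (see the WikiChip note above), so this is an upper bound.

```python
# Skylake peak formula: G = C * TPV * L * W * H * U * F
C = 20      # cores per socket (40 cores across 2 sockets)
TPV = 3.70  # clock in GHz (quoted turbo; the sustained AVX-512 clock is lower)
L = 2       # sockets per node
W = 8       # vector width for double precision (AVX-512)
H = 1       # FMA instructions per cycle per unit
U = 2       # FMA units per core (Gold 61xx)
F = 2       # FLOPs per FMA instruction

G = C * TPV * L * W * H * U * F
print(f"{G:.1f}")  # 4736.0 GFLOPS per node, matching the figure above
```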
Performance of Your Specific Application♯
Few general recommendations can be made for the optimal performance of your specific computation with your specific application. Even within the same software package, different problems can behave very differently.
You must benchmark your own computation to find an optimal configuration (number of slots, number of slots per node, number of threads per rank, etc.).
AMD CPUs♯
Compiler | Math Library | MPI implementation | Makefile | Speed (1-node, 64-core) | Speed (4-node, 256-core) | Speed (16-node, 1024-core)
---|---|---|---|---|---|---
open64 | acml/open64/fma4 | proteus-mvapich2/open64 | [[/File:Make.open64_acml_mvapich.txt]] | 446.9 GFLOPS - 72.7% eff. | 1708 GFLOPS - 69.5% eff. | 7084 GFLOPS - 72.1% eff.
open64 | acml/open64/fma4 | proteus-openmpi/open64 | [[/File:Make.open64_acml_openmpi.txt]] | 426.4 GFLOPS - 69.4% eff. | 1003 GFLOPS† - 40.8% eff. | TBA
Intel CPUs♯
Compiler | Math Library | MPI implementation | Makefile | Speed (1-node, 16-core) | Speed (4-node, 64-core) | Speed (16-node, 256-core) | Speed (80-node, 1280-core)
---|---|---|---|---|---|---|---
intel/compiler | intel/mkl | proteus-mvapich2/intel | [[/File:Make.intel64_mkl_mvapich.txt]] | 281.8 GFLOPS | 954.2 GFLOPS | TBA | TBA
intel/compiler | intel/mkl | proteus-openmpi/intel | [[/File:Make.intel64_mkl_openmpi.txt]] | 281.0 GFLOPS | 992.2 GFLOPS | 959.8 GFLOPS† | TBA
intel/compiler | intel/mkl | intel-mpi | [[/File:Make.intel64_mkl_intelmpi.txt]] | 288.6 GFLOPS | 934.5 GFLOPS | TBA | TBA
† cluster under other load
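The efficiency figures in the tables are simply the measured speed divided by the theoretical peak from the corresponding section above; a minimal Python check of the AMD rows:

```python
# Efficiency = measured speed / theoretical peak.
# AMD Opteron 6378 peaks from above: 614.4, 2457.6, and 9830.4 GFLOPS
# (1, 4, and 16 nodes, respectively).
for measured, peak in [(446.9, 614.4), (1708, 2457.6), (7084, 9830.4)]:
    print(f"{measured / peak:.1%}")  # 72.7%, 69.5%, 72.1%, as in the table
```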
Tuning HPL.dat♯
- Details on the parameters in the HPL.dat file: https://www.netlib.org/benchmark/hpl/tuning.html
- Use this website to generate an appropriate HPL.dat: https://www.advancedclustering.com/act_kb/tune-hpl-dat-file/
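A brief illustration of the rule of thumb behind the generator linked above: choose the problem size N so that the N × N matrix of 8-byte doubles fills roughly 80% of the total memory across all nodes, rounded down to a multiple of the block size NB. The memory size and NB value in this sketch are placeholder assumptions; tune them for your own nodes and BLAS library.

```python
import math

# Rule of thumb: the N x N matrix of 8-byte doubles should fill ~80% of
# total RAM across all nodes, and N should be a multiple of NB.
def suggest_n(nodes, mem_gib_per_node, nb=192, mem_fraction=0.80):
    total_bytes = nodes * mem_gib_per_node * 1024**3
    n = int(math.sqrt(mem_fraction * total_bytes / 8))
    return (n // nb) * nb  # round down to a multiple of NB

# Hypothetical example: 4 nodes with 64 GiB of RAM each, NB = 192.
print(suggest_n(nodes=4, mem_gib_per_node=64))
```

The process grid dimensions P and Q should multiply to the number of MPI ranks, with P ≤ Q and the grid as close to square as possible.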
Running the Job♯
MVAPICH♯
$MPI_RUN -np $NSLOTS ./xhpl
Intel MPI♯
$MPI_RUN -np $NSLOTS ./xhpl
OpenMPI♯
$MPI_RUN ./xhpl
Other Benchmarks♯
See Also♯
References♯
[1] HPL - A Portable Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers
[2] TOP500 Supercomputer Sites
[3] How to calculate peak theoretical performance of a CPU-based HPC system
[4] Colfax Research: A Survey and Benchmarks of Intel® Xeon® Gold and Platinum Processors
[5] Microway report on Intel Skylake CPUs
[6] WikiChip