
Compiling High Performance Linpack (HPL)

The nodes referred to in this article are Proteus nodes, and Proteus has been decommissioned.

Overview

High Performance Linpack (HPL)[1] is a portable implementation of the standard Linpack benchmark, used to determine the TOP500[2] rankings. The benchmark solves "a (random) dense linear system in double precision (64 bits) arithmetic on distributed-memory computers." As it is well-tested and kept up to date, it also makes a good example case for compiling HPC software.

Here, we link to makefiles used to compile HPL on Proteus using various combinations of the available compilers, math libraries, and MPI implementations. The resulting xhpl executables were used to benchmark the performance of Proteus.

In the CPU-specific sections below, we list the module names for the compilers, math libraries, and MPI implementations used.

HPL reads the input file HPL.dat for parameters, which need to be tuned for best performance. See: https://www.netlib.org/benchmark/hpl/faqs.html

Theoretical Peak Performance and Efficiency

Theoretical peak performance is based only on CPU specifications.[3] Efficiency is the ratio of actual performance to the theoretical peak. Note that the theoretical peak depends on which clock speed is used: the AMD figures below use the base clock, while the Intel figures are calculated with the "turbo boost" clock, as noted in each section.

The formula is given by:

\(\mathrm{FLOP/s} = \mathrm{sockets} \times \frac{\mathrm{cores}}{\mathrm{socket}} \times \frac{\mathrm{cycles}}{\mathrm{second}} \times \frac{\mathrm{FLOPs}}{\mathrm{cycle}}\)

FLOP-per-cycle figures depend on the CPU microarchitecture; per-model numbers are given in the AMD and Intel sections below.

The formula for the theoretical maximum performance of Skylake CPUs is G = C × TPV × L × W × H × U × F,[4] where:

  • G is the expected performance in GFLOP/s
  • C is the number of cores per socket
  • TPV is the maximum Turbo Boost frequency with all cores loaded with AVX-512 instructions
  • L is the number of sockets
  • W is the vector width (W=16 for single precision, W=8 for double precision)
  • H is the throughput of the instruction (H=1 instruction per cycle for addition, multiplication, subtraction, and FMA)
  • U is the number of FMA units (U=2 for Platinum 81xx and Gold 61xx processors, U=1 for all others)
  • F is the number of FLOPs per instruction (F=2 for FMA, F=1 for all other instructions)

See also the Microway report on Intel Skylake CPUs.[5]

Clock speeds may also depend on hardware features being used, e.g. AVX-512. See WikiChip[6] for details for each CPU.
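As an illustration (not part of HPL or of the makefiles linked below), the following Python sketch implements both the generic peak formula and the Skylake-specific formula above; the function and parameter names are illustrative only, and the example values are taken from the CPU sections below.

# Minimal sketch: theoretical peak in GFLOP/s from the formulas above.

def peak_gflops(sockets, cores_per_socket, ghz, flop_per_cycle):
    """Generic formula: sockets x cores/socket x cycles/second x FLOP/cycle."""
    return sockets * cores_per_socket * ghz * flop_per_cycle

def skylake_peak_gflops(C, TPV, L, W=8, H=1, U=2, F=2):
    """Skylake formula G = C x TPV x L x W x H x U x F; defaults are double
    precision (W=8), FMA throughput H=1, and two FMA units (U=2) as on
    Gold 61xx / Platinum 81xx parts."""
    return C * TPV * L * W * H * U * F

# Both formulas agree for a 2-socket Xeon Gold 6148 node at the 3.70 GHz
# turbo clock (20 cores per socket, 32 DP FLOP per cycle per core):
print(peak_gflops(2, 20, 3.70, 32))      # 4736.0
print(skylake_peak_gflops(20, 3.70, 2))  # 4736.0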

AMD

Opteron 6378 Piledriver - 2.4 GHz clock speed - 16 cores per socket, 4 sockets per node

  • 64 cores, single node: 614.4 GFLOPS
  • 256 cores, 4 nodes: 2457.6 GFLOPS
  • 1024 cores, 16 nodes: 9830 GFLOPS
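Working backwards from these figures, the Opteron 6378 delivers 4 DP FLOP per cycle per core (\(614.4 / (64 \times 2.4) = 4\)), so the single-node peak follows from the formula above as \(4\ \mathrm{sockets} \times 16\ \mathrm{cores} \times 2.4\,\mathrm{GHz} \times 4\ \mathrm{FLOP/cycle} = 614.4\ \mathrm{GFLOP/s}\).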

Intel

Xeon E5-2670 Sandy Bridge - 2.6 GHz base clock speed, 3.3 GHz "turbo" clock speed (the peak numbers below are calculated with the turbo clock) - 8 cores per socket, 2 sockets per node, 8 double precision FLOP per cycle (16 FLOP per cycle in single precision).

  • 16 cores, single node: 422.4 GFLOPS DP
  • 64 cores, 4 nodes: 1690 GFLOPS DP
  • 1280 cores, 80 nodes: 33.8 TFLOPS DP
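As a check against the formula above, a single node at the 3.3 GHz turbo clock gives \(2\ \mathrm{sockets} \times 8\ \mathrm{cores} \times 3.3\,\mathrm{GHz} \times 8\ \mathrm{FLOP/cycle} = 422.4\ \mathrm{GFLOP/s}\); the multi-node figures scale linearly with the node count.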

Advice from Intel for setting parameters based on CPU: https://software.intel.com/en-us/mkl-windows-developer-guide-configuring-parameters

  • Ivy Bridge - 8 DP FLOP per cycle
  • Haswell & Broadwell - 16 DP FLOP per cycle
  • Skylake (server) - 32 DP FLOP per cycle (16 DP FLOP per cycle per AVX-512 unit; the Gold 6148 has 2 AVX-512 units)

New Intel Skylake Nodes

The new ic23 nodes have 2x Intel Xeon Gold 6148 CPUs - 2.40 GHz base clock speed, 3.70 GHz "turbo" clock speed, 20 cores per socket. Each core has two AVX-512 units, together performing 32 double precision FLOP per cycle (64 single precision FLOP per cycle):

  • 40 cores, single node: 4,736 GFLOPS DP
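Applying the Skylake formula above, and taking TPV to be the 3.70 GHz turbo clock quoted here (rather than the lower all-core AVX-512 frequency), gives \(G = 20 \times 3.70 \times 2 \times 8 \times 1 \times 2 \times 2 = 4736\ \mathrm{GFLOP/s}\), which matches the figure above.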

Performance of Your Specific Application

There are almost no general recommendations that can be made for the optimal performance of your specific computation with your specific application. Even with the same software package, different problems can behave very differently.

You must benchmark your own computation to figure out an optimal configuration (number of slots, number of slots per node, number of threads per rank, etc).

AMD CPUs

Compiler | Math Library     | MPI implementation      | Makefile                     | Speed (1-node, 64-core)   | Speed (4-node, 256-core)  | Speed (16-node, 1024-core)
open64   | acml/open64/fma4 | proteus-mvapich2/open64 | Make.open64_acml_mvapich.txt | 446.9 GFLOPS - 72.7% eff. | 1708 GFLOPS - 69.5% eff.  | 7084 GFLOPS - 72.1% eff.
open64   | acml/open64/fma4 | proteus-openmpi/open64  | Make.open64_acml_openmpi.txt | 426.4 GFLOPS - 69.4% eff. | 1003 GFLOPS† - 40.8% eff. | TBA

Intel CPUs

Compiler       | Math Library | MPI implementation     | Makefile                      | Speed (1-node, 16-core) | Speed (4-node, 64-core) | Speed (16-node, 256-core) | Speed (80-node, 1280-core)
intel/compiler | intel/mkl    | proteus-mvapich2/intel | Make.intel64_mkl_mvapich.txt  | 281.8 GFLOPS            | 954.2 GFLOPS            | TBA                       | TBA
intel/compiler | intel/mkl    | proteus-openmpi/intel  | Make.intel64_mkl_openmpi.txt  | 281.0 GFLOPS            | 992.2 GFLOPS            | 959.8 GFLOPS†             | TBA
intel/compiler | intel/mkl    | intel-mpi              | Make.intel64_mkl_intelmpi.txt | 288.6 GFLOPS            | 934.5 GFLOPS            | TBA                       | TBA

† cluster under other load
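As defined earlier, efficiency is the ratio of measured performance to theoretical peak; for example, the first AMD result above gives \(446.9 / 614.4 \approx 0.727\), i.e. 72.7% efficiency.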

Tuning HPL.dat
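The HPL FAQ linked above gives the usual starting points: choose the problem size N so that the \(8 N^2\)-byte matrix fills roughly 80% of total memory, pick a block size NB in the 32-256 range (and make N a multiple of it), and use a process grid with \(P \times Q\) equal to the number of MPI ranks, P \(\le\) Q, and the grid as close to square as possible. The Python sketch below (not an official tool; all names are illustrative) computes such starting values; the best settings still have to be found by experiment.

import math

def suggest_n(total_mem_gib, mem_fraction=0.8, nb=192):
    """Largest N (a multiple of nb) whose 8*N^2-byte matrix fits in
    mem_fraction of total node memory."""
    bytes_avail = total_mem_gib * (1024 ** 3) * mem_fraction
    n = int(math.sqrt(bytes_avail / 8))   # 8 bytes per double
    return (n // nb) * nb

def suggest_pq(nranks):
    """Process grid P x Q = nranks with P <= Q, as square as possible."""
    p = int(math.sqrt(nranks))
    while nranks % p:
        p -= 1
    return p, nranks // p

# e.g. a 64 GiB node running 64 MPI ranks:
print(suggest_n(64))    # 82752
print(suggest_pq(64))   # (8, 8)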

Running the Job

MVAPICH

$MPI_RUN -np $NSLOTS ./xhpl

Intel MPI

$MPI_RUN -np $NSLOTS ./xhpl

OpenMPI

$MPI_RUN ./xhpl

Other Benchmarks

See Also

References

[1] HPL - A Portable Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers

[2] TOP500 Supercomputer Sites

[3] How to calculate peak theoretical performance of a CPU-based HPC system

[4] Colfax Research: A Survey and Benchmarks of Intel® Xeon® Gold and Platinum Processors

[5] Microway Knowledge Center: Detailed Specifications of the “Skylake-SP” Intel Xeon Processor Scalable Family CPUs

[6] WikiChip: Intel CPU frequency behavior