Compiling NAMD
NAMD is a molecular dynamics simulation package.[1]
NOTE As of 2014-04-15, the only reliable build to run on the AMD nodes is GCC + MVAPICH2.
AMD gives some guidance on compiling NAMD using Open64: Building NAMD
Refer to the official documentation: http://www.ks.uiuc.edu/Research/namd/2.12/ug/
Running NAMD
For information on running NAMD, see: NAMD
General Outline of Build Process
- Build Charm++[2]
- Fix up NAMD makefiles
- Build NAMD
Locally-compiled version using Intel MKL
See NAMD#CPU-only
Including CUDA 6.0
CUDA 6.0 cannot be used with the Intel compilers.
GCC with IB Verbs + CUDA 6.5
NOTE do not use MPI
This is for NAMD 2.12
See the recommendations from NVIDIA: https://www.nvidia.com/en-us/data-center/gpu-accelerated-applications/namd/
Modules
NAMD 2.12 requires CUDA 6.5 for building.
The following modules should be unloaded:
cuda60/toolkit/6.0.37
cuda60/tdk/331.62
cuda60/profiler/6.0.37
cuda60/nsight/6.0.37
cuda60/blas/6.0.37
cuda60/fft/6.0.37
And load the CUDA 6.5 modules instead:
cuda65/toolkit/6.5.14
cuda65/gdk/340.29
cuda65/fft/6.5.14
cuda65/blas/6.5.14
cuda65/profiler/6.5.14
cuda65/nsight/6.5.14
cuda-driver/340.32
And also the FFTW3 module:
proteus-fftw3/gcc/64/3.3.3
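A minimal sketch of doing the swap from the command line, using the module names exactly as listed above (verify against "module avail" on your node):
module unload cuda60/toolkit/6.0.37 cuda60/tdk/331.62 cuda60/profiler/6.0.37 \
              cuda60/nsight/6.0.37 cuda60/blas/6.0.37 cuda60/fft/6.0.37
module load cuda65/toolkit/6.5.14 cuda65/gdk/340.29 cuda65/fft/6.5.14 \
            cuda65/blas/6.5.14 cuda65/profiler/6.5.14 cuda65/nsight/6.5.14 \
            cuda-driver/340.32
module load proteus-fftw3/gcc/64/3.3.3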
Charm++ 6.7.1
Build:
./build charm++ verbs-linux-x86_64 gcc smp gfortran -j16 --with-production
Test:
cd verbs-linux-x86_64-gfortran-smp-gcc/tests/charm++/megatest
make clean
make pgm
./charmrun ++ppn 16 ./pgm
NAMD
Modify the file for FFTW3 support (dynamic linking):
### arch/Linux-x86_64.fftw3
FFTDIR=$(FFTW3HOME)
FFTINCL=-I$(FFTDIR)/include
FFTLIB=-Wl,-rpath,$(FFTDIR)/lib -L$(FFTDIR)/lib -lfftw3f
FFTFLAGS=-DNAMD_FFTW -DNAMD_FFTW_3
FFT=$(FFTINCL) $(FFTFLAGS)
Modify the file for Tcl support:
### arch/Linux-x86_64.tcl
TCLDIR=
TCLINCL=
TCLLIB=-ltcl8.5 -ldl -lpthread
TCLFLAGS=-DNAMD_TCL
TCL=$(TCLINCL) $(TCLFLAGS)
Modify the file for CUDA support - the K40Xm devices are compute capability 3.5:
### arch/Linux-x86_64.cuda
CUDADIR=$(CUDA_PATH)
CUDAINCL=-I$(CUDADIR)/include
CUBDIR=.rootdir/cub
CUBINCL=-I$(CUBDIR)
CUDALIB=-L$(CUDADIR)/lib64 -lcufft_static -lculibos -lcudart_static -lrt
CUDASODIR=$(CUDADIR)/lib64
LIBCUDARTSO=
CUDAFLAGS=-DNAMD_CUDA
CUDAOBJS=$(CUDAOBJSRAWSTATIC)
CUDA=$(CUDAFLAGS) -I. $(CUDAINCL) $(CUBINCL)
CUDAGENCODE=-gencode arch=compute_35,code=sm_35 -gencode arch=compute_35,code=compute_35
CUDACC=$(CUDADIR)/bin/nvcc -O3 --maxrregcount 32 $(CUDAGENCODE) -Xcompiler "-m64" $(CUDA)
Create the makefile for building NAMD:
### arch/Linux-x86_64-verbs-g++.arch
NAMD_ARCH = Linux-x86_64
# this matches the directory name produced by the Charm++ build
CHARMARCH = verbs-linux-x86_64-gfortran-smp-gcc
CXX = g++ -m64 -std=c++0x -O3 -march=native
CXXOPTS = -fexpensive-optimizations -ffast-math
CC = gcc -m64 -O3 -march=native
COPTS = -fexpensive-optimizations -ffast-math
Configure:
./config Linux-x86_64-verbs-g++ --with-fftw3 --with-cuda --cuda-prefix $CUDA_PATH
Build:
cd Linux-x86_64-verbs-g++
make -j 24 >& Make.out
Test: see the NAMD article for an example job script; a rough sketch also follows below.
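This sketch assumes a hypothetical project, queue, and slot count (2 GPU nodes with 16 cores and 2 GPUs each); adapt it to your allocation. It launches the verbs (non-MPI) binary through the system launcher via charmrun ++mpiexec, per the developer note further down in this article.
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -j y
#$ -P fixmePrj
#$ -M fixme@drexel.edu
#$ -pe mvapich 32
#$ -q gpu.q
#$ -l h_rt=0:30:00
#$ -l h_vmem=2G
. /etc/profile
module load shared proteus sge/univa gcc proteus-mvapich2/gcc cuda65/toolkit proteus-fftw3/gcc/64/3.3.3
# ++mpiexec makes charmrun use the system mpiexec only as the launcher;
# the binary itself still communicates over IB verbs, not MPI.
# ++ppn 7 leaves one core per process for the communication thread,
# giving 4 processes (one per GPU) on 32 cores.
./charmrun ++mpiexec ++p 28 ++ppn 7 ./namd2 +setcpuaffinity input.namd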
GCC, with MVAPICH2
Summary:
- gcc
- mvapich2
- CUDA
Environment
Currently Loaded Modulefiles:
1) shared 3) gcc/4.8.1 5) sge/univa
2) proteus 4) proteus-mvapich2/gcc/64/1.9-mlnx-ofed 6) cuda65/toolkit
The base directory of the source tree is NAMD_2.9_Source.
For convenience, set this environment variable:
$ export NAMD_SOURCE=..../NAMD_2.9_Source
charm 6.4.0
Expand:
cd NAMD_2.9_Source
tar xf charm-6.4.0.tar
cd charm-6.4.0
Build charm:
./build charm++ mpi-linux-x86_64 mpicxx gfortran --with-production -DCMK_OPTIMIZE=1 |tee build_charm.out
Test:
cd mpi-linux-x86_64-gfortran-mpicxx/tests/charm++/megatest
make clean
make pgm
mpirun -n 16 ./pgm
### alternatively
# ./charmrun -ppn 16 ./pgm
Test in the cluster -- should take under 5 minutes:
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -j y
#$ -P fixmePrj
#$ -M fixme@drexel.edu
#$ -pe mvapich 128
#$ -q all.q@@intelhosts
#$ -l h_rt=0:30:00
#$ -l h_vmem=1G
. /etc/profile
module load shared
module load sge/univa
module load proteus
module load gcc
module load proteus-mvapich2/gcc
module list
cat $PE_HOSTFILE
which mpirun
# specifying "-rmk sge" tells mpirun that it is running under the Grid Engine environment
# add "-verbose" for messages about MPI
#mpirun -rmk sge ./pgm
# however, mpirun should automatically detect the Grid Engine environment
mpirun ./pgm
Output should look something like:
Currently Loaded Modulefiles:
1) shared
2) proteus
3) gcc/4.8.1
4) sge/univa
5) proteus-mvapich2/gcc/64/1.9-mlnx-ofed
ic04n01.cm.cluster 16 gpu.q@ic04n01.cm.cluster 0,0:0,1:0,2:0,3:0,4:0,5:0,6:0,7:1,0:1,1:1,2:1,3:1,4:1,5:1,6:1,7
ic04n02.cm.cluster 16 gpu.q@ic04n02.cm.cluster 0,0:0,1:0,2:0,3:0,4:0,5:0,6:0,7:1,0:1,1:1,2:1,3:1,4:1,5:1,6:1,7
ic04n03.cm.cluster 16 gpu.q@ic04n03.cm.cluster 0,0:0,1:0,2:0,3:0,4:0,5:0,6:0,7:1,0:1,1:1,2:1,3:1,4:1,5:1,6:1,7
ic04n04.cm.cluster 16 gpu.q@ic04n04.cm.cluster 0,0:0,1:0,2:0,3:0,4:0,5:0,6:0,7:1,0:1,1:1,2:1,3:1,4:1,5:1,6:1,7
ic05n01.cm.cluster 16 gpu.q@ic05n01.cm.cluster 0,0:0,1:0,2:0,3:0,4:0,5:0,6:0,7:1,0:1,1:1,2:1,3:1,4:1,5:1,6:1,7
ic05n02.cm.cluster 16 gpu.q@ic05n02.cm.cluster 0,0:0,1:0,2:0,3:0,4:0,5:0,6:0,7:1,0:1,1:1,2:1,3:1,4:1,5:1,6:1,7
ic05n03.cm.cluster 16 gpu.q@ic05n03.cm.cluster 0,0:0,1:0,2:0,3:0,4:0,5:0,6:0,7:1,0:1,1:1,2:1,3:1,4:1,5:1,6:1,7
ic05n04.cm.cluster 16 gpu.q@ic05n04.cm.cluster 0,0:0,1:0,2:0,3:0,4:0,5:0,6:0,7:1,0:1,1:1,2:1,3:1,4:1,5:1,6:1,7
/mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/bin/mpirun
Charm++> Running on MPI version: 3.0
Charm++> level of thread support used: MPI_THREAD_SINGLE (desired: MPI_THREAD_SINGLE)
Charm++> Running on non-SMP mode
Converse/Charm++ Commit ID: v6.4.0-beta1-0-g5776d21
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 8 unique compute nodes (16-way SMP).
Charm++> cpu topology info is gathered in 0.630 seconds.
Megatest is running on 128 nodes 128 processors.
test 0: initiated [completion_test (phil)]
Starting test
Created detector, starting first detection
Started first test
Finished second test
Started third test
test 0: completed (18.61 sec)
test 1: initiated [inlineem (phil)]
test 1: completed (0.10 sec)
test 2: initiated [callback (olawlor)]
test 2: completed (11.91 sec)
test 3: initiated [immediatering (gengbin)]
...
test 52: initiated [multi groupring (milind)]
test 52: completed (0.08 sec)
test 53: initiated [all-at-once]
Starting test
Created detector, starting first detection
Started first test
Finished second test
Started third test
test 53: completed (0.40 sec)
All tests completed, exiting
End of program
NAMD
cd back to the top of the source distribution. Next, modify and/or create several files in the subdirectory named "arch".
Modify file for linking to FFTW2 single precision -- arch/Linux-x86_64.fftw:
### arch/Linux-x86_64.fftw
### static linking
FFTDIR=$(FFTW2HOME)
FFTINCL=-I$(FFTDIR)/include
FFTLIB=$(FFTDIR)/lib/libsrfftw.a $(FFTDIR)/lib/libsfftw.a
FFTFLAGS=-DNAMD_FFTW
FFT=$(FFTINCL) $(FFTFLAGS)
Modify file for linking to Tcl 8.5 -- arch/Linux-x86_64.tcl:
### arch/Linux-x86_64.tcl
TCLDIR=
TCLINCL=
TCLLIB=-ltcl8.5 -ldl -lpthread
TCLFLAGS=-DNAMD_TCL
TCL=$(TCLINCL) $(TCLFLAGS)
Create file for linking with CUDA -- arch/Linux-x86_64.cuda:
CUDADIR=$(CUDA_PATH)
CUDAINCL=-I$(CUDADIR)/include
CUDALIB=-L$(CUDADIR)/lib64 -lcudart
CUDASODIR=$(CUDADIR)/lib64
LIBCUDARTSO=libcudart.so
CUDAFLAGS=-DNAMD_CUDA
CUDAOBJS=$(CUDAOBJSRAW)
CUDA=$(CUDAFLAGS) -I. $(CUDAINCL)
CUDACC=$(CUDADIR)/bin/nvcc -O3 --maxrregcount 32 -gencode arch=compute_35,code=sm_35 -Xcompiler "-m64" $(CUDA)
Create file for building NAMD -- arch/Linux-x86_64-MVAPICH2-gcc.arch:
### arch/Linux-x86_64-MVAPICH2-gcc.arch
NAMD_ARCH = Linux-x86_64
# CHARMARCH matches the string from the charm-6.4.0 build above
CHARMARCH = mpi-linux-x86_64-gfortran-mpicxx
# NB: AMD: -march=bdver2
# Intel: -march=core-avx-i
# AMD & Intel: -msse4.2 -mfpmath=sse -mavx
FLOATOPTS= -O3 -msse4.2 -mfpmath=sse -mavx
#FLOATOPTS= -O3 -march=bdver2
#FLOATOPTS= -O3 -march=core-avx-i
CXX = g++ -malign-double -fPIC -DSOCKLEN_T=socklen_t -I$(CHARM_LOC)/include
CXXOPTS = $(FLOATOPTS)
CXXNOALIASOPTS = $(FLOATOPTS)
CC = gcc -malign-double -fPIC
COPTS = $(FLOATOPTS)
Configure using the name of the newly-created arch file:
cd $NAMD_SOURCE
./config Linux-x86_64-MVAPICH2-gcc
CUDA version:
./config Linux-x86_64-MVAPICH2-gcc-cuda --with-cuda --cuda-prefix $CUDA_PATH
Build:
cd Linux-x86_64-MVAPICH2-gcc
make -j 16 >& Make.out &
Test
Use the apoa1 benchmark[3] -- download the .tar.gz file.
Modify the input file apoa1.namd, adding your username to the output file name:
outputname /usr/tmp/apoa1-out-my_user_name
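A sketch of preparing the benchmark from the command line (tarball and directory names are assumptions; download the .tar.gz from the NAMD benchmark page first):
tar xzf apoa1.tar.gz
cd apoa1
# append your username to the output prefix in the input file
sed -i 's|^outputname .*|outputname /usr/tmp/apoa1-out-my_user_name|' apoa1.namd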
Use this job script (modify appropriately) -- shouldn't take more than 5 minutes:
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -j y
#$ -P fixmePrj
#$ -M fixme@drexel.edu
#$ -pe mvapich 128
#$ -l h_rt=01:00:00
#$ -l h_vmem=2G
#$ -l m_mem_free=2G
#$ -q all.q@@amdhosts
. /etc/profile
module load shared
module load proteus
module load sge/univa
module load gcc
module load proteus-mvapich2/gcc/64
module load proteus-fftw2/gcc/64/float
mpirun ./namd2 apoa1.namd
Benchmark result:
Info: Benchmark time: 128 CPUs 0.0259024 s/step 0.299796 days/ns 703.594 MB memory
Open64, with MVAPICH2
NOTE Open64 compilation does not seem to work for MPI no matter which implementation is used. Seems to work OK for shared memory parallel, i.e. single-node multi-core.
Modules:
1) shared 3) sge/univa 5) proteus-openmpi/open64/64/1.6.5-mlnx-ofed
2) proteus 4) open64/4.5.2.1
Environment:
CC = opencc
CXX = openCC
F77 = openf90
## CFLAGS = -O3 -mso -m64 -march=bdver2 -DCMK_OPTIMIZE=1
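A minimal sketch of exporting these before the Charm++ build (the CFLAGS line mirrors the commented-out value above and is optional):
export CC=opencc
export CXX=openCC
export F77=openf90
# export CFLAGS="-O3 -mso -m64 -march=bdver2 -DCMK_OPTIMIZE=1"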
charm-6.4.0
Expand:
cd NAMD_2.9_Source
tar xf charm-6.4.0.tar
cd charm-6.4.0
Build:
### superseded: ./build charm++ mpi-linux-x86_64 mpicxx smp -j16 --with-production -DCMK_OPTIMIZE=1
./build charm++ mpi-linux-x86_64 mpicxx -j16 --with-production -DCMK_OPTIMIZE=1
Test:
cd charm-6.4.0/mpi-linux-x86_64-smp-mpicxx/tests/charm++/megatest
make clean
make pgm
mpirun -n 16 ./pgm
### alternatively
# ./charmrun -ppn 16 ./pgm
Test in the cluster. Create a job script in the megatest directory and submit it:
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -j y
#$ -P fixmePrj
#$ -M fixme@drexel.edu
#$ -pe mvapich 128
#$ -q all.q@@amdhosts
#$ -l h_rt=2:00:00
#$ -l h_vmem=3G
. /etc/profile
module load shared
module load sge/univa
module load proteus
module load open64
module load proteus-mvapich2/open64
module list
cat $PE_HOSTFILE
which mpirun
mpirun -n $NSLOTS ./pgm
### alternatively
# ./charmrun -ppn $NSLOTS ./pgm
NAMD
Modify file for linking to FFTW2 -- arch/Linux-x86_64.fftw:
### arch/Linux-x86_64.fftw
### static linking
FFTDIR=$(FFTW2HOME)
FFTINCL=-I$(FFTDIR)/include
FFTLIB=$(FFTDIR)/lib/libsrfftw.a $(FFTDIR)/lib/libsfftw.a
FFTFLAGS=-DNAMD_FFTW
FFT=$(FFTINCL) $(FFTFLAGS)
Modify file for linking to Tcl 8.5 -- arch/Linux-x86_64.tcl:
### arch/Linux-x86_64.tcl
TCLDIR=
TCLINCL=
TCLLIB=-ltcl8.5 -ldl -lpthread
TCLFLAGS=-DNAMD_TCL
TCL=$(TCLINCL) $(TCLFLAGS)
Create file for building NAMD -- arch/Linux-x86_64-MPI-open64.arch:
### arch/Linux-x86_64-MPI-open64.arch
NAMD_ARCH = Linux-x86_64
# CHARMARCH matches the string from the charm-6.4.0 build above
CHARMARCH = mpi-linux-x86_64-smp-mpicxx
FLOATOPTS= -O3 -mso -march=bdver2
CXX = openCC -m64 -fPIC -DSOCKLEN_T=socklen_t -I$(CHARM_LOC)/include
CXXOPTS = $(FLOATOPTS)
CXXNOALIASOPTS = $(FLOATOPTS)
CC = opencc -m64 -fPIC
COPTS = $(FLOATOPTS)
Configure using the name of the newly-created arch file:
cd $NAMD_SOURCE
./config Linux-x86_64-MPI-open64
Build:
cd Linux-x86_64-MPI-open64
make -j 16 >& Make.out &
Open64, with OpenMPI
... TBA ...
General note on using CUDA
From Jim Phillips, Sr. Research Programmer for NAMD:
CUDA + MPI is slow with or without SMP, which is why we put effort into
network-specific Charm++ machine layers like ibverbs. The ibverbs port
can be launched with "charmrun ++mpiexec", which will use the system mpiexec
internally while allowing us to distribute portable binaries that don't
require extra scripting to adapt to a queueing system.
Their recommendation is to use a pre-compiled release "smp-ibverbs-CUDA".
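A hedged sketch of such a launch with a verbs build (flag values are illustrative; ++remote-shell is only needed if the launcher is not called mpiexec):
./charmrun ++mpiexec ++p $NSLOTS ./namd2 input.namd
# ./charmrun ++mpiexec ++remote-shell mpirun ++p $NSLOTS ./namd2 input.namd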
NVIDIA notes on using CUDA
NVIDIA has some recommended runtime options.[4] These notes are reproduced here:
General command line to run NAMD on a single-node system:
namd2 {namdOpts} {inputFile}
On a multi-node system NAMD has to be run with charmrun as specified below:
charmrun {charmOpts} namd2 {namdOpts} {inputFile}
{charmOpts}
- ++nodelist {nodeListFile} - multi-node runs require a list of nodes. Charm++ also supports an alternative ++mpiexec option if you're using a queueing system that mpiexec is set up to recognize.
- ++p $totalPes - specifies the total number of PE threads. This is the total number of Worker Threads (aka PE threads). We recommend this to be equal to (#TotalCPUCores - #TotalGPUs).
- ++ppn $pesPerProcess - number of PEs per process. We recommend to set this to #ofCoresPerNode/#ofGPUsPerNode - 1. This is necessary to free one of the threads per process for communication; make sure to specify +commap below. The total number of processes is equal to $totalPes/$pesPerProcess. When using the recommended value for this option, each process will use a single GPU.
{namdOpts}
- NAMD will inherit '++p' and '++ppn' as '+p' and '+ppn' if set in {charmOpts}.
- Otherwise, for the multi-core build use '+p' to set the number of cores.
- It is recommended to have no more than one process per GPU in a multi-node run. To get more communication threads, it is recommended to launch exactly one process per GPU. For single-node runs it is fine to use multiple GPUs per process.
- CPU affinity options (see user guide[5]):
  '+setcpuaffinity' - keeps threads from moving about
  '+pemap #-#' - maps computational threads to CPU cores
  '+commap #-#' - sets the range for communication threads
  Example for a dual-socket configuration with 16 cores per socket:
  +setcpuaffinity +pemap 1-15,17-31 +commap 0,16
- GPU options (see user guide[6]):
  '+devices {CUDA IDs}' - optionally specify device IDs to use in NAMD
  If devices are not in socket order, it might be useful to set this option to ensure that sockets use their directly-attached GPUs, for example, '+devices 2,3,0,1'
We recommend to always check the startup messages in NAMD to make sure the options are set correctly. Additionally, the ++verbose option can provide more detailed output for runs that use charmrun. Running top or other system tools can help you make sure you're getting the requested thread mapping.
{inputFile}
- Use the corresponding *.namd input file from one of the datasets in the next sub-section.
Example 1. Run ApoA1 on 1 node with 2xGPU and 2xCPU (20 cores total) and multi-core NAMD build:
./namd2 +p 20 +devices 0,1 apoa1.namd
Example 2. Run STMV on 2 nodes, each node with 2xGPU and 2xCPU (20 cores) and SMP NAMD build (note that we launch 4 processes, each controlling 1 GPU):
charmrun ++p 36 ++ppn 9 ./namd2 ++nodelist $NAMD_NODELIST +setcpuaffinity +pemap 1-9,11-19 +commap 0,10 +devices 0,1 stmv.namd
Note that by default, the "rsh" command is used to start namd2 on each node specified in the nodelist file. You can change this via the CONV_RSH environment variable, i.e., to use ssh instead of rsh run "export CONV_RSH=ssh" (see NAMD release notes for details).
Proteus GPU nodes
For Proteus, running NAMD on all 8 GPU nodes, the nodelist file should be:
group main
host gpu01
host gpu01
host gpu02
host gpu02
host gpu03
host gpu03
host gpu04
host gpu04
host gpu05
host gpu05
host gpu06
host gpu06
host gpu07
host gpu07
host gpu08
host gpu08
And the commandline to run is:
charmrun ++nodelist ./nodelist ++p 16 ++ppn 2 ${NAMD2EXE} +setcpuaffinity +pemap 1-7,9-15 +commap 0,8 inputfile.namd
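Rather than hard-coding hostnames, a sketch of generating this nodelist inside a job script from the Grid Engine hostfile (two host lines per node, matching the two GPUs per node as above; the generated names include the domain, which is fine):
echo "group main" > nodelist
awk '{print "host " $1; print "host " $1}' $PE_HOSTFILE >> nodelist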
Benchmarks
The HPC Advisory Council produced a presentation in 2011 comparing NAMD performance when using various implementations of MPI and with various network fabrics. They ran on Dell servers with AMD Opteron 6174 "Magny-Cours" CPUs.[7] The benchmarking job is a standard one (ApoA1) provided by the NAMD authors.[8]
When running apoa1, modify the input file apoa1.namd, adding your username:
outputname /usr/tmp/apoa1-out-my_user_name
You may also want to do a longer run to minimize the effect of startup latency.
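One way to do that (a sketch, assuming the stock apoa1.namd ends with a numsteps line; the value is illustrative):
sed -i 's/^numsteps .*/numsteps 5000/' apoa1.namd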
EXPERIMENTAL - ibverbs smp Intel
./build charm++ net-linux-x86_64 icc ibverbs smp ifort -j16 --with-production
Example arch file for Open64 from C. Abrams
# save this as Linux-x86_64-open64.arch in the arch subdirectory
NAMD_ARCH = Linux-x86_64
CHARMARCH = multicore-linux64
CXX = openCC
CXXOPTS = -O3 -ffast-math
CC = opencc
COPTS = -O3 -ffast-math
Update 2019-01-17 - New Xeon Gold 6148 Nodes on RHEL 6.8
- Reference: http://www.ks.uiuc.edu/Research/namd/2.13/notes.html
- Use intel/composerxe/2019u1 + devtoolset-6 environment
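A sketch of setting up that environment on the RHEL 6.8 nodes (the module name is taken from the bullet above; the devtoolset path is the standard Software Collections location, so verify it exists on the node):
module load intel/composerxe/2019u1
source /opt/rh/devtoolset-6/enable   # puts the devtoolset-6 gcc/g++ first in PATH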
SMP vs Threads
Since one core per process is used for the communication thread, SMP builds are typically slower than non-SMP builds. The advantage of SMP builds is that many data structures are shared among the threads, reducing the per-core memory footprint when scaling large simulations to large numbers of cores.
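As an illustration, a hedged sketch of the launch difference on two hypothetical 16-core nodes (ibverbs-style build; counts are illustrative):
# non-SMP build: every core runs a worker PE
charmrun ++nodelist ./nodelist ++p 32 ./namd2 input.namd
# SMP build: one thread per process is reserved for communication, so with
# 4 processes only 28 worker PEs fit on the same 32 cores
charmrun ++nodelist ./nodelist ++p 28 ++ppn 7 ./namd2 +setcpuaffinity input.namd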
Charm++
Just run ./build and answer the questions. Once you have decided on all the answers, you can do it all on a single line:
./build charm++ verbs-linux-x86_64 icc ifort -j32 -O3 -xHost
NAMD
You will need to fix the .mkl and Linux-x86_64-icc.arch files if you want a static build. Check the Link Line Advisor for specifics. Also, Charm++ would have to have been built with iccstatic. So, just do dynamic linking, and remove the "-static-intel" option in the "-icc.arch" file.
FFTW3 -- copy the MKL file to use the MKL FFTW3 interface. NOTE: this may not be necessary, since the configure step below uses "--with-mkl":
mv Linux-x86_64.fftw3 Linux-x86_64.fftw3.orig
cp Linux-x86_64.mkl Linux-x86_64.fftw3
Linux-x86_64-icc.arch:
NAMD_ARCH = Linux-x86_64
### the CHARMARCH here must match the Charm++ build in the section above
CHARMARCH = verbs-linux-x86_64-ifort-icc
FLOATOPTS = -ip -axAVX
CXX = icpc -std=c++11
CXXOPTS = -O2 $(FLOATOPTS) -xHost
CXXNOALIASOPTS = -O2 -fno-alias $(FLOATOPTS) -xHost
CXXCOLVAROPTS = -O2 -ip -xHost
CC = icc
COPTS = -O2 $(FLOATOPTS) -xHost
Configure:
./config Linux-x86_64-icc --with-mkl
Make:
cd Linux-x86_64-icc
make -j 32
See Also
- NAMD 2.9 Release Notes (includes compilation instructions)
- Compiling Quick Start Guide
- http://levlafayette.com/node/26
References
[2] Charm++ - Parallel Programming with Migratable Objects (UIUC Parallel Programming Laboratory)
[3] NAMD ApoA1 benchmark results and input files
[4] GPU-Accelerated NAMD - NAMD Running Instructions
[5] NAMD User Guide - Running NAMD - CPU Affinity