Compiling NAMD
NAMD is a molecular dynamics simulation package.[1]
NOTE As of 2014-04-15, the only reliable build to run on the AMD nodes is GCC + MVAPICH2.
AMD gives some guidance on compiling NAMD using Open64: Building NAMD
Refer to the official documentation: http://www.ks.uiuc.edu/Research/namd/2.12/ug/
Running NAMD
For information on running NAMD, see: NAMD
General Outline of Build Process
- Build Charm++[2]
- Fix up NAMD makefiles
- Build NAMD
Locally-compiled version using Intel MKL
See NAMD#CPU-only
Including CUDA 6.0
CUDA 6.0 cannot be used with the Intel compilers.
GCC with IB Verbs + CUDA 6.5
NOTE do not use MPI
This is for NAMD 2.12
See the recommendations from NVIDIA: https://www.nvidia.com/en-us/data-center/gpu-accelerated-applications/namd/
Modules
NAMD 2.12 requires CUDA 6.5 for building.
The following modules should be unloaded:
cuda60/toolkit/6.0.37
cuda60/tdk/331.62
cuda60/profiler/6.0.37
cuda60/nsight/6.0.37
cuda60/blas/6.0.37
cuda60/fft/6.0.37
And load the CUDA 6.5 modules instead:
cuda65/toolkit/6.5.14
cuda65/gdk/340.29
cuda65/fft/6.5.14
cuda65/blas/6.5.14
cuda65/profiler/6.5.14
cuda65/nsight/6.5.14
cuda-driver/340.32
And also the FFTW3 module:
proteus-fftw3/gcc/64/3.3.3
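A minimal sketch of doing the swap from the command line, using the module names exactly as listed above (verify against "module avail" on your node):
module unload cuda60/toolkit/6.0.37 cuda60/tdk/331.62 cuda60/profiler/6.0.37 \
              cuda60/nsight/6.0.37 cuda60/blas/6.0.37 cuda60/fft/6.0.37
module load cuda65/toolkit/6.5.14 cuda65/gdk/340.29 cuda65/fft/6.5.14 \
            cuda65/blas/6.5.14 cuda65/profiler/6.5.14 cuda65/nsight/6.5.14 \
            cuda-driver/340.32
module load proteus-fftw3/gcc/64/3.3.3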
Charm++ 6.7.1
Build:
./build charm++ verbs-linux-x86_64 gcc smp gfortran -j16 --with-production
Test:
cd verbs-linux-x86_64-gfortran-smp-gcc/tests/charm++/megatest
make clean
make pgm
./charmrun ++ppn 16 ./pgm
NAMD
Modify the file for FFTW3 support (dynamic linking):
### arch/Linux-x86_64.fftw3
FFTDIR=$(FFTW3HOME)
FFTINCL=-I$(FFTDIR)/include
FFTLIB=-Wl,-rpath,$(FFTDIR)/lib -L$(FFTDIR)/lib -lfftw3f
FFTFLAGS=-DNAMD_FFTW -DNAMD_FFTW_3
FFT=$(FFTINCL) $(FFTFLAGS)
Modify the file for Tcl support:
### arch/Linux-x86_64.tcl
TCLDIR=
TCLINCL=
TCLLIB=-ltcl8.5 -ldl -lpthread
TCLFLAGS=-DNAMD_TCL
TCL=$(TCLINCL) $(TCLFLAGS)
Modify the file for CUDA support - the K40Xm devices are compute capability 3.5:
### arch/Linux-x86_64.cuda
CUDADIR=$(CUDA_PATH)
CUDAINCL=-I$(CUDADIR)/include
CUBDIR=.rootdir/cub
CUBINCL=-I$(CUBDIR)
CUDALIB=-L$(CUDADIR)/lib64 -lcufft_static -lculibos -lcudart_static -lrt
CUDASODIR=$(CUDADIR)/lib64
LIBCUDARTSO=
CUDAFLAGS=-DNAMD_CUDA
CUDAOBJS=$(CUDAOBJSRAWSTATIC)
CUDA=$(CUDAFLAGS) -I. $(CUDAINCL) $(CUBINCL)
CUDAGENCODE=-gencode arch=compute_35,code=sm_35 -gencode arch=compute_35,code=compute_35
CUDACC=$(CUDADIR)/bin/nvcc -O3 --maxrregcount 32 $(CUDAGENCODE) -Xcompiler "-m64" $(CUDA)
Create the makefile for building NAMD:
### arch/Linux-x86_64-verbs-g++.arch
NAMD_ARCH = Linux-x86_64
# this matches the directory name produced by the Charm++ build
CHARMARCH = verbs-linux-x86_64-gfortran-smp-gcc
CXX = g++ -m64 -std=c++0x -O3 -march=native
CXXOPTS = -fexpensive-optimizations -ffast-math
CC = gcc -m64 -O3 -march=native
COPTS = -fexpensive-optimizations -ffast-math
Configure:
./config Linux-x86_64-verbs-g++ --with-fftw3 --with-cuda --cuda-prefix $CUDA_PATH
Build:
cd Linux-x86_64-verbs-g++
make -j 24 >& Make.out
Test: see the NAMD article for an example job script; a rough sketch also follows below.
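This sketch assumes a hypothetical project, queue, and slot count (2 GPU nodes with 16 cores and 2 GPUs each); adapt it to your allocation. It launches the verbs (non-MPI) binary through the system launcher via charmrun ++mpiexec, per the developer note further down in this article.
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -j y
#$ -P fixmePrj
#$ -M fixme@drexel.edu
#$ -pe mvapich 32
#$ -q gpu.q
#$ -l h_rt=0:30:00
#$ -l h_vmem=2G
. /etc/profile
module load shared proteus sge/univa gcc proteus-mvapich2/gcc cuda65/toolkit proteus-fftw3/gcc/64/3.3.3
# ++mpiexec makes charmrun use the system mpiexec only as the launcher;
# the binary itself still communicates over IB verbs, not MPI.
# ++ppn 7 leaves one core per process for the communication thread,
# giving 4 processes (one per GPU) on 32 cores.
./charmrun ++mpiexec ++p 28 ++ppn 7 ./namd2 +setcpuaffinity input.namd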
GCC, with MVAPICH2
Summary:
- gcc
- mvapich2
- CUDA
Environment
Currently Loaded Modulefiles:
1) shared 3) gcc/4.8.1 5) sge/univa
2) proteus 4) proteus-mvapich2/gcc/64/1.9-mlnx-ofed 6) cuda65/toolkit
The base directory of the source tree is NAMD_2.9_Source.
For convenience, set this environment variable:
$ export NAMD_SOURCE=..../NAMD_2.9_Source
charm 6.4.0
Expand:
cd NAMD_2.9_Source
tar xf charm-6.4.0.tar
cd charm-6.4.0
Build charm:
./build charm++ mpi-linux-x86_64 mpicxx gfortran --with-production -DCMK_OPTIMIZE=1 |tee build_charm.out
Test:
cd mpi-linux-x86_64-gfortran-mpicxx/tests/charm++/megatest
make clean
make pgm
mpirun -n 16 ./pgm
### alternatively
# ./charmrun -ppn 16 ./pgm
Test in the cluster -- should take under 5 minutes:
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -j y
#$ -P fixmePrj
#$ -M fixme@drexel.edu
#$ -pe mvapich 128
#$ -q all.q@@intelhosts
#$ -l h_rt=0:30:00
#$ -l h_vmem=1G
. /etc/profile
module load shared
module load sge/univa
module load proteus
module load gcc
module load proteus-mvapich2/gcc
module list
cat $PE_HOSTFILE
which mpirun
# specifying "-rmk sge" tells mpirun that it is running under the Grid Engine environment
# add "-verbose" for messages about MPI
#mpirun -rmk sge ./pgm
# however, mpirun should automatically detect the Grid Engine environment
mpirun ./pgm
Output should look something like:
Currently Loaded Modulefiles:
1) shared
2) proteus
3) gcc/4.8.1
4) sge/univa
5) proteus-mvapich2/gcc/64/1.9-mlnx-ofed
ic04n01.cm.cluster 16 gpu.q@ic04n01.cm.cluster 0,0:0,1:0,2:0,3:0,4:0,5:0,6:0,7:1,0:1,1:1,2:1,3:1,4:1,5:1,6:1,7
ic04n02.cm.cluster 16 gpu.q@ic04n02.cm.cluster 0,0:0,1:0,2:0,3:0,4:0,5:0,6:0,7:1,0:1,1:1,2:1,3:1,4:1,5:1,6:1,7
ic04n03.cm.cluster 16 gpu.q@ic04n03.cm.cluster 0,0:0,1:0,2:0,3:0,4:0,5:0,6:0,7:1,0:1,1:1,2:1,3:1,4:1,5:1,6:1,7
ic04n04.cm.cluster 16 gpu.q@ic04n04.cm.cluster 0,0:0,1:0,2:0,3:0,4:0,5:0,6:0,7:1,0:1,1:1,2:1,3:1,4:1,5:1,6:1,7
ic05n01.cm.cluster 16 gpu.q@ic05n01.cm.cluster 0,0:0,1:0,2:0,3:0,4:0,5:0,6:0,7:1,0:1,1:1,2:1,3:1,4:1,5:1,6:1,7
ic05n02.cm.cluster 16 gpu.q@ic05n02.cm.cluster 0,0:0,1:0,2:0,3:0,4:0,5:0,6:0,7:1,0:1,1:1,2:1,3:1,4:1,5:1,6:1,7
ic05n03.cm.cluster 16 gpu.q@ic05n03.cm.cluster 0,0:0,1:0,2:0,3:0,4:0,5:0,6:0,7:1,0:1,1:1,2:1,3:1,4:1,5:1,6:1,7
ic05n04.cm.cluster 16 gpu.q@ic05n04.cm.cluster 0,0:0,1:0,2:0,3:0,4:0,5:0,6:0,7:1,0:1,1:1,2:1,3:1,4:1,5:1,6:1,7
/mnt/HA/opt/mvapich2/gcc/64/1.9-mlnx-ofed/bin/mpirun
Charm++> Running on MPI version: 3.0
Charm++> level of thread support used: MPI_THREAD_SINGLE (desired: MPI_THREAD_SINGLE)
Charm++> Running on non-SMP mode
Converse/Charm++ Commit ID: v6.4.0-beta1-0-g5776d21
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 8 unique compute nodes (16-way SMP).
Charm++> cpu topology info is gathered in 0.630 seconds.
Megatest is running on 128 nodes 128 processors.
test 0: initiated [completion_test (phil)]
Starting test
Created detector, starting first detection
Started first test
Finished second test
Started third test
test 0: completed (18.61 sec)
test 1: initiated [inlineem (phil)]
test 1: completed (0.10 sec)
test 2: initiated [callback (olawlor)]
test 2: completed (11.91 sec)
test 3: initiated [immediatering (gengbin)]
...
test 52: initiated [multi groupring (milind)]
test 52: completed (0.08 sec)
test 53: initiated [all-at-once]
Starting test
Created detector, starting first detection
Started first test
Finished second test
Started third test
test 53: completed (0.40 sec)
All tests completed, exiting
End of program
NAMD
cd back to the top of the source distribution. Next, modify and/or create several files in the subdirectory named "arch".
Modify file for linking to FFTW2 single precision -- arch/Linux-x86_64.fftw:
### arch/Linux-x86_64.fftw
### static linking
FFTDIR=$(FFTW2HOME)
FFTINCL=-I$(FFTDIR)/include
FFTLIB=$(FFTDIR)/lib/libsrfftw.a $(FFTDIR)/lib/libsfftw.a
FFTFLAGS=-DNAMD_FFTW
FFT=$(FFTINCL) $(FFTFLAGS)
Modify file for linking to Tcl 8.5 -- arch/Linux-x86_64.tcl:
### arch/Linux-x86_64.tcl
TCLDIR=
TCLINCL=
TCLLIB=-ltcl8.5 -ldl -lpthread
TCLFLAGS=-DNAMD_TCL
TCL=$(TCLINCL) $(TCLFLAGS)
Create file for linking with CUDA -- arch/Linux-x86_64.cuda:
CUDADIR=$(CUDA_PATH)
CUDAINCL=-I$(CUDADIR)/include
CUDALIB=-L$(CUDADIR)/lib64 -lcudart
CUDASODIR=$(CUDADIR)/lib64
LIBCUDARTSO=libcudart.so
CUDAFLAGS=-DNAMD_CUDA
CUDAOBJS=$(CUDAOBJSRAW)
CUDA=$(CUDAFLAGS) -I. $(CUDAINCL)
CUDACC=$(CUDADIR)/bin/nvcc -O3 --maxrregcount 32 -gencode arch=compute_35,code=sm_35 -Xcompiler "-m64" $(CUDA)
Create file for building NAMD -- arch/Linux-x86_64-MVAPICH2-gcc.arch:
### arch/Linux-x86_64-MVAPICH2-gcc.arch
NAMD_ARCH = Linux-x86_64
# CHARMARCH matches the string from the charm-6.4.0 build above
CHARMARCH = mpi-linux-x86_64-gfortran-mpicxx
# NB: AMD: -march=bdver2
# Intel: -march=core-avx-i
# AMD & Intel: -msse4.2 -mfpmath=sse -mavx
FLOATOPTS= -O3 -msse4.2 -mfpmath=sse -mavx
#FLOATOPTS= -O3 -march=bdver2
#FLOATOPTS= -O3 -march=core-avx-i
CXX = g++ -malign-double -fPIC -DSOCKLEN_T=socklen_t -I$(CHARM_LOC)/include
CXXOPTS = $(FLOATOPTS)
CXXNOALIASOPTS = $(FLOATOPTS)
CC = gcc -malign-double -fPIC
COPTS = $(FLOATOPTS)
Configure using the name of the newly-created arch file:
cd $NAMD_SOURCE
./config Linux-x86_64-MVAPICH2-gcc
CUDA version:
./config Linux-x86_64-MVAPICH2-gcc-cuda --with-cuda --cuda-prefix $CUDA_PATH
Build:
cd Linux-x86_64-MVAPICH2-gcc
make -j 16 >& Make.out &
Test
Use the apoa1 benchmark[3] -- download the .tar.gz file.
Modify the input file apoa1.namd, adding your username to the output file name:
outputname /usr/tmp/apoa1-out-my_user_name
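A sketch of preparing the benchmark from the command line (tarball and directory names are assumptions; download the .tar.gz from the NAMD benchmark page first):
tar xzf apoa1.tar.gz
cd apoa1
# append your username to the output prefix in the input file
sed -i 's|^outputname .*|outputname /usr/tmp/apoa1-out-my_user_name|' apoa1.namd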
Use this job script (modify appropriately) -- shouldn't take more than 5 minutes:
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -j y
#$ -P fixmePrj
#$ -M fixme@drexel.edu
#$ -pe mvapich 128
#$ -l h_rt=01:00:00
#$ -l h_vmem=2G
#$ -l m_mem_free=2G
#$ -q all.q@@amdhosts
. /etc/profile
module load shared
module load proteus
module load sge/univa
module load gcc
module load proteus-mvapich2/gcc/64
module load proteus-fftw2/gcc/64/float
mpirun ./namd2 apoa1.namd
Benchmark result:
Info: Benchmark time: 128 CPUs 0.0259024 s/step 0.299796 days/ns 703.594 MB memory
Open64, with MVAPICH2
NOTE Open64 compilation does not seem to work for MPI no matter which implementation is used. Seems to work OK for shared memory parallel, i.e. single-node multi-core.
Modules:
1) shared 3) sge/univa 5) proteus-openmpi/open64/64/1.6.5-mlnx-ofed
2) proteus 4) open64/4.5.2.1
Environment:
CC = opencc
CXX = openCC
F77 = openf90
## CFLAGS = -O3 -mso -m64 -march=bdver2 -DCMK_OPTIMIZE=1
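A minimal sketch of exporting these before the Charm++ build (the CFLAGS line mirrors the commented-out value above and is optional):
export CC=opencc
export CXX=openCC
export F77=openf90
# export CFLAGS="-O3 -mso -m64 -march=bdver2 -DCMK_OPTIMIZE=1"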
charm-6.4.0
Expand:
cd NAMD_2.9_Source
tar xf charm-6.4.0.tar
cd charm-6.4.0
Build:
### superseded: ./build charm++ mpi-linux-x86_64 mpicxx smp -j16 --with-production -DCMK_OPTIMIZE=1
./build charm++ mpi-linux-x86_64 mpicxx -j16 --with-production -DCMK_OPTIMIZE=1
Test:
cd charm-6.4.0/mpi-linux-x86_64-smp-mpicxx/tests/charm++/megatest
make clean
make pgm
mpirun -n 16 ./pgm
### alternatively
# ./charmrun -ppn 16 ./pgm
Test in the cluster. Create a job script in the megatest directory and submit it:
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -j y
#$ -P fixmePrj
#$ -M fixme@drexel.edu
#$ -pe mvapich 128
#$ -q all.q@@amdhosts
#$ -l h_rt=2:00:00
#$ -l h_vmem=3G
. /etc/profile
module load shared
module load sge/univa
module load proteus
module load open64
module load proteus-mvapich2/open64
module list
cat $PE_HOSTFILE
which mpirun
mpirun -n $NSLOTS ./pgm
### alternatively
# ./charmrun -ppn $NSLOTS ./pgm
NAMD
Modify file for linking to FFTW2 -- arch/Linux-x86_64.fftw:
### arch/Linux-x86_64.fftw
### static linking
FFTDIR=$(FFTW2HOME)
FFTINCL=-I$(FFTDIR)/include
FFTLIB=$(FFTDIR)/lib/libsrfftw.a $(FFTDIR)/lib/libsfftw.a
FFTFLAGS=-DNAMD_FFTW
FFT=$(FFTINCL) $(FFTFLAGS)
Modify file for linking to Tcl 8.5 -- arch/Linux-x86_64.tcl:
### arch/Linux-x86_64.tcl
TCLDIR=
TCLINCL=
TCLLIB=-ltcl8.5 -ldl -lpthread
TCLFLAGS=-DNAMD_TCL
TCL=$(TCLINCL) $(TCLFLAGS)
Create file for building NAMD -- arch/Linux-x86_64-MPI-open64.arch:
### arch/Linux-x86_64-MPI-open64.arch
NAMD_ARCH = Linux-x86_64
# CHARMARCH matches the string from the charm-6.4.0 build above
CHARMARCH = mpi-linux-x86_64-smp-mpicxx
FLOATOPTS= -O3 -mso -march=bdver2
CXX = openCC -m64 -fPIC -DSOCKLEN_T=socklen_t -I$(CHARM_LOC)/include
CXXOPTS = $(FLOATOPTS)
CXXNOALIASOPTS = $(FLOATOPTS)
CC = opencc -m64 -fPIC
COPTS = $(FLOATOPTS)
Configure using the name of the newly-created arch file:
cd $NAMD_SOURCE
./config Linux-x86_64-MPI-open64
Build:
cd Linux-x86_64-MPI-open64
make -j 16 >& Make.out &
Open64, with OpenMPI
... TBA ...
General note on using CUDA
From Jim Phillips, Sr. Research Programmer for NAMD:
CUDA + MPI is slow with or without SMP, which is why we put effort into
network-specific Charm++ machine layers like ibverbs. The ibverbs port
can be launched with "charmrun ++mpiexec", which will use the system mpiexec
internally while allowing us to distribute portable binaries that don't
require extra scripting to adapt to a queueing system.
Their recommendation is to use a pre-compiled release "smp-ibverbs-CUDA".
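A hedged sketch of such a launch with a verbs build (flag values are illustrative; ++remote-shell is only needed if the launcher is not called mpiexec):
./charmrun ++mpiexec ++p $NSLOTS ./namd2 input.namd
# ./charmrun ++mpiexec ++remote-shell mpirun ++p $NSLOTS ./namd2 input.namd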
NVIDIA notes on using CUDA
NVIDIA has some recommended runtime options.[4] These notes are reproduced here:
General command line to run NAMD on a single-node system:
namd2 {namdOpts} {inputFile}
On a multi-node system NAMD has to be run with charmrun as specified below:
charmrun {charmOpts} namd2 {namdOpts} {inputFile}
{charmOpts}
- ++nodelist {nodeListFile} - multi-node runs require a list of nodes. Charm++ also supports an alternative ++mpiexec option if you're using a queueing system that mpiexec is set up to recognize.
- ++p $totalPes - specifies the total number of PE threads. This is the total number of Worker Threads (aka PE threads). We recommend this to be equal to (#TotalCPUCores - #TotalGPUs).
- ++ppn $pesPerProcess - number of PEs per process. We recommend to set this to #ofCoresPerNode/#ofGPUsPerNode - 1. This is necessary to free one of the threads per process for communication; make sure to specify +commap below. The total number of processes is equal to $totalPes/$pesPerProcess. When using the recommended value for this option, each process will use a single GPU.
{namdOpts}
- NAMD will inherit '++p' and '++ppn' as '+p' and '+ppn' if set in {charmOpts}.
- Otherwise, for the multi-core build use '+p' to set the number of cores.
- It is recommended to have no more than one process per GPU in a multi-node run. To get more communication threads, it is recommended to launch exactly one process per GPU. For single-node runs it is fine to use multiple GPUs per process.
- CPU affinity options (see user guide[5]):
  '+setcpuaffinity' - keeps threads from moving about
  '+pemap #-#' - maps computational threads to CPU cores
  '+commap #-#' - sets the range for communication threads
  Example for a dual-socket configuration with 16 cores per socket:
  +setcpuaffinity +pemap 1-15,17-31 +commap 0,16
- GPU options (see user guide[6]):
  '+devices {CUDA IDs}' - optionally specify device IDs to use in NAMD
  If devices are not in socket order, it might be useful to set this option to ensure that sockets use their directly-attached GPUs, for example, '+devices 2,3,0,1'
We recommend to always check the startup messages in NAMD to make sure the options are set correctly. Additionally, the ++verbose option can provide more detailed output for runs that use charmrun. Running top or other system tools can help you make sure you're getting the requested thread mapping.
{inputFile}
- Use the corresponding *.namd input file from one of the datasets in the next sub-section.
Example 1. Run ApoA1 on 1 node with 2xGPU and 2xCPU (20 cores total) and multi-core NAMD build:
./namd2 +p 20 +devices 0,1 apoa1.namd
Example 2. Run STMV on 2 nodes, each node with 2xGPU and 2xCPU (20 cores) and SMP NAMD build (note that we launch 4 processes, each controlling 1 GPU):
charmrun ++p 36 ++ppn 9 ./namd2 ++nodelist $NAMD_NODELIST +setcpuaffinity +pemap 1-9,11-19 +commap 0,10 +devices 0,1 stmv.namd
Note that by default, the "rsh" command is used to start namd2 on each node specified in the nodelist file. You can change this via the CONV_RSH environment variable, i.e., to use ssh instead of rsh run "export CONV_RSH=ssh" (see NAMD release notes for details).
Proteus GPU nodes
For Proteus, running NAMD on all 8 GPU nodes, the nodelist file should be:
group main
host gpu01
host gpu01
host gpu02
host gpu02
host gpu03
host gpu03
host gpu04
host gpu04
host gpu05
host gpu05
host gpu06
host gpu06
host gpu07
host gpu07
host gpu08
host gpu08
And the commandline to run is:
charmrun ++nodelist ./nodelist ++p 16 ++ppn 2 ${NAMD2EXE} +setcpuaffinity +pemap 1-7,9-15 +commap 0,8 inputfile.namd
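Rather than hard-coding hostnames, a sketch of generating this nodelist inside a job script from the Grid Engine hostfile (two host lines per node, matching the two GPUs per node as above; the generated names include the domain, which is fine):
echo "group main" > nodelist
awk '{print "host " $1; print "host " $1}' $PE_HOSTFILE >> nodelist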
Benchmarks
The HPC Advisory Council produced a presentation in 2011 comparing NAMD performance when using various implementations of MPI and with various network fabrics. They ran on Dell servers with AMD Opteron 6174 "Magny-Cours" CPUs.[7] The benchmarking job is a standard one (ApoA1) provided by the NAMD authors.[8]
When running apoa1, modify the input file apoa1.namd, adding your username:
outputname /usr/tmp/apoa1-out-my_user_name
You may also want to do a longer run to minimize the effect of startup latency.
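One way to do that (a sketch, assuming the stock apoa1.namd ends with a numsteps line; the value is illustrative):
sed -i 's/^numsteps .*/numsteps 5000/' apoa1.namd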
EXPERIMENTAL - ibverbs smp Intel
./build charm++ net-linux-x86_64 icc ibverbs smp ifort -j16 --with-production
Example arch file for Open64 from C. Abrams
# save this as Linux-x86_64-open64.arch in the arch subdirectory
NAMD_ARCH = Linux-x86_64
CHARMARCH = multicore-linux64
CXX = openCC
CXXOPTS = -O3 -ffast-math
CC = opencc
COPTS = -O3 -ffast-math
Update 2019-01-17 - New Xeon Gold 6148 Nodes on RHEL 6.8
- Reference: http://www.ks.uiuc.edu/Research/namd/2.13/notes.html
- Use intel/composerxe/2019u1 + devtoolset-6 environment
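A sketch of setting up that environment on the RHEL 6.8 nodes (the module name is taken from the bullet above; the devtoolset path is the standard Software Collections location, so verify it exists on the node):
module load intel/composerxe/2019u1
source /opt/rh/devtoolset-6/enable   # puts the devtoolset-6 gcc/g++ first in PATH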
SMP vs Threads
Since one core per process is used for the communication thread, SMP builds are typically slower than non-SMP builds. The advantage of SMP builds is that many data structures are shared among the threads, reducing the per-core memory footprint when scaling large simulations to large numbers of cores.
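As an illustration, a hedged sketch of the launch difference on two hypothetical 16-core nodes (ibverbs-style build; counts are illustrative):
# non-SMP build: every core runs a worker PE
charmrun ++nodelist ./nodelist ++p 32 ./namd2 input.namd
# SMP build: one thread per process is reserved for communication, so with
# 4 processes only 28 worker PEs fit on the same 32 cores
charmrun ++nodelist ./nodelist ++p 28 ++ppn 7 ./namd2 +setcpuaffinity input.namd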
Charm++
Just run ./build and answer the questions. Once you have decided on all the answers, you can do it all on a single line:
./build charm++ verbs-linux-x86_64 icc ifort -j32 -O3 -xHost
NAMD
You will need to fix the .mkl and Linux-x86_64-icc.arch files if you want a static build. Check the Link Line Advisor for specifics. Also, Charm++ would have to have been built with iccstatic. So, just do dynamic linking, and remove the "-static-intel" option in the "-icc.arch" file.
FFTW3 -- copy the MKL file to use the MKL FFTW3 interface. NOTE: this may not be necessary, since the configure step below uses "--with-mkl":
mv Linux-x86_64.fftw3 Linux-x86_64.fftw3.orig
cp Linux-x86_64.mkl Linux-x86_64.fftw3
Linux-x86_64-icc.arch:
NAMD_ARCH = Linux-x86_64
### the CHARMARCH here must match the Charm++ build in the section above
CHARMARCH = verbs-linux-x86_64-ifort-icc
FLOATOPTS = -ip -axAVX
CXX = icpc -std=c++11
CXXOPTS = -O2 $(FLOATOPTS) -xHost
CXXNOALIASOPTS = -O2 -fno-alias $(FLOATOPTS) -xHost
CXXCOLVAROPTS = -O2 -ip -xHost
CC = icc
COPTS = -O2 $(FLOATOPTS) -xHost
Configure:
./config Linux-x86_64-icc --with-mkl
Make:
cd Linux-x86_64-icc
make -j 32
See Also
- NAMD 2.9 Release Notes (includes compilation instructions)
- Compiling Quick Start Guide
- http://levlafayette.com/node/26
References
[2] Charm++ - Parallel Programming with Migratable Objects (UIUC Parallel Programming Laboratory)
[3] NAMD ApoA1 benchmark results and input files
[4] GPU-Accelerated NAMD - NAMD Running Instructions
[5] NAMD User Guide - Running NAMD - CPU Affinity