Compiling NCBI C++ Toolkit
NCBI C++ Toolkit version 12.0.0[1] is installed on Proteus. Use the module ncbi-toolkit/gcc/64/12.0.0.
There may be module dependencies; check for any warning messages that appear when you run "module load ncbi-toolkit".
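To see what the modulefile will load and set before you load it, you can inspect it with the standard module show command:
[juser@proteusa01 ~]$ module show ncbi-toolkit/gcc/64/12.0.0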
Basic Usage
Load the Module
First, load the module which provides the toolkit:
[juser@proteusa01 ~]$ module load ncbi-toolkit/gcc/64/12.0.0
Check that it worked:
[juser@proteusa01 ~]$ which update_blastdb.pl
/mnt/HA/opt/ncbi-toolkit/gcc/64/12.0.0/bin/update_blastdb.pl
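You can also confirm that the BLAST+ programs themselves are on your PATH; the -version flag just prints the program version and exits:
[juser@proteusa01 ~]$ blastx -version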
DB Location
Decide on a directory where all the database files will be downloaded. If you are working in a group that will share the database, put it in your group directory. We will use ~/ncbi_db/:
[juser@proteusa01 ~]$ mkdir ncbi_db
Alternatively, use the local copy; see BLAST Databases.
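If the database will be shared by a group, a sketch of creating a group-writable directory instead (the group path below is hypothetical -- substitute your own group directory; g+rws lets group members add files, and the setgid bit keeps group ownership on new files):
[juser@proteusa01 ~]$ mkdir -p /mnt/HA/groups/myGrp/ncbi_db   # hypothetical group path
[juser@proteusa01 ~]$ chmod g+rws /mnt/HA/groups/myGrp/ncbi_db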
Set the BLASTDB Environment Variable
You can set the BLASTDB environment variable[2] in your ~/.bashrc file. You can also set it manually in any job script you write.
export BLASTDB=~/ncbi_db
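For example, to make the setting persistent across logins, append it to your ~/.bashrc (the path is the example directory chosen above):
[juser@proteusa01 ~]$ echo 'export BLASTDB=$HOME/ncbi_db' >> ~/.bashrc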
Download an Updated Database
We will use the nr database as an example. If the BLASTDB environment variable is not set, set it manually in the shell. (See above.)
See what databases are available:
[juser@proteusa01 ~]$ update_blastdb.pl --showall
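The list is long; if you already know the database name, you can filter it (this assumes the default output of one database name per line):
[juser@proteusa01 ~]$ update_blastdb.pl --showall | grep -w nr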
Just in case the BLASTDB environment variable is not properly used by the tools, cd into that directory and run the update there -- this will take up to an hour:
[juser@proteusa01 ~]$ cd $BLASTDB
[juser@proteusa01 ncbi_db]$ update_blastdb.pl nr
...
After it completes, check that all files were downloaded correctly by doing the checksum:
[juser@proteusa01 ncbi_db]$ md5sum -c *.md5
nr.01.tar.gz: OK
nr.02.tar.gz: OK
...
Uncompress them all:
[juser@proteusa01 ncbi_db]$ for x in nr.*.tar.gz ; do tar xf $x ; done
...
This produces many files: .phr, .psd, .psq, etc.
Delete the tarballs:
[juser@proteusa01 ncbi_db]$ rm -f *.tar.gz
Retain the *.md5 files so that the update_blastdb.pl script can tell which db is up to date.
Run Multithreaded
The installation of the NCBI Toolkit on Proteus does not use MPI, but it is multithreaded. That means it can use multiple processor cores on a single compute node, but it will not run computations across multiple compute nodes. Most NCBI Toolkit command line tools have an option to specify the number of threads. In a job script, the NSLOTS environment variable is set to the number of slots requested for the job. So:
#$ -pe shm 8
...
blastx -num_threads ${NSLOTS} ...
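Putting it together, a minimal job script sketch (the input file, database, output file, and resource requests are placeholder examples -- adjust them for your own job and check the current Proteus documentation for required resource flags):
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -pe shm 8
#$ -l h_rt=12:00:00
#$ -l h_vmem=4G

module load ncbi-toolkit/gcc/64/12.0.0
export BLASTDB=$HOME/ncbi_db

blastx -num_threads ${NSLOTS} -query input.fasta -db nr -out input_vs_nr.blastx.tsv -outfmt 6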
WARNING
Using the NCBI-hosted databases via the "-remote" option will get Proteus blocked by NCBI due to overuse. This is especially true for batch jobs on the cluster.
Compiling
[juser@proteusi01 ncbi_cxx--12_0_0]$ module list
Currently Loaded Modulefiles:
1) shared 2) proteus 3) gcc/4.8.1 4) sge/univa 5) hdf5_18/1.8.11
[juser@proteusi01 ncbi_cxx--12_0_0]$ ./configure LDFLAGS="-L$HDF5DIR" CPPFLAGS="-I$HDF5INCLUDE" \
--prefix=/mnt/HA/opt/ncbi_cxx/gcc/12.0.0 --with-algo --with-png --with-tiff --with-pcre \
--with-z --with-mysql --with-check --with-boost --with-xerces --with-libxslt \
--with-sge=/cm/shared/apps/sge/univa --with-xalan --with-gif --with-jpeg --with-xpm \
--with-curl --with-hdf5=${HDF5DIR} \
--with-mt --with-64 --without-debug --with-optimization --with-dll --with-runpath
=========
2014-08-28
module list
1) shared 4) sge/univa 7) proteus-fftw3/gcc/64/3.3.3 10) boost/openmpi/gcc/64/1.56.0
2) proteus 5) proteus-blas/gcc/64/20110419 8) python/2.7.8 11) hdf5_18/1.8.11
3) gcc/4.8.1 6) proteus-lapack/gcc/64/3.5.0 9) proteus-openmpi/gcc/64/1.8.1-mlnx-ofed
export CFLAGS="-O3 -mavx -msse4.2 -mfpmath=sse"
export CXXFLAGS="${CFLAGS}"
export NCBIPREFIX="/mnt/HA/opt/ncbi-toolkit/gcc/64/12.0.0"
./configure --prefix=${NCBIPREFIX} --with-mt --with-64 --with-lfs --with-check \
--with-bin-release --with-strip --with-sge=$SGE_ROOT --with-3psw=std:netopt \
--with-app --with-boost=$BOOSTDIR --with-optimization --without-debug \
--with-dll
Build happens in the directory GCC481-ReleaseMTDLL64.
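After configure finishes, the build itself is a GNU make run from inside that build tree; a sketch, assuming the toolkit's standard recursive build target all_r and an example parallel job count:
[juser@proteusi01 ncbi_cxx--12_0_0]$ cd GCC481-ReleaseMTDLL64/build
[juser@proteusi01 build]$ make -j 8 all_r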