Compiling NCBI C++ Toolkit
NCBI C++ Toolkit version 12.0.0[1] is installed on Proteus. Use the module
ncbi-toolkit/gcc/64/12.0.0
The module may have dependencies; check any warning messages printed by "module load ncbi-toolkit".
Basic Usage
Load the Module
First, load the module which provides the toolkit:
[juser@proteusa01 ~]$ module load ncbi-toolkit/gcc/64/12.0.0
Check that it worked:
[juser@proteusa01 ~]$ which update_blastdb.pl
/mnt/HA/opt/ncbi-toolkit/gcc/64/12.0.0/bin/update_blastdb.pl
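The same check can be scripted so that a job fails early when the module is not loaded. A minimal sketch; the function name require_tool is hypothetical and not part of the toolkit:

```shell
# Print where a tool resolves from, or fail if it is not on PATH.
# require_tool is a hypothetical helper name used for illustration.
require_tool() {
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "$1 not found; load the ncbi-toolkit module first" >&2
        return 1
    fi
    command -v "$1"
}
```

For example, "require_tool update_blastdb.pl || exit 1" near the top of a script stops it before any work is attempted.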
DB Location
Decide on a directory where all the database files will be downloaded.
If you are working in a group which will share the database, put this in
your group directory. We will use ~/ncbi_db/:
[juser@proteusa01 ~]$ mkdir ncbi_db
Alternatively, use the local copy; see BLAST Databases.
Set the BLASTDB Environment Variable
You can set the BLASTDB environment variable[2] in your .bashrc
file. You can also set it manually in any job script you write.
export BLASTDB=~/ncbi_db
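A job script can guard against an unset variable before calling any BLAST tool. A minimal sketch, assuming the ~/ncbi_db directory chosen above; the fallback is only a default, and any BLASTDB already in the environment takes precedence:

```shell
# Keep an existing BLASTDB; otherwise fall back to ~/ncbi_db (assumed location).
: "${BLASTDB:=$HOME/ncbi_db}"
export BLASTDB

# Create the directory if it does not exist yet, then report what will be used.
mkdir -p "$BLASTDB"
echo "BLASTDB is set to: $BLASTDB"
```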
Download an Updated Database
We will use the nr database as an example. If the BLASTDB
environment variable is not set, set it manually in the shell. (See above.)
See what databases are available:
[juser@proteusa01 ~]$ update_blastdb.pl --showall
In case the tools do not honor the BLASTDB environment variable, cd into
that directory and run the update there. This will take up to an hour:
[juser@proteusa01 ~]$ cd $BLASTDB
[juser@proteusa01 ncbi_db]$ update_blastdb.pl nr
...
After it completes, check that all files were downloaded correctly by verifying the checksums:
[juser@proteusa01 ncbi_db]$ md5sum -c *.md5
nr.01.tar.gz: OK
nr.02.tar.gz: OK
...
Uncompress them all:
[juser@proteusa01 ncbi_db]$ for x in nr.*.tar.gz ; do tar xf "$x" ; done
...
This produces many files: .phr, .psd, .psq, etc.
Delete the tarballs:
[juser@proteusa01 ncbi_db]$ rm -f *.tar.gz
Retain the *.md5 files so that the update_blastdb.pl script can tell
which databases are up to date.
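The verify, uncompress, and delete steps above can be combined so that a tarball is removed only after it has passed its checksum and unpacked cleanly. A minimal sketch, assuming each archive has a matching checksum file named after it with .md5 appended (as the md5sum output above suggests); the function name unpack_verified is hypothetical:

```shell
# Verify, unpack, and remove each downloaded archive in turn.
# Usage: unpack_verified DIR PREFIX   (e.g. unpack_verified "$BLASTDB" nr)
# The body runs in a subshell, so the caller's working directory is unchanged.
unpack_verified() (
    dir="$1"
    prefix="$2"
    cd "$dir" || exit 1
    for archive in "$prefix".*.tar.gz; do
        # Skip the literal pattern when the glob matched nothing.
        [ -e "$archive" ] || continue
        # Stop at the first archive that fails its checksum or fails to unpack.
        md5sum -c "$archive.md5" || exit 1
        tar xf "$archive" || exit 1
        # Remove only the tarball; the .md5 file stays for update_blastdb.pl.
        rm -f "$archive"
    done
)
```

Stopping at the first failure leaves the remaining tarballs in place, so a bad download can be fetched again without re-extracting everything.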
Run Multithreaded
The installation of NCBI Toolkit on Proteus does not use MPI, but it is
multithreaded. That means it can use multiple processor cores on a
single compute node, but will not do computations using multiple compute
nodes. Most NCBI Toolkit command line tools have an option to specify
the number of threads. In a job script, the NSLOTS environment
variable is set to the number of slots requested. So:
#$ -pe shm 8
...
blastx -num_threads ${NSLOTS} ...
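A fuller job script might look like the following. This is a sketch under stated assumptions: the file names query.fasta and results.txt are hypothetical, the toolkit module is assumed to be loaded already, and the fallback to one thread only matters when the script is run outside the scheduler, where NSLOTS is not set.

```shell
#!/bin/bash
#$ -cwd
#$ -pe shm 8

# NSLOTS is set by the scheduler inside a job; default to 1 thread elsewhere.
THREADS="${NSLOTS:-1}"

# Hypothetical input and output names, for illustration only.
QUERY="query.fasta"
OUT="results.txt"

if command -v blastx >/dev/null 2>&1 && [ -r "$QUERY" ]; then
    # Search the local nr database (found via BLASTDB) with one thread per slot.
    blastx -db nr -query "$QUERY" -out "$OUT" -num_threads "$THREADS"
else
    echo "Skipping: blastx or $QUERY not available (is the module loaded?)" >&2
fi
```

Taking the thread count from NSLOTS rather than hard-coding it keeps the job from using more cores than it was granted when the slot request changes.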
WARNING
Using the NCBI-hosted databases via the "-remote" option will get Proteus blocked by NCBI due to overuse. This is especially true for batch jobs on the cluster.
Compiling
[juser@proteusi01 ncbi_cxx--12_0_0]$ module list
Currently Loaded Modulefiles:
1) shared 2) proteus 3) gcc/4.8.1 4) sge/univa 5) hdf5_18/1.8.11
[juser@proteusi01 ncbi_cxx--12_0_0]$ ./configure LDFLAGS="-L$HDF5DIR" CPPFLAGS="-I$HDF5INCLUDE" \
--prefix=/mnt/HA/opt/ncbi_cxx/gcc/12.0.0 --with-algo --with-png --with-tiff --with-pcre \
--with-z --with-mysql --with-check --with-boost --with-xerces --with-libxslt \
--with-sge=/cm/shared/apps/sge/univa --with-xalan --with-gif --with-jpeg --with-xpm \
--with-curl --with-hdf5=${HDF5DIR} \
--with-mt --with-64 --without-debug --with-optimization --with-dll --with-runpath
=========
2014-08-28
module list
1) shared 4) sge/univa 7) proteus-fftw3/gcc/64/3.3.3 10) boost/openmpi/gcc/64/1.56.0
2) proteus 5) proteus-blas/gcc/64/20110419 8) python/2.7.8 11) hdf5_18/1.8.11
3) gcc/4.8.1 6) proteus-lapack/gcc/64/3.5.0 9) proteus-openmpi/gcc/64/1.8.1-mlnx-ofed
export CFLAGS="-O3 -mavx -msse4.2 -mfpmath=sse"
export CXXFLAGS="${CFLAGS}"
export NCBIPREFIX="/mnt/HA/opt/ncbi-toolkit/gcc/64/12.0.0"
./configure --prefix=${NCBIPREFIX} --with-mt --with-64 --with-lfs --with-check \
--with-bin-release --with-strip --with-sge=$SGE_ROOT --with-3psw=std:netopt \
--with-app --with-boost=$BOOSTDIR --with-optimization --without-debug \
--with-dll
Build happens in directory GCC481-ReleaseMTDLL64.
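The build step itself is not recorded here. A minimal sketch, assuming the toolkit's usual layout, in which configure creates a build subdirectory inside the directory named above and the all_r target builds everything recursively; check the instructions that configure prints when it finishes before relying on this. The -j 4 value is an arbitrary choice.

```shell
# Build directory name taken from the note above.
BUILD_DIR="GCC481-ReleaseMTDLL64"

if [ -d "$BUILD_DIR/build" ]; then
    # Assumed target: all_r builds all configured projects recursively.
    ( cd "$BUILD_DIR/build" && make -j 4 all_r )
else
    echo "No $BUILD_DIR/build directory here; run ./configure from the source root first." >&2
fi
```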