Job Script Example 08 ADAM
NOT YET WORKING
This ADAM example is taken from Neil Ferguson's blog post at Big Data Genomics.[1]
Outline
"We will explain how to apply deep learning using artificial neural networks to predict which population group an individual belongs to – based entirely on his or her genomic data."
This example is run using ADAM 0.22.0 with Apache Spark 2.2.0 with Scala 2.11.
Code
The code is written in the Scala language. Please see N. Ferguson's blog post for an explanation.[2]
The code is compiled using Apache Maven, giving a jar file uber-popstrat-0.1-SNAPSHOT.jar. After compilation, that jar file will be in the directory "target" under the top level of the popstrat source directory. The jar file should be copied into the directory containing the job script.
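The build-and-copy step can be sketched as below. This is a minimal sketch, assuming the standard Maven "package" phase is what produces the jar; the function names and the directory arguments are hypothetical and do not come from the blog post.

```shell
#!/bin/bash
# Sketch only: function names and directory arguments are hypothetical.

JAR_NAME="uber-popstrat-0.1-SNAPSHOT.jar"

# Copy the built jar from <source dir>/target into the job directory.
copy_popstrat_jar() {
    local src_dir="$1" job_dir="$2"
    local jar="${src_dir}/target/${JAR_NAME}"
    if [ ! -f "${jar}" ]; then
        echo "jar not found: ${jar}" >&2
        return 1
    fi
    cp "${jar}" "${job_dir}/"
}

# Build with Maven (assumes the standard "package" phase produces the jar),
# then copy the result next to the job script.
build_and_copy_popstrat() {
    local src_dir="$1" job_dir="$2"
    ( cd "${src_dir}" && mvn package ) || return 1
    copy_popstrat_jar "${src_dir}" "${job_dir}"
}

# Example (hypothetical paths):
#   build_and_copy_popstrat "${HOME}/src/popstrat" /lustre/scratch/juser/popstrat
```

Keeping the copy in its own function lets it fail early with a clear message if the build did not produce the expected jar.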
Data
The analysis is done on a VCF file from the 1000 Genomes Project. It is a large file (11 GB uncompressed), so it is striped on the Lustre filesystem to distribute the I/O load.
Obtaining the Data
Please see the BD Genomics blog post[3] for further detail.
The files required are from the 1000 Genomes Project; pick the geographically closest server. The files are in the subdirectory release/20130502/:
- ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
- integrated_call_samples_v3.20130502.ALL.panel
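Fetching the two files can be sketched as below. This is only a sketch: the mirror's base URL is deliberately left as a placeholder to be replaced with the server closest to you, the function names are hypothetical, and wget is assumed to be available on the login node.

```shell
#!/bin/bash
# Sketch only: the base URL passed in is a placeholder; use the mirror
# closest to you. Function names are hypothetical.

RELEASE_DIR="release/20130502"
FILES=(
    "ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
    "integrated_call_samples_v3.20130502.ALL.panel"
)

# Build the full URL of one release file from a mirror's base URL.
release_url() {
    local base="${1%/}" name="$2"   # strip a trailing slash from the base
    echo "${base}/${RELEASE_DIR}/${name}"
}

# Download both files into the current directory; -c resumes partial downloads.
fetch_release_files() {
    local base="$1" name
    for name in "${FILES[@]}"; do
        wget -c "$(release_url "${base}" "${name}")" || return 1
    done
}

# Example (placeholder URL, not a real server):
#   fetch_release_files "ftp://mirror.example.org/path/to/1000genomes"
```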
Striping the Data
See Lustre Scratch Filesystem#File Striping for more information.
After retrieving the compressed data file release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz, cp it into a striped directory. NB: do not use mv, because a file moved within the same filesystem retains its original striping.
The following assumes the directory /lustre/scratch/juser/popstrat already exists and contains the downloaded file.
[juser@proteusi01 ~]$ cd /lustre/scratch/juser/popstrat
[juser@proteusi01 popstrat]$ mkdir striped
[juser@proteusi01 popstrat]$ lfs setstripe -c 12 striped
[juser@proteusi01 popstrat]$ lfs getstripe striped
[juser@proteusi01 popstrat]$ cp ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz striped
[juser@proteusi01 popstrat]$ cd striped
[juser@proteusi01 striped]$ gunzip ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
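As a possible alternative to copying the compressed file and then decompressing it in place, the data can be decompressed straight into the striped directory. The output is then a newly created file, which on Lustre takes the default striping set on the directory. This is a minimal sketch; the function name is hypothetical, and the stripe-count report only runs where the lfs tool is installed.

```shell
#!/bin/bash
# Sketch only: the function name is hypothetical.

# Decompress <file.gz> into <striped dir>, leaving the compressed original
# where it is. The output is a newly created file, so on Lustre it takes the
# default striping set on the directory with "lfs setstripe".
gunzip_into_striped() {
    local gz_file="$1" striped_dir="$2"
    local out
    out="${striped_dir}/$(basename "${gz_file}" .gz)"
    gunzip -c "${gz_file}" > "${out}" || return 1
    # Report the stripe count if the lfs tool is available; outside Lustre
    # this check is skipped or fails harmlessly.
    if command -v lfs >/dev/null 2>&1; then
        lfs getstripe -c "${out}" || true
    fi
}

# Example, run from /lustre/scratch/juser/popstrat:
#   gunzip_into_striped ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz striped
```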
Job Script
The job script follows the same outline as all Apache Spark job scripts: a standalone Spark cluster is started on the nodes granted by Grid Engine, and the actual computation is then submitted to that cluster using spark-submit. See Job Script Example 03 Apache Spark.
#!/bin/bash
#$ -S /bin/bash
#$ -P FIXME
#$ -M FIXME@drexel.edu
#$ -m a
#$ -j y
#$ -cwd
#$ -jc spark2.intel
#$ -pe spark2.intel 64
#$ -l vendor=intel
#$ -l ua=sandybridge
#$ -l h_rt=6:00:00
#$ -l h_vmem=4G
#$ -l m_mem_free=2g
. /etc/profile.d/modules.sh
module load shared
module load gcc/4.8.1
module load sge/univa
module load proteus
module load adam/0.22.0
# Per-job Spark configuration directory; spark-env.sh must already exist there.
export SPARK_CONF_DIR="${SGE_O_WORKDIR}/conf.${JOB_ID}"
. "${SPARK_CONF_DIR}/spark-env.sh"
echo "Starting master on ${SPARK_MASTER_HOST} ..."
start-master.sh
echo "Done starting master."
echo "Starting slaves..."
start-slaves.sh
echo "Done starting slaves."
echo "Submitting job..."
### NB
### * the jar file produced by compiling the PopStrat source code should be moved into the directory where the job will run
### * the .panel file, being small, should not be in the "striped" directory
spark-submit --class "com.neilferguson.PopStrat" \
--driver-memory 6G uber-popstrat-0.1-SNAPSHOT.jar \
striped/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf \
integrated_call_samples_v3.20130502.ALL.panel
echo "Done job."
echo "Stopping slaves..."
stop-slaves.sh
echo "Done stopping slaves."
echo "Stopping master..."
stop-master.sh
echo "Done stopping master."
### Optionally clean up all files.
echo "Cleanup..."
# wait for all processes to really finish up
sleep 12
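# quote the variables so an empty or space-containing value cannot expand
# into unintended arguments to rm
/bin/rm -rf "${SPARK_CONF_DIR}"
/bin/rm -rf "${SPARK_LOG_DIR}"
/bin/rm -rf "${SPARK_WORKER_DIR}"
/bin/rm "${SPARK_SLAVES}"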
echo "...Done."
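Because the job script runs in the submission directory (-cwd) and refers to the jar, the VCF file and the panel file by relative path, it may be worth checking that all three are in place before submitting. The following is a minimal sketch of such a check; the function name and the job script file name in the example are hypothetical.

```shell
#!/bin/bash
# Sketch only: the function name is hypothetical; the file names are those
# used on the spark-submit line of the job script.

# Return 0 if every input the job script expects exists under <job dir>,
# otherwise list the missing files on stderr and return 1.
check_popstrat_inputs() {
    local job_dir="$1" missing=0 f
    for f in \
        "uber-popstrat-0.1-SNAPSHOT.jar" \
        "striped/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf" \
        "integrated_call_samples_v3.20130502.ALL.panel"
    do
        if [ ! -f "${job_dir}/${f}" ]; then
            echo "missing: ${job_dir}/${f}" >&2
            missing=1
        fi
    done
    return "${missing}"
}

# Example (script name is a placeholder):
#   check_popstrat_inputs . && qsub popstrat.sh
```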
References
[2]
[3]