
Job Script Example 08 ADAM

NOT YET WORKING

This ADAM example is taken from Neil Ferguson's blog post at Big Data Genomics.[1]

Outline

"We will explain how to apply deep learning using artificial neural networks to predict which population group an individual belongs to – based entirely on his or her genomic data."

This example uses ADAM 0.22.0 on Apache Spark 2.2.0 with Scala 2.11.

Code

The code is written in the Scala language. Please see N. Ferguson's blog post for an explanation.[2]

The code is compiled with Apache Maven, producing the jar file uber-popstrat-0.1-SNAPSHOT.jar. After compilation, the jar file is in the directory "target" under the top level of the popstrat source tree.

The jar file should be copied into the directory containing the job script.
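The build-and-copy step can be sketched as follows. The source location ~/src/popstrat and the job directory /lustre/scratch/juser/popstrat are assumptions; adjust them for your account.

```shell
# Sketch: build the PopStrat uber-jar and copy it into the job directory.
# ~/src/popstrat and /lustre/scratch/juser/popstrat are assumed paths.
cd ~/src/popstrat
mvn package                        # shaded jar lands under target/
cp target/uber-popstrat-0.1-SNAPSHOT.jar /lustre/scratch/juser/popstrat/
```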

Data

The analysis is done on a VCF file from the 1000 Genomes Project. It is a large file, 11 GB uncompressed, so it is striped on the Lustre filesystem to distribute the I/O load.

Obtaining the Data

Please see the BD Genomics blog post[3] for further detail.

The files required are from the 1000 Genomes Project; pick the geographically closest server. The files are in the subdirectory release/20130502/:

  • ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
  • integrated_call_samples_v3.20130502.ALL.panel
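The download can be sketched with wget. The EBI mirror URL below is an assumption; substitute your nearest 1000 Genomes mirror.

```shell
# Sketch: fetch both inputs. BASE points at an assumed mirror; change it
# to the geographically closest 1000 Genomes server.
BASE=ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502
wget "${BASE}/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
wget "${BASE}/integrated_call_samples_v3.20130502.ALL.panel"
```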

Striping the Data

See Lustre Scratch Filesystem#File Striping for more information.

After retrieving the compressed data file release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz, cp it into a striped directory. NB: do not use mv, since a moved file retains its original (unstriped) layout.

The following assumes the directory /lustre/scratch/juser/popstrat already exists.

[juser@proteusi01 ~]$ cd /lustre/scratch/juser/popstrat
[juser@proteusi01 popstrat]$ mkdir striped
[juser@proteusi01 popstrat]$ lfs setstripe -c 12 striped
[juser@proteusi01 popstrat]$ lfs getstripe striped
[juser@proteusi01 popstrat]$ cp ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz striped
[juser@proteusi01 popstrat]$ cd striped
[juser@proteusi01 striped]$ gunzip ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
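Before running the job it is worth confirming that the copy actually inherited the directory's 12-way stripe layout; a file that was mv'd rather than cp'd will still report its old stripe count.

```shell
# Sketch: verify the stripe count of the uncompressed VCF.
# With the 12-way striping set above, this should print 12.
lfs getstripe -c striped/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf
```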

Job Script

The job script follows the same outline as all Apache Spark job scripts: a standalone Spark cluster is started on the nodes granted by Grid Engine, then the actual computation is submitted to that cluster using spark-submit. See Job Script Example 03 Apache Spark.

#!/bin/bash
#$ -S /bin/bash
#$ -P FIXME
#$ -M FIXME@drexel.edu
#$ -m a
#$ -j y
#$ -cwd
#$ -jc spark2.intel
#$ -pe spark2.intel 64
#$ -l vendor=intel
#$ -l ua=sandybridge
#$ -l h_rt=6:00:00
#$ -l h_vmem=4G
#$ -l m_mem_free=2g
. /etc/profile.d/modules.sh
module load shared
module load gcc/4.8.1
module load sge/univa
module load proteus
module load adam/0.22.0

export SPARK_CONF_DIR=${SGE_O_WORKDIR}/conf.${JOB_ID}
. ${SPARK_CONF_DIR}/spark-env.sh

echo "Starting master on ${SPARK_MASTER_HOST} ..."
start-master.sh
echo "Done starting master."

echo "Starting slaves..."
start-slaves.sh
echo "Done starting slaves."

echo "Submitting job..."

### NB
###   * the jar file produced by compiling the PopStrat source code should be moved into the directory where the job will run
###   * the .panel file, being small, should not be in the "striped" directory

spark-submit --class "com.neilferguson.PopStrat" \
    --driver-memory 6G uber-popstrat-0.1-SNAPSHOT.jar \
    striped/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf \
    integrated_call_samples_v3.20130502.ALL.panel

echo "Done job."

echo "Stopping slaves..."
stop-slaves.sh
echo "Done stopping slaves."

echo "Stopping master..."
stop-master.sh
echo "Done stopping master."

### Optionally clean up all files.
echo "Cleanup..."
# wait for all processes to really finish up
sleep 12
/bin/rm -rf ${SPARK_CONF_DIR}
/bin/rm -rf ${SPARK_LOG_DIR}
/bin/rm -rf ${SPARK_WORKER_DIR}
/bin/rm ${SPARK_SLAVES}
echo "...Done."
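Submission works as for any Grid Engine job. The script name popstrat.sh is an assumption; use whatever you saved the script above as.

```shell
# Sketch: submit the job script ("popstrat.sh" is an assumed filename)
# and monitor its state (qw = queued, r = running).
qsub popstrat.sh
qstat -u juser
```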

References

[1] Big Data Genomics blog - Genomic Analysis Using ADAM, Spark and Deep Learning (Neil Ferguson; dated Jul 10, 2015; retrieved Oct 5, 2017)

[2] Ibid.

[3] Ibid.