Job Script Example 05 pyRAD

This is an example of using PyRAD. The example may be found in:

/mnt/HA/opt/Examples/Example05

Please see the README file in that directory.

Introduction

This is a translation of the author's tutorial.[1] The original tutorial is presented in a Jupyter[2] notebook, which merely runs a lot of standard Linux commands. We extract them into typed-out commands, and a job script.

This translation will not include detailed commentary.

Load the necessary modulefiles:

[juser@proteusa01 pyrad-tut]$ module load python/2.7-current vsearch muscle pyrad

Download Sample Data

Download the example data set, and expand it.

[juser@proteusa01 pyrad-tut]$ wget -q dereneaton.com/downloads/simRADs.zip [juser@proteusa01 pyrad-tut]$ unzip simRADs.zip

This should create these two files:

simRADs.barcodes simRADs_R1.fastq.gz

Create pyRAD params.txt file

Create params.txt:

[juser@proteusa01 pyrad-tut]$ pyRAD -n

Modify the file to look like this:

==** parameter inputs for pyRAD version 3.0.5  **======================== affected step ==
./                        ## 1. Working directory                                 (all)
./*.fastq.gz              ## 2. Loc. of non-demultiplexed files (if not line 16)  (s1)
./*.barcodes              ## 3. Loc. of barcode file (if not line 16)             (s1)
vsearch                   ## 4. command (or path) to call vsearch (or usearch)    (s3,s6)
muscle                    ## 5. command (or path) to call muscle                  (s3,s7)
TGCAG                     ## 6. Restriction overhang (e.g., C|TGCAG -> TGCAG)     (s1,s2)
4                         ## 7. N processors (parallel)                           (all)
6                         ## 8. Mindepth: min coverage for a cluster              (s4,s5)
4                         ## 9. NQual: max # sites with qual < 20 (or see line 20)(s2)
.85                       ## 10. Wclust: clustering threshold as a decimal        (s3,s6)
rad                       ## 11. Datatype: rad,gbs,pairgbs,pairddrad,(others:see docs)(all)
4                         ## 12. MinCov: min samples in a final locus             (s7)
3                         ## 13. MaxSH: max inds with shared hetero site          (s7)
c85m4p3                   ## 14. Prefix name for final output (no spaces)         (s7)
==== optional params below this line ===================================  affected step ==
                       ## 15.opt.: select subset (prefix* only selector)            (s2-s7)
                       ## 16.opt.: add-on (outgroup) taxa (list or prefix*)         (s6,s7)
                       ## 17.opt.: exclude taxa (list or prefix*)                   (s7)
                       ## 18.opt.: loc. of de-multiplexed data                      (s2)
                       ## 19.opt.: maxM: N mismatches in barcodes (def= 1)          (s1)
                       ## 20.opt.: phred Qscore offset (def= 33)                    (s2)
                       ## 21.opt.: filter: def=0=NQual 1=NQual+adapters. 2=strict   (s2)
                       ## 22.opt.: a priori E,H (def= 0.001,0.01, if not estimated) (s5)
                       ## 23.opt.: maxN: max Ns in a cons seq (def=5)               (s5)
8                      ## 24.opt.: maxH: max heterozyg. sites in cons seq (def=5)   (s5)
                       ## 25.opt.: ploidy: max alleles in cons seq (def=2;see docs) (s4,s5)
                       ## 26.opt.: maxSNPs: (def=100). Paired (def=100,100)         (s7)
                       ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7)
                       ## 28.opt.: random number seed (def. 112233)              (s3,s6,s7)
                       ## 29.opt.: trim overhang left,right on final loci, def(0,0) (s7)
*                      ## 30.opt.: output formats: p,n,a,s,v,u,t,m,k,g,* (see docs) (s7)
                       ## 31.opt.: maj. base call at depth>x<mindepth (def.x=mindepth) (s5)
                       ## 32.opt.: keep trimmed reads (def=0). Enter min length.    (s2)
                       ## 33.opt.: max stack size (int), def= max(500,mean+2*SD)    (s3)
                       ## 34.opt.: minDerep: exclude dereps with <= N copies, def=1 (s3)
                       ## 35.opt.: use hierarchical clustering (def.=0, 1=yes)      (s6)
                       ## 36.opt.: repeat masking (def.=1='dust' method, 0=no)      (s3,s6)
16                     ## 37.opt.: vsearch max threads per job (def.=6; see docs)   (s3,s6)
==== optional: list group/clade assignments below this line (see docs) ==================

Suggestion for lines 7 and 37: TBA.

Create Job Script and Submit

Then, write this job script for submission.

#!/bin/bash
#$ -S /bin/bash
#$ -P myrsrchPrj
#$ -M myname@drexel.edu
#$ -cwd
#$ -j y
#$ -l h_rt=2:00:00
#$ -l m_mem_free=3g
#$ -l h_vmem=4g
#$ -l vendor=amd
#$ -pe shm 64

. /etc/profile.d/modules.sh
module load shared
module load sge/univa
module load gcc
module load proteus
module load python/2.7-current
module load vsearch
module load muscle
module load pyrad

pyRAD -p params.txt -s 1
pyRAD -p params.txt -s 2
pyRAD -p params.txt -s 3
pyRAD -p params.txt -s 4
pyRAD -p params.txt -s 5
pyRAD -p params.txt -s 6
pyRAD -p params.txt -s 7

### Instead of doing those seven steps separately, you should be able to do it all in one go.  See documentation.

The final stats will be in the file "stats/c85m4p3.stats".

Resource Usage

The resource usage from running this exact example is (see the file job_resource_usage.241240.txt):

==============================================================
qname        all.q
hostname     ac03n02.cm.cluster
group        urcfadm
owner        dwc62
project      urcfadmPrj
department   defaultdepartment
jobname      pyradtest.sh
jobnumber    241240
taskid       undefined
account      sge
priority     0
cwd          /lustre/scratch/examples/pyRAD
submit_host  proteusi01.cm.cluster
submit_cmd   qsub pyradtest.sh
qsub_time    08/11/2015 00:30:25.173
start_time   08/11/2015 00:30:26.327
end_time     08/11/2015 00:39:16.929
granted_pe   shm
slots        64
failed       0
deleted_by   NONE
exit_status  0
ru_wallclock 530.602
ru_utime     1571.871
ru_stime     312.234
ru_maxrss    67260
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    33705489
ru_majflt    94
ru_nswap     0
ru_inblock   66576
ru_oublock   198264
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     263490
ru_nivcsw    58258
cpu          1884.105
mem          583.045
io           1.255
iow          0.000
maxvmem      2.855G
maxrss       229.281M
maxpss       164.399M
arid         undefined
jc_name      NONE

Highlights:

run time (wallclock): about 9 minutes
total memory usage (maxvmem): 2.855 GiB

References

[1] pyRAD SE RAD Tutorial - Example de novo RADseq assembly using pyRAD

[2] Jupyter - web-based GUI workbook format