Diagnosing Job Problems
Often, when one is first getting started on the cluster, the resource requirements for a job are not well known. The scheduler will terminate jobs which exceed their requested resource limits (e.g. memory or run time), in order to avoid interfering with other jobs which may happen to be running on the same nodes.
Job Scripts vs. One-Liners
While it may be somewhat satisfying to submit jobs with a single command typed directly at the system prompt, e.g.
sbatch -p bm -o myoutput-%A_%a.out --mail-user=myname@drexel.edu -A myrsrchPrj -N 4 --ntasks-per-node=12 -c 4 --mem-per-cpu=3G -t 24:00:00 --wrap='cd /ifs/groups/myrsrchGrp/thisjob && PATH=./bin:${PATH} && myprog -a arg1 -b arg2 -c arg3 -i /scratch/myname/thisjob/input.dat -o /scratch/myname/thisjob/output.dat'
doing so really hampers debugging.
Even for the shortest jobs, writing a script helps in the debugging process, and it also keeps a record of what was done. Compare:
#!/bin/bash
#SBATCH -p bm
#SBATCH --mail-user=myname@drexel.edu
#SBATCH -A myrsrchPrj
#SBATCH -N 4
#SBATCH --ntasks-per-node=12
#SBATCH -c 4
#SBATCH --mem-per-cpu=3G
#SBATCH -t 24:00:00
cd /ifs/groups/myrsrchGrp/thisjob
PATH=./bin:${PATH}
export DATADIR=/scratch/myname/thisjob
myprog -a arg1 -b arg2 -c arg3 -i ${DATADIR}/input.dat -o ${DATADIR}/output.dat
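The script can then be submitted, and later resubmitted or adjusted, with one short command. Assuming it has been saved as thisjob.sh (the file name here is just an example):
[juser@picotte001 ~]$ sbatch thisjob.sh
Submitted batch job 1234567
The job ID printed by sbatch is what you later pass to sacct, seff, or scancel.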
"Normal" Debugging
Standard debugging techniques still apply.
- Use "echo" statements to print out the values of relevant environment variables, or simply to indicate how far the script has progressed (see the sketch after this list)
- Use any verbose or debugging options that the program you are running provides, to print out data relevant to its progress
- Start with a small job (small in size of input set, short run time, etc.)
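As a minimal sketch of the first two points, a few extra lines in a job script make it much easier to see where a run went wrong. The program name, its --verbose option, and the DATADIR path are placeholders only:
#!/bin/bash
#SBATCH -t 00:10:00
set -x                                     # print each command before it runs
export DATADIR=/scratch/myname/thisjob     # placeholder path
echo "Job ${SLURM_JOB_ID} starting on $(hostname) at $(date)"
echo "Working directory: $(pwd)"
echo "DATADIR=${DATADIR}"                  # confirm the environment is what you expect
myprog --verbose -i ${DATADIR}/input.dat -o ${DATADIR}/output.dat
echo "myprog exited with status $? at $(date)"
With set -x and the echo markers, the job's output file shows exactly how far the script got before anything failed.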
Slurm (Picotte)
- See also: Slurm Utility Commands
sacct
The sacct command shows information about a job after it has completed.
[juser@picotte001 ~]$ sacct -j 1234567
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1234567_5567 evaluate_+        def    someprj          3     FAILED      2:0
1234567_556+      batch               someprj          3     FAILED      2:0
1234567_556+     extern               someprj          3  COMPLETED      0:0
The slurm_util modulefile provides aliases for Slurm commands with more informative options.
[juser@picotte001 ~]$ module load slurm_util
[juser@picotte001 ~]$ sacct_detail -j 1234567
               JobID    JobName      User  Partition        NodeList    Elapsed      State ExitCode     ReqMem     MaxRSS  MaxVMSize AllocTRES
-------------------- ---------- --------- ---------- --------------- ---------- ---------- -------- ---------- ---------- ---------- --------------------------------
1234567_5567         evaluate_+    myname        def         node061   00:00:00     FAILED      2:0         4M                       billing=3,cpu=3,mem=4M,node=1
1234567_5567.batch        batch                              node061   00:00:00     FAILED      2:0                  856K    157440K cpu=3,mem=4M,node=1
1234567_5567.extern      extern                              node061   00:00:00  COMPLETED      0:0                    4K    157364K billing=3,cpu=3,mem=4M,node=1
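sacct can also report requested versus used memory directly, which makes it easier to spot jobs killed for exceeding their memory request. A minimal sketch, with one reasonable (but by no means the only) choice of fields:
[juser@picotte001 ~]$ sacct -j 1234567 --format=JobID,State,ExitCode,ReqMem,MaxRSS,Elapsed
The ExitCode column is the program's exit code followed by the terminating signal, so the 2:0 above means the program itself exited with status 2 rather than being killed; a MaxRSS approaching ReqMem together with a FAILED or OUT_OF_MEMORY state usually means the memory request should be increased.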
seff
The seff command is undocumented; it shows the CPU and memory efficiency of a completed job. (Data shown for a job that is still running may be inaccurate.)
[juser@picotte001 ~]$ seff 12345
Job ID: 12345
Cluster: picotte
User/Group: juser/juser
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 12
CPU Utilized: 00:00:25
CPU Efficiency: 8.01% of 00:05:12 core-walltime
Job Wall-clock time: 00:00:26
Memory Utilized: 1.02 MB
Memory Efficiency: 0.00% of 45.70 GB
For array jobs, use the seff_array command.
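The efficiency figures are simple ratios and can be checked by hand: CPU efficiency is CPU time used divided by core-walltime (wall-clock time times allocated cores), and memory efficiency is memory used divided by memory requested. For the job above, roughly:
# core-walltime = 26 s wall-clock x 12 cores = 312 s = 00:05:12
[juser@picotte001 ~]$ echo "scale=4; 25 / (26 * 12)" | bc
.0801
# i.e. about 8.01% CPU efficiency; 1.02 MB used of 45.70 GB requested rounds to 0.00%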
Grid Engine (Proteus)
qstat
The qstat command can give detailed information about a job (except for scheduling information). If the job is in state "Eqw", i.e. in error, queued, and waiting, there will be an error reason listed. E.g.
[juser@proteusa01 ~]$ qstat -j 123456
==============================================================
job_number: 123456
jclass: NONE
exec_file: job_scripts/123456
...
sge_o_workdir: /home/juser/myjob
sge_o_host: proteusa01
account: sge
cwd: /home/juser/myjob
...
error reason    1:      02/12/2015 14:31:56 [1062:71145]: execvp(/cm/local/apps/sge/var/spool/ic08n01/job_scripts/123456, "/cm/local/apps/sge/var/spool/ic08n01/job_scripts/123456") failed: No such file or directory
scheduling info: (Collecting of scheduler job information is turned off)
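When a job is stuck in Eqw, it is usually only the error reason that matters. One way to pull just that line out (the grep pattern is only a suggestion):
[juser@proteusa01 ~]$ qstat -j 123456 | grep -i 'error reason'
Once the underlying problem has been fixed (e.g. a missing file or directory restored), the error state can be cleared with qmod -cj 123456 so the job becomes eligible to run again; if the job is beyond saving, delete it with qdel 123456.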
qacct
The main tool for understanding the resource usage of your job is qacct. Once your job has completed, do:
[myname@proteusi01 ~]$ qacct -j job_id
==============================================================
qname all.q
hostname ac06n02.cm.cluster
group myrsrchGrp
owner myname
project myrsrchPrj
department defaultdepartment
jobname thisjob.sh
jobnumber 123456
...
qsub_time 01/21/2015 13:50:48.070
start_time 01/21/2015 13:50:49.003
end_time 01/21/2015 17:04:39.883
granted_pe NONE
slots 1
failed 100 : assumedly after job
deleted_by NONE
exit_status 137
ru_wallclock 11630.880
...
cpu 5810.160
mem 10402.699
io 11.938
iow 0.000
maxvmem 4.742G
...
Much of the output has been omitted. This output shows that the job exited with status 137 (= 128 + 9). Signal 9 is an unconditional kill; for jobs under the scheduler, that indicates the job was terminated by the scheduler, typically for exceeding a resource limit. (Not exactly true, but it suffices for the purposes here.) We also see a maxvmem usage of 4.742 GiB, the actual peak memory usage while the job was running, which can be compared with the job's resource request. Note that h_vmem in a resource request is a per-slot value, whereas the maxvmem reported here is a total over the entire job.
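The exit-status arithmetic can be checked directly at the shell prompt, and maxvmem can be compared against the per-slot request in the same way. The h_vmem value below is only an illustration; substitute whatever your job actually requested:
# exit status 137 = 128 + signal number; which signal is that?
[myname@proteusi01 ~]$ kill -l $((137 - 128))
KILL
# The per-job memory limit is slots x h_vmem. With the single slot granted above and
# an illustrative request of h_vmem=4G, the limit is 1 x 4 GiB = 4 GiB, so a maxvmem
# of 4.742 GiB would exceed it and explain the kill.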
Details of the reported parameters are in the man page for accounting(5).[1]