Diagnosing Job Problems
Often, when one is first getting started on the cluster, the resource requirements for a job are not well known. The scheduler will terminate jobs which exceed their requested resource limits (e.g. memory or run time), in order to avoid interfering with other jobs which may happen to be running on the same nodes.
Job Scripts vs. One-Liners
While it may be somewhat satisfying to submit jobs with a single command typed directly at the system prompt, e.g.
sbatch -p bm -o myoutput-%A_%a.out --mail-user=myname@drexel.edu -A myrsrchPrj -N 4 --ntasks-per-node=12 -c 4 --mem-per-cpu=3G -t 24:00:00 --wrap='cd /ifs/groups/myrsrchGrp/thisjob && PATH=./bin:${PATH} && myprog -a arg1 -b arg2 -c arg3 -i /scratch/myname/thisjob/input.dat -o /scratch/myname/thisjob/output.dat'
doing so really hampers debugging.
Even for the shortest jobs, writing a script helps in the debugging process, and it also keeps a record of what was done. Compare:
#!/bin/bash
#SBATCH -p bm
#SBATCH --mail-user=myname@drexel.edu
#SBATCH -A myrsrchPrj
#SBATCH -N 4
#SBATCH --ntasks-per-node=12
#SBATCH -c 4
#SBATCH --mem-per-cpu=3G
#SBATCH -t 24:00:00
cd /ifs/groups/myrsrchGrp/thisjob
PATH=./bin:${PATH}
export DATADIR=/scratch/myname/thisjob
myprog -a arg1 -b arg2 -c arg3 -i ${DATADIR}/input.dat -o ${DATADIR}/output.dat
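The script can then be submitted, and later resubmitted or adjusted, with one short command. Assuming it has been saved as thisjob.sh (the file name here is just an example):
[juser@picotte001 ~]$ sbatch thisjob.sh
Submitted batch job 1234567
The job ID printed by sbatch is what you later pass to sacct, seff, or scancel.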
"Normal" Debugging
Standard debugging techniques still apply.
- Use "echo" statements to print out the values of relevant environment variables, or simply to indicate how far the script has progressed (see the sketch after this list)
- Use any verbose or debugging options that the program you are running provides, to print out data relevant to its progress
- Start with a small job (small in size of input set, short run time, etc.)
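As a minimal sketch of the first two points, a few extra lines in a job script make it much easier to see where a run went wrong. The program name, its --verbose option, and the DATADIR path are placeholders only:
#!/bin/bash
#SBATCH -t 00:10:00
set -x                                     # print each command before it runs
export DATADIR=/scratch/myname/thisjob     # placeholder path
echo "Job ${SLURM_JOB_ID} starting on $(hostname) at $(date)"
echo "Working directory: $(pwd)"
echo "DATADIR=${DATADIR}"                  # confirm the environment is what you expect
myprog --verbose -i ${DATADIR}/input.dat -o ${DATADIR}/output.dat
echo "myprog exited with status $? at $(date)"
With set -x and the echo markers, the job's output file shows exactly how far the script got before anything failed.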
Slurm (Picotte)
- See also: Slurm Utility Commands
sacct
The sacct command shows information about a job after it has completed.
[juser@picotte001 ~]$ sacct -j 1234567
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1234567_5567 evaluate_+        def    someprj          3     FAILED      2:0
1234567_556+      batch               someprj          3     FAILED      2:0
1234567_556+     extern               someprj          3  COMPLETED      0:0
The slurm_util modulefile provides aliases for Slurm commands with more informative options.
[juser@picotte001 ~]$ module load slurm_util
[juser@picotte001 ~]$ sacct_detail -j 1234567
               JobID    JobName      User  Partition        NodeList    Elapsed      State ExitCode     ReqMem     MaxRSS  MaxVMSize AllocTRES
-------------------- ---------- --------- ---------- --------------- ---------- ---------- -------- ---------- ---------- ---------- --------------------------------
1234567_5567         evaluate_+    myname        def         node061   00:00:00     FAILED      2:0         4M                       billing=3,cpu=3,mem=4M,node=1
1234567_5567.batch        batch                              node061   00:00:00     FAILED      2:0                  856K    157440K cpu=3,mem=4M,node=1
1234567_5567.extern      extern                              node061   00:00:00  COMPLETED      0:0                    4K    157364K billing=3,cpu=3,mem=4M,node=1
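sacct can also report requested versus used memory directly, which makes it easier to spot jobs killed for exceeding their memory request. A minimal sketch, with one reasonable (but by no means the only) choice of fields:
[juser@picotte001 ~]$ sacct -j 1234567 --format=JobID,State,ExitCode,ReqMem,MaxRSS,Elapsed
The ExitCode column is the program's exit code followed by the terminating signal, so the 2:0 above means the program itself exited with status 2 rather than being killed; a MaxRSS approaching ReqMem together with a FAILED or OUT_OF_MEMORY state usually means the memory request should be increased.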
seff
The seff command is undocumented; it shows the CPU and memory efficiency of a completed job. (Data shown for a job that is still running may be inaccurate.)
[juser@picotte001 ~]$ seff 12345
Job ID: 12345
Cluster: picotte
User/Group: juser/juser
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 12
CPU Utilized: 00:00:25
CPU Efficiency: 8.01% of 00:05:12 core-walltime
Job Wall-clock time: 00:00:26
Memory Utilized: 1.02 MB
Memory Efficiency: 0.00% of 45.70 GB
For array jobs, use the seff_array command.
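The efficiency figures are simple ratios and can be checked by hand: CPU efficiency is CPU time used divided by core-walltime (wall-clock time times allocated cores), and memory efficiency is memory used divided by memory requested. For the job above, roughly:
# core-walltime = 26 s wall-clock x 12 cores = 312 s = 00:05:12
[juser@picotte001 ~]$ echo "scale=4; 25 / (26 * 12)" | bc
.0801
# i.e. about 8.01% CPU efficiency; 1.02 MB used of 45.70 GB requested rounds to 0.00%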
Grid Engine (Proteus)
qstat
The qstat command can give detailed information about a job (except for scheduling information). If the job is in state "Eqw", i.e. in error, queued, and waiting, there will be an error reason listed. E.g.
[juser@proteusa01 ~]$ qstat -j 123456
==============================================================
job_number: 123456
jclass: NONE
exec_file: job_scripts/123456
...
sge_o_workdir: /home/juser/myjob
sge_o_host: proteusa01
account: sge
cwd: /home/juser/myjob
...
error reason    1:      02/12/2015 14:31:56 [1062:71145]: execvp(/cm/local/apps/sge/var/spool/ic08n01/job_scripts/123456, "/cm/local/apps/sge/var/spool/ic08n01/job_scripts/123456") failed: No such file or directory
scheduling info: (Collecting of scheduler job information is turned off)
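When a job is stuck in Eqw, it is usually only the error reason that matters. One way to pull just that line out (the grep pattern is only a suggestion):
[juser@proteusa01 ~]$ qstat -j 123456 | grep -i 'error reason'
Once the underlying problem has been fixed (e.g. a missing file or directory restored), the error state can be cleared with qmod -cj 123456 so the job becomes eligible to run again; if the job is beyond saving, delete it with qdel 123456.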
qacct
The main tool for understanding the resource usage of your job is qacct. Once your job has completed, do:
[myname@proteusi01 ~]$ qacct -j job_id
==============================================================
qname all.q
hostname ac06n02.cm.cluster
group myrsrchGrp
owner myname
project myrsrchPrj
department defaultdepartment
jobname thisjob.sh
jobnumber 123456
...
qsub_time 01/21/2015 13:50:48.070
start_time 01/21/2015 13:50:49.003
end_time 01/21/2015 17:04:39.883
granted_pe NONE
slots 1
failed 100 : assumedly after job
deleted_by NONE
exit_status 137
ru_wallclock 11630.880
...
cpu 5810.160
mem 10402.699
io 11.938
iow 0.000
maxvmem 4.742G
...
Much of the output has been omitted. This output shows that the job exited with status 137 (= 128 + 9). Signal 9 is an unconditional kill; for jobs under the scheduler, that indicates the job was terminated by the scheduler, typically for exceeding a resource limit. (Not exactly true, but it suffices for the purposes here.) We also see a maxvmem usage of 4.742 GiB, the actual peak memory usage while the job was running, which can be compared with the job's resource request. Note that h_vmem in a resource request is a per-slot value, whereas the maxvmem reported here is a total over the entire job.
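The exit-status arithmetic can be checked directly at the shell prompt, and maxvmem can be compared against the per-slot request in the same way. The h_vmem value below is only an illustration; substitute whatever your job actually requested:
# exit status 137 = 128 + signal number; which signal is that?
[myname@proteusi01 ~]$ kill -l $((137 - 128))
KILL
# The per-job memory limit is slots x h_vmem. With the single slot granted above and
# an illustrative request of h_vmem=4G, the limit is 1 x 4 GiB = 4 GiB, so a maxvmem
# of 4.742 GiB would exceed it and explain the kill.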
Details of the reported parameters are in the man page for accounting(5).[1]