
Diagnosing Job Problems

When first getting started on the cluster, the resource requirements of a job are often not well known. The scheduler terminates jobs that exceed their resource requests or limits, so that they do not interfere with other jobs running on the same nodes.

Job Scripts vs. One-Liners

While it may be somewhat satisfying to submit jobs with a single command typed out directly at the system prompt, e.g.

sbatch -p bm -o myoutput-%A_%a.out -A myrsrchPrj -N 4 --ntasks-per-node=12 -c 4 --mem-per-cpu=3G -t 24:00:00 --wrap='cd /ifs/groups/myrsrchGrp/thisjob && PATH=./bin:${PATH} && myprog -a arg1 -b arg2 -c arg3 -i /scratch/myname/thisjob/input.dat -o /scratch/myname/thisjob/output.dat'

it really hampers debugging.

Even for the shortest jobs, writing a script helps in the debugging process, and it also keeps a record of what was done. Compare:

#!/bin/bash
#SBATCH -p bm
#SBATCH -o myoutput-%A_%a.out
#SBATCH -A myrsrchPrj
#SBATCH -N 4
#SBATCH --ntasks-per-node=12
#SBATCH -c 4
#SBATCH --mem-per-cpu=3G
#SBATCH -t 24:00:00

cd /ifs/groups/myrsrchGrp/thisjob
export DATADIR=/scratch/myname/thisjob
myprog -a arg1 -b arg2 -c arg3 -i ${DATADIR}/input.dat -o ${DATADIR}/output.dat

"Normal" Debugging

Standard debugging techniques still apply.

  • Use "echo" statements to print the values of relevant environment variables, or simply to show how far the script has progressed
  • Use any verbose or debugging options of the program you are running to print relevant diagnostic data
  • Start with a small job (small in size of input set, short run time, etc.)
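
As a minimal sketch of the first point above, a job script can print a timestamped marker at each stage along with a few environment variables. The helper name log_stage and the stage labels are hypothetical. SLURM_JOB_ID and SLURM_SUBMIT_DIR are set by Slurm inside a job; the fallbacks keep the script runnable outside one.

```shell
#!/bin/bash
# Hypothetical helper: print a timestamped marker so the job output shows
# how far the script got before any failure.
log_stage() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] STAGE: $*"
}

log_stage "start"
# Slurm sets these variables inside a job; the fallbacks apply elsewhere.
echo "SLURM_JOB_ID=${SLURM_JOB_ID:-unset}"
echo "SLURM_SUBMIT_DIR=${SLURM_SUBMIT_DIR:-unset}"
echo "node=$(uname -n)"
echo "PATH=${PATH}"

log_stage "running program"
# ... the actual program invocation would go here ...
log_stage "done"
```

If the job fails, the last marker in the output file shows which stage it reached.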

Slurm (Picotte)


The sacct command shows information about a job after it has completed.

[juser@picotte001 ~]$ sacct -j 1234567
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1234567_5567 evaluate_+        def    someprj          3     FAILED      2:0
1234567_556+      batch               someprj          3     FAILED      2:0
1234567_556+     extern               someprj          3  COMPLETED      0:0
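
The ExitCode column has the form status:signal, so "2:0" above means the job exited with status 2 and was not terminated by a signal. A minimal sketch of splitting that field, using a hypothetical helper named decode_exitcode:

```shell
#!/bin/bash
# Hypothetical helper: split a sacct ExitCode field ("status:signal").
decode_exitcode() {
    local status="${1%%:*}"   # text before the colon
    local signal="${1##*:}"   # text after the colon
    if [ "$signal" -ne 0 ]; then
        echo "terminated by signal ${signal}"
    elif [ "$status" -ne 0 ]; then
        echo "exited with non-zero status ${status}"
    else
        echo "exited normally"
    fi
}

decode_exitcode "2:0"   # prints: exited with non-zero status 2
decode_exitcode "0:0"   # prints: exited normally
```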

The slurm_util modulefile provides some aliases for Slurm commands with more informative options.

[juser@picotte001 ~]$ module load slurm_util
[juser@picotte001 ~]$ sacct_detail -j 1234567
               JobID    JobName      User  Partition        NodeList    Elapsed      State ExitCode     ReqMem     MaxRSS  MaxVMSize                        AllocTRES
-------------------- ---------- --------- ---------- --------------- ---------- ---------- -------- ---------- ---------- ---------- --------------------------------
        1234567_5567 evaluate_+    myname        def         node061   00:00:00     FAILED      2:0         4M                          billing=3,cpu=3,mem=4M,node=1
  1234567_5567.batch      batch                              node061   00:00:00     FAILED      2:0                  856K    157440K              cpu=3,mem=4M,node=1
 1234567_5567.extern     extern                              node061   00:00:00  COMPLETED      0:0                    4K    157364K    billing=3,cpu=3,mem=4M,node=1


The seff command is undocumented. It shows the efficiency of a completed job. (Data shown for a running job may be inaccurate.)

[juser@picotte001 ~]$ seff 12345
Job ID: 12345
Cluster: picotte
User/Group: juser/juser
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 12
CPU Utilized: 00:00:25
CPU Efficiency: 8.01% of 00:05:12 core-walltime
Job Wall-clock time: 00:00:26
Memory Utilized: 1.02 MB
Memory Efficiency: 0.00% of 45.70 GB
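
The CPU Efficiency figure can be checked by hand: core-walltime is the wall-clock time multiplied by the number of cores (26 s × 12 = 312 s = 00:05:12), and 25 s of CPU time out of 312 s is about 8.01%. The sketch below reproduces that arithmetic. The helper names to_seconds and cpu_efficiency are hypothetical and are not part of seff.

```shell
#!/bin/bash
# Hypothetical helpers reproducing the CPU Efficiency figure from seff output.

# Convert HH:MM:SS to seconds.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Usage: cpu_efficiency CPU_UTILIZED WALL_CLOCK CORES
# Prints the efficiency as a percentage with two decimals.
cpu_efficiency() {
    local cpu wall
    cpu=$(to_seconds "$1")
    wall=$(to_seconds "$2")
    awk -v c="$cpu" -v w="$wall" -v n="$3" \
        'BEGIN { printf "%.2f\n", 100 * c / (w * n) }'
}

cpu_efficiency 00:00:25 00:00:26 12   # prints 8.01
```

A low figure like this one suggests the job requested far more cores than it used.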

For array jobs, use the seff_array command.

Grid Engine (Proteus)


The qstat command can give detailed information about the job (except for scheduling information). If the job is in state "Eqw", i.e. in error, queued, and waiting, there will be an error reason listed. E.g.

[juser@proteusa01 ~]$ qstat -j 123456
==============================================================
job_number:                 123456
jclass:                     NONE
exec_file:                  job_scripts/123456
...
sge_o_workdir:              /home/juser/myjob
sge_o_host:                 proteusa01
account:                    sge
cwd:                        /home/jmyjob
...
error reason    1:          02/12/2015 14:31:56 [1062:71145]: execvp(/cm/local/apps/sge/var/spool/ic08n01/job_scripts/123456, "/cm/local/apps/sge/var/spool/ic08n01/job_scripts/123456") failed: No such file or directory
scheduling info:            (Collecting of scheduler job information is turned off)


The main tool in understanding the resource usage for your job is qacct. Once your job has completed, do:

[myname@proteusi01 ~]$ qacct -j job_id
==============================================================
qname        all.q
hostname
group        myrsrchGrp
owner        myname
project      myrsrchPrj
department   defaultdepartment
jobname
jobnumber    123456
...
qsub_time    01/21/2015 13:50:48.070
start_time   01/21/2015 13:50:49.003
end_time     01/21/2015 17:04:39.883
granted_pe   NONE
slots        1
failed       100 : assumedly after job
deleted_by   NONE
exit_status  137
ru_wallclock 11630.880
...
cpu          5810.160
mem          10402.699
io           11.938
iow          0.000
maxvmem      4.742G
...

Much of the output is omitted here. The job exited with status 137 (= 128 + 9), which means it was killed by signal 9 (SIGKILL), an unconditional kill. For a job running under the scheduler, that indicates that the scheduler terminated the job. (Not exactly true, but it suffices for the purposes here.)

The maxvmem value of 4.742 GiB is the peak virtual memory actually used while the job was running, and it can be compared with the job's resource request. Note that h_vmem in a resource request is a per-slot value, whereas the maxvmem reported here is a total over the entire job.
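
The "128 + signal number" convention can be checked with a short sketch. The helper name explain_exit_status is hypothetical. The kill -l builtin, which maps a signal number to its name, is standard in bash.

```shell
#!/bin/bash
# Hypothetical helper: interpret an exit_status value reported by qacct.
# By shell convention, a status above 128 means the process was killed
# by signal number (status - 128).
explain_exit_status() {
    local status="$1"
    if [ "$status" -gt 128 ]; then
        local signum=$((status - 128))
        echo "killed by signal ${signum} (SIG$(kill -l "$signum"))"
    elif [ "$status" -eq 0 ]; then
        echo "success"
    else
        echo "program exited with status ${status}"
    fi
}

explain_exit_status 137   # prints: killed by signal 9 (SIGKILL)
```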

Details of the reported parameters are in the man page for accounting(5).[1]


[1] Miscellaneous Tips#man pages