Slurm Utility Commands
Installed on Picotte is a small set of aliases and scripts which provide
some useful defaults for commands such as sacct and sinfo.
To use them, load the modulefile:
slurm_util
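For example:
[juser@picotte001 ~]$ module load slurm_util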
Commands/Aliases Available
sinfo_detail
Defines a set of parameters for sinfo.[1] This shows the current state of all nodes:
[juser@picotte001 ~]$ sinfo_detail
NODELIST NODES PART STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
bigmem001 1 bm draining 48 4:12:1 1546000 1724000 1 (null) Maintenance
bigmem002 1 bm idle 48 4:12:1 1546000 1724000 1 (null) none
gpu001 1 gpu idle 48 4:12:1 192000 1724000 1 (null) none
gpu001 1 gpul idle 48 4:12:1 192000 1724000 1 (null) none
gpu002 1 gpu idle 48 4:12:1 192000 1724000 1 (null) none
...
gpu012 1 gpu down* 48 4:12:1 192000 1724000 1 (null) Not responding
gpu012 1 gpul down* 48 4:12:1 192000 1724000 1 (null) Not responding
node001 1 long allocated 48 4:12:1 192000 874000 1 (null) none
node001 1 def* allocated 48 4:12:1 192000 874000 1 (null) none
node002 1 long mixed 48 4:12:1 192000 874000 1 (null) none
node002 1 def* mixed 48 4:12:1 192000 874000 1 (null) none
node003 1 long mixed 48 4:12:1 192000 874000 1 (null) none
node003 1 def* mixed 48 4:12:1 192000 874000 1 (null) none
node004 1 long idle 48 4:12:1 192000 874000 1 (null) none
node004 1 def* idle 48 4:12:1 192000 874000 1 (null) none
...
node074 1 long idle 48 4:12:1 192000 874000 1 (null) none
node074 1 def* idle 48 4:12:1 192000 874000 1 (null) none
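For comparison, similar per-node detail can be produced directly with sinfo's node-oriented long output. This is only an approximation; the exact options defined by the sinfo_detail alias may differ:
[juser@picotte001 ~]$ sinfo --Node --long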
sacct_detail
This defines a number of parameters to display in the output of sacct.[2]
NB: the output of sacct may be inaccurate while a job is running. It is most useful after a job has concluded, successfully or not.
[juser@picotte001 ~]$ sacct_detail -j 12345
JobID JobName User Partition NodeList Elapsed State ExitCode MaxRSS MaxVMSize AllocTRES AllocGRE
-------------------- ---------- --------- ---------- --------------- ---------- ---------- -------- ---------- ---------- -------------------------------- --------
12345 program juser def node003 02:58:35 RUNNING 0:0 billing=12,cpu=12,node=1
12345.batch batch node003 02:58:35 RUNNING 0:0 cpu=12,mem=0,node=1
12345.extern extern node003 02:58:35 RUNNING 0:0 billing=12,cpu=12,node=1
[juser@picotte001 ~]$ sacct_detail -j 80273
JobID JobName User Partition NodeList Elapsed State ExitCode MaxRSS MaxVMSize AllocTRES AllocGRE
-------------------- ---------- --------- ---------- --------------- ---------- ---------- -------- ---------- ---------- -------------------------------- --------
80273_1 hplcuda_8+ juser gpu gpu001 00:00:03 COMPLETED 0:0 billing=172,cpu=4,gres/gpu=4,no+ gpu:4
80273_1.batch batch gpu001 00:00:03 COMPLETED 0:0 1048K 153376K cpu=4,mem=0,node=1 gpu:4
80273_1.extern extern gpu001 00:00:03 COMPLETED 0:0 496K 153196K billing=172,cpu=4,gres/gpu=4,no+ gpu:4
80273_2 hplcuda_8+ juser gpu gpu002 00:00:03 COMPLETED 0:0 billing=172,cpu=4,gres/gpu=4,no+ gpu:4
80273_2.batch batch gpu002 00:00:03 COMPLETED 0:0 1049K 153376K cpu=4,mem=0,node=1 gpu:4
80273_2.extern extern gpu002 00:00:03 COMPLETED 0:0 484K 153196K billing=172,cpu=4,gres/gpu=4,no+ gpu:4
80273_3 hplcuda_8+ juser gpu gpu007 00:00:03 COMPLETED 0:0 billing=172,cpu=4,gres/gpu=4,no+ gpu:4
80273_3.batch batch gpu007 00:00:03 COMPLETED 0:0 1048K 153376K cpu=4,mem=0,node=1 gpu:4
80273_3.extern extern gpu007 00:00:03 COMPLETED 0:0 444K 153196K billing=172,cpu=4,gres/gpu=4,no+ gpu:4
80273_4 hplcuda_8+ juser gpu gpu003 00:00:03 COMPLETED 0:0 billing=172,cpu=4,gres/gpu=4,no+ gpu:4
80273_4.batch batch gpu003 00:00:03 COMPLETED 0:0 1049K 153376K cpu=4,mem=0,node=1 gpu:4
80273_4.extern extern gpu003 00:00:03 COMPLETED 0:0 8K 153196K billing=172,cpu=4,gres/gpu=4,no+ gpu:4
80273_5 hplcuda_8+ juser gpu gpu008 00:00:03 COMPLETED 0:0 billing=172,cpu=4,gres/gpu=4,no+ gpu:4
80273_5.batch batch gpu008 00:00:03 COMPLETED 0:0 1048K 153376K cpu=4,mem=0,node=1 gpu:4
80273_5.extern extern gpu008 00:00:03 COMPLETED 0:0 482K 153196K billing=172,cpu=4,gres/gpu=4,no+ gpu:4
80273_6 hplcuda_8+ dwc62 gpu gpu004 00:00:03 COMPLETED 0:0 billing=172,cpu=4,gres/gpu=4,no+ gpu:4
80273_6.batch batch gpu004 00:00:03 COMPLETED 0:0 1048K 153376K cpu=4,mem=0,node=1 gpu:4
80273_6.extern extern gpu004 00:00:03 COMPLETED 0:0 4K 153196K billing=172,cpu=4,gres/gpu=4,no+ gpu:4
80273_7 hplcuda_8+ juser gpu gpu009 00:00:03 COMPLETED 0:0 billing=172,cpu=4,gres/gpu=4,no+ gpu:4
80273_7.batch batch gpu009 00:00:03 COMPLETED 0:0 1049K 153376K cpu=4,mem=0,node=1 gpu:4
80273_7.extern extern gpu009 00:00:03 COMPLETED 0:0 480K 153196K billing=172,cpu=4,gres/gpu=4,no+ gpu:4
80273_8 hplcuda_8+ juser gpu gpu005 00:00:03 COMPLETED 0:0 billing=172,cpu=4,gres/gpu=4,no+ gpu:4
80273_8.batch batch gpu005 00:00:03 COMPLETED 0:0 1049K 153376K cpu=4,mem=0,node=1 gpu:4
80273_8.extern extern gpu005 00:00:03 COMPLETED 0:0 8K 153196K billing=172,cpu=4,gres/gpu=4,no+ gpu:4
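The columns above correspond to standard sacct format fields, so a roughly equivalent manual invocation would look like the sketch below (an approximation, not necessarily the alias's exact definition):
[juser@picotte001 ~]$ sacct -j 12345 --format=JobID,JobName,User,Partition,NodeList,Elapsed,State,ExitCode,MaxRSS,MaxVMSize,AllocTRES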
NOTE
- In the sacct output, look at the ".extern" line. An exit code of "0:0" means the job completed successfully.
"OUT_OF_MEMORY" states in lines before the ".extern" line can be safely disregarded.
For example, see the lines for array task 12345_3 in the output below:
[juser@picotte001 ~]$ module load slurm_util
[juser@picotte001 ~]$ sacct_detail -j 12345
JobID JobName User Partition NodeList Elapsed State ExitCode MaxRSS MaxVMSize AllocTRES AllocGRE
-------------------- ---------- --------- ---------- --------------- ---------- ---------- -------- ---------- ---------- -------------------------------- --------
12345_1 Run.sh juser bm bigmem002 00:06:18 COMPLETED 0:0 cpu=12,node=1
12345_1.batch batch bigmem002 00:06:18 COMPLETED 0:0 4843835K 5222088K cpu=12,mem=0,node=1
12345_1.extern extern bigmem002 00:06:18 COMPLETED 0:0 488K 153196K cpu=12,node=1
12345_2 Run.sh juser bm bigmem002 00:05:59 COMPLETED 0:0 cpu=12,node=1
12345_2.batch batch bigmem002 00:05:59 COMPLETED 0:0 4180691K 4558920K cpu=12,mem=0,node=1
12345_2.extern extern bigmem002 00:05:59 COMPLETED 0:0 143K 153196K cpu=12,node=1
12345_3 Run.sh juser bm bigmem002 00:03:53 OUT_OF_ME+ 0:125 cpu=12,node=1
12345_3.batch batch bigmem002 00:03:53 OUT_OF_ME+ 0:125 1539337K 1917516K cpu=12,mem=0,node=1
12345_3.extern extern bigmem002 00:03:53 COMPLETED 0:0 142K 153196K cpu=12,node=1
12345_4 Run.sh juser bm bigmem002 00:03:13 COMPLETED 0:0 cpu=12,node=1
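To check a single array task's state and exit code directly (and then read its ".extern" row), a plain sacct query with standard options can be used; the task ID below is taken from the example above:
[juser@picotte001 ~]$ sacct -j 12345_3 --format=JobID,State,ExitCode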
squeue_detail
squeue_detail shows a couple more fields than the default squeue output.
squeue_long
squeue_long additionally shows MAX_CPUS and TRES (trackable resources), including billing.
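For reference, output along these lines can also be requested directly with squeue's --Format/-O option, which accepts field names such as maxcpus and tres-alloc. The sketch below is an approximation, not the alias's exact definition:
[juser@picotte001 ~]$ squeue -O jobid,partition,name,username,state,timeused,numnodes,maxcpus,tres-alloc,nodelist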
seff_array
seff_array[3] is a utility created by the Yale Center for Research Computing. It runs seff on a job array and displays histograms in the terminal. (seff is undocumented: it takes one argument, a job ID.)
Some advice on how to adjust your job is given in the output.
[juser@picotte ~]$ seff_array 123456
========== Max Memory Usage ==========
# NumSamples = 90; Min = 896.29 MB; Max = 900.48 MB
# Mean = 897.77 MB; Variance = 0.40 MB; SD = 0.63 MB; Median 897.78 MB
# each ∎ represents a count of 1
806.6628 - 896.7108 MB [ 2]: ∎∎
896.7108 - 897.1296 MB [ 9]: ∎∎∎∎∎∎∎∎∎
897.1296 - 897.5484 MB [ 21]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
897.5484 - 897.9672 MB [ 34]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
897.9672 - 898.3860 MB [ 15]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
898.3860 - 898.8048 MB [ 4]: ∎∎∎∎
898.8048 - 899.2236 MB [ 1]: ∎
899.2236 - 899.6424 MB [ 3]: ∎∎∎
899.6424 - 900.0612 MB [ 0]:
900.0612 - 900.4800 MB [ 1]: ∎
The requested memory was 2000MB.
========== Elapsed Time ==========
# NumSamples = 90; Min = 00:03:25.0; Max = 00:07:24.0
# Mean = 00:05:45.0; SD = 00:01:39.0; Median 00:06:44.0
# each ∎ represents a count of 1
00:03:25.0 - 00:03:48.0 [ 30]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
00:03:48.0 - 00:04:11.0 [ 0]:
00:04:11.0 - 00:04:34.0 [ 0]:
00:04:34.0 - 00:04:57.0 [ 0]:
00:04:57.0 - 00:05:20.0 [ 0]:
00:05:20.0 - 00:05:43.0 [ 0]:
00:05:43.0 - 00:06:6.0 [ 0]:
00:06:6.0 - 00:06:29.0 [ 0]:
00:06:29.0 - 00:06:52.0 [ 30]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
00:06:52.0 - 00:07:15.0 [ 28]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
********************************************************************************
The requested runtime was 01:00:00.
The average runtime was 00:05:45.0.
Requesting less time would allow jobs to run more quickly.
********************************************************************************
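For an individual (non-array) job, seff itself can be run directly; as noted above, it takes one argument, a job ID (the ID below is just for illustration):
[juser@picotte001 ~]$ seff 12345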
References
[1] Slurm Documentation - sinfo
[2] Slurm Documentation - sacct
[3] seff-array - Yale Center for Research Computing