Slurm Utility Commands
Installed on Picotte is a small set of aliases and scripts which provide
some useful defaults for commands such as sacct and sinfo.
To use them, load the modulefile:
slurm_util
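For example:
[juser@picotte001 ~]$ module load slurm_util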
Commands/Aliases Available
sinfo_detail
Defines a set of parameters for sinfo.[1] This shows the current state of all nodes:
[juser@picotte001 ~]$ sinfo_detail
NODELIST NODES PART STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
bigmem001 1 bm draining 48 4:12:1 1546000 1724000 1 (null) Maintenance
bigmem002 1 bm idle 48 4:12:1 1546000 1724000 1 (null) none
gpu001 1 gpu idle 48 4:12:1 192000 1724000 1 (null) none
gpu001 1 gpul idle 48 4:12:1 192000 1724000 1 (null) none
gpu002 1 gpu idle 48 4:12:1 192000 1724000 1 (null) none
...
gpu012 1 gpu down* 48 4:12:1 192000 1724000 1 (null) Not responding
gpu012 1 gpul down* 48 4:12:1 192000 1724000 1 (null) Not responding
node001 1 long allocated 48 4:12:1 192000 874000 1 (null) none
node001 1 def* allocated 48 4:12:1 192000 874000 1 (null) none
node002 1 long mixed 48 4:12:1 192000 874000 1 (null) none
node002 1 def* mixed 48 4:12:1 192000 874000 1 (null) none
node003 1 long mixed 48 4:12:1 192000 874000 1 (null) none
node003 1 def* mixed 48 4:12:1 192000 874000 1 (null) none
node004 1 long idle 48 4:12:1 192000 874000 1 (null) none
node004 1 def* idle 48 4:12:1 192000 874000 1 (null) none
...
node074 1 long idle 48 4:12:1 192000 874000 1 (null) none
node074 1 def* idle 48 4:12:1 192000 874000 1 (null) none
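For comparison, similar per-node detail can be produced directly with sinfo's node-oriented long output. This is only an approximation; the exact options defined by the sinfo_detail alias may differ:
[juser@picotte001 ~]$ sinfo --Node --long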
sacct_detail
This defines a number of parameters to display in the output of sacct.[2]
NB: the output of sacct may be inaccurate while a job is running. It is most useful after a job has concluded, successfully or not.
[juser@picotte001 ~]$ sacct_detail -j 12345
JobID JobName User Partition NodeList Elapsed State ExitCode MaxRSS MaxVMSize AllocTRES AllocGRE
-------------------- ---------- --------- ---------- --------------- ---------- ---------- -------- ---------- ---------- -------------------------------- --------
12345 program juser def node003 02:58:35 RUNNING 0:0 billing=12,cpu=12,node=1
12345.batch batch node003 02:58:35 RUNNING 0:0 cpu=12,mem=0,node=1
12345.extern extern node003 02:58:35 RUNNING 0:0 billing=12,cpu=12,node=1
[juser@picotte001 ~]$ sacct_detail -j 80273
JobID JobName User Partition NodeList Elapsed State ExitCode MaxRSS MaxVMSize AllocTRES AllocGRE
-------------------- ---------- --------- ---------- --------------- ---------- ---------- -------- ---------- ---------- -------------------------------- --------
80273_1 hplcuda_8+ juser gpu gpu001 00:00:03 COMPLETED 0:0 billing=172,cpu=4,gres/gpu=4,no+ gpu:4
80273_1.batch batch gpu001 00:00:03 COMPLETED 0:0 1048K 153376K cpu=4,mem=0,node=1 gpu:4
80273_1.extern extern gpu001 00:00:03 COMPLETED 0:0 496K 153196K billing=172,cpu=4,gres/gpu=4,no+ gpu:4
80273_2 hplcuda_8+ juser gpu gpu002 00:00:03 COMPLETED 0:0 billing=172,cpu=4,gres/gpu=4,no+ gpu:4
80273_2.batch batch gpu002 00:00:03 COMPLETED 0:0 1049K 153376K cpu=4,mem=0,node=1 gpu:4
80273_2.extern extern gpu002 00:00:03 COMPLETED 0:0 484K 153196K billing=172,cpu=4,gres/gpu=4,no+ gpu:4
80273_3 hplcuda_8+ juser gpu gpu007 00:00:03 COMPLETED 0:0 billing=172,cpu=4,gres/gpu=4,no+ gpu:4
80273_3.batch batch gpu007 00:00:03 COMPLETED 0:0 1048K 153376K cpu=4,mem=0,node=1 gpu:4
80273_3.extern extern gpu007 00:00:03 COMPLETED 0:0 444K 153196K billing=172,cpu=4,gres/gpu=4,no+ gpu:4
80273_4 hplcuda_8+ juser gpu gpu003 00:00:03 COMPLETED 0:0 billing=172,cpu=4,gres/gpu=4,no+ gpu:4
80273_4.batch batch gpu003 00:00:03 COMPLETED 0:0 1049K 153376K cpu=4,mem=0,node=1 gpu:4
80273_4.extern extern gpu003 00:00:03 COMPLETED 0:0 8K 153196K billing=172,cpu=4,gres/gpu=4,no+ gpu:4
80273_5 hplcuda_8+ juser gpu gpu008 00:00:03 COMPLETED 0:0 billing=172,cpu=4,gres/gpu=4,no+ gpu:4
80273_5.batch batch gpu008 00:00:03 COMPLETED 0:0 1048K 153376K cpu=4,mem=0,node=1 gpu:4
80273_5.extern extern gpu008 00:00:03 COMPLETED 0:0 482K 153196K billing=172,cpu=4,gres/gpu=4,no+ gpu:4
80273_6 hplcuda_8+ dwc62 gpu gpu004 00:00:03 COMPLETED 0:0 billing=172,cpu=4,gres/gpu=4,no+ gpu:4
80273_6.batch batch gpu004 00:00:03 COMPLETED 0:0 1048K 153376K cpu=4,mem=0,node=1 gpu:4
80273_6.extern extern gpu004 00:00:03 COMPLETED 0:0 4K 153196K billing=172,cpu=4,gres/gpu=4,no+ gpu:4
80273_7 hplcuda_8+ juser gpu gpu009 00:00:03 COMPLETED 0:0 billing=172,cpu=4,gres/gpu=4,no+ gpu:4
80273_7.batch batch gpu009 00:00:03 COMPLETED 0:0 1049K 153376K cpu=4,mem=0,node=1 gpu:4
80273_7.extern extern gpu009 00:00:03 COMPLETED 0:0 480K 153196K billing=172,cpu=4,gres/gpu=4,no+ gpu:4
80273_8 hplcuda_8+ juser gpu gpu005 00:00:03 COMPLETED 0:0 billing=172,cpu=4,gres/gpu=4,no+ gpu:4
80273_8.batch batch gpu005 00:00:03 COMPLETED 0:0 1049K 153376K cpu=4,mem=0,node=1 gpu:4
80273_8.extern extern gpu005 00:00:03 COMPLETED 0:0 8K 153196K billing=172,cpu=4,gres/gpu=4,no+ gpu:4
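The columns above correspond to standard sacct format fields, so a roughly equivalent manual invocation would look like the sketch below (an approximation, not necessarily the alias's exact definition):
[juser@picotte001 ~]$ sacct -j 12345 --format=JobID,JobName,User,Partition,NodeList,Elapsed,State,ExitCode,MaxRSS,MaxVMSize,AllocTRES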
NOTE
- In the sacct output, look at the ".extern" line. An exit code of "0:0" means the job completed successfully.
"OUT_OF_MEMORY" states in lines before the ".extern" line can be safely disregarded.
For example, see the lines for array task 12345_3 in the output below:
[juser@picotte001 ~]$ module load slurm_util
[juser@picotte001 ~]$ sacct_detail -j 12345
JobID JobName User Partition NodeList Elapsed State ExitCode MaxRSS MaxVMSize AllocTRES AllocGRE
-------------------- ---------- --------- ---------- --------------- ---------- ---------- -------- ---------- ---------- -------------------------------- --------
12345_1 Run.sh juser bm bigmem002 00:06:18 COMPLETED 0:0 cpu=12,node=1
12345_1.batch batch bigmem002 00:06:18 COMPLETED 0:0 4843835K 5222088K cpu=12,mem=0,node=1
12345_1.extern extern bigmem002 00:06:18 COMPLETED 0:0 488K 153196K cpu=12,node=1
12345_2 Run.sh juser bm bigmem002 00:05:59 COMPLETED 0:0 cpu=12,node=1
12345_2.batch batch bigmem002 00:05:59 COMPLETED 0:0 4180691K 4558920K cpu=12,mem=0,node=1
12345_2.extern extern bigmem002 00:05:59 COMPLETED 0:0 143K 153196K cpu=12,node=1
12345_3 Run.sh juser bm bigmem002 00:03:53 OUT_OF_ME+ 0:125 cpu=12,node=1
12345_3.batch batch bigmem002 00:03:53 OUT_OF_ME+ 0:125 1539337K 1917516K cpu=12,mem=0,node=1
12345_3.extern extern bigmem002 00:03:53 COMPLETED 0:0 142K 153196K cpu=12,node=1
12345_4 Run.sh juser bm bigmem002 00:03:13 COMPLETED 0:0 cpu=12,node=1
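To check a single array task's state and exit code directly (and then read its ".extern" row), a plain sacct query with standard options can be used; the task ID below is taken from the example above:
[juser@picotte001 ~]$ sacct -j 12345_3 --format=JobID,State,ExitCode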
squeue_detail
squeue_detail shows a couple more fields than the default squeue output.
squeue_long
squeue_long additionally shows MAX_CPUS and TRES (trackable resources), including billing.
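For reference, output along these lines can also be requested directly with squeue's --Format/-O option, which accepts field names such as maxcpus and tres-alloc. The sketch below is an approximation, not the alias's exact definition:
[juser@picotte001 ~]$ squeue -O jobid,partition,name,username,state,timeused,numnodes,maxcpus,tres-alloc,nodelist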
seff_array
seff_array[3] is a utility created by the Yale Center for Research Computing. It runs seff on a job array and displays histograms in the terminal. (seff is undocumented: it takes one argument, a job ID.)
Some advice on how to adjust your job is given in the output.
[juser@picotte ~]$ seff_array 123456
========== Max Memory Usage ==========
# NumSamples = 90; Min = 896.29 MB; Max = 900.48 MB
# Mean = 897.77 MB; Variance = 0.40 MB; SD = 0.63 MB; Median 897.78 MB
# each ∎ represents a count of 1
806.6628 - 896.7108 MB [ 2]: ∎∎
896.7108 - 897.1296 MB [ 9]: ∎∎∎∎∎∎∎∎∎
897.1296 - 897.5484 MB [ 21]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
897.5484 - 897.9672 MB [ 34]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
897.9672 - 898.3860 MB [ 15]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
898.3860 - 898.8048 MB [ 4]: ∎∎∎∎
898.8048 - 899.2236 MB [ 1]: ∎
899.2236 - 899.6424 MB [ 3]: ∎∎∎
899.6424 - 900.0612 MB [ 0]:
900.0612 - 900.4800 MB [ 1]: ∎
The requested memory was 2000MB.
========== Elapsed Time ==========
# NumSamples = 90; Min = 00:03:25.0; Max = 00:07:24.0
# Mean = 00:05:45.0; SD = 00:01:39.0; Median 00:06:44.0
# each ∎ represents a count of 1
00:03:25.0 - 00:03:48.0 [ 30]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
00:03:48.0 - 00:04:11.0 [ 0]:
00:04:11.0 - 00:04:34.0 [ 0]:
00:04:34.0 - 00:04:57.0 [ 0]:
00:04:57.0 - 00:05:20.0 [ 0]:
00:05:20.0 - 00:05:43.0 [ 0]:
00:05:43.0 - 00:06:6.0 [ 0]:
00:06:6.0 - 00:06:29.0 [ 0]:
00:06:29.0 - 00:06:52.0 [ 30]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
00:06:52.0 - 00:07:15.0 [ 28]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
********************************************************************************
The requested runtime was 01:00:00.
The average runtime was 00:05:45.0.
Requesting less time would allow jobs to run more quickly.
********************************************************************************
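For an individual (non-array) job, seff itself can be run directly; as noted above, it takes one argument, a job ID (the ID below is just for illustration):
[juser@picotte001 ~]$ seff 12345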
References
[1] Slurm Documentation - sinfo
[2] Slurm Documentation - sacct
[3] seff-array - Yale Center for Research Computing