Installed on Picotte is a small set of aliases and scripts that provide useful defaults for commands such as sacct and sinfo.

To use them, load the slurm_util modulefile:

``` text
$ module load slurm_util
```

Commands/Aliases Available

sinfo_detail

Defines a set of parameters for sinfo.[1] This shows the current state of all nodes:

``` text
[juser@picotte001 ~]$ sinfo_detail
NODELIST   NODES PART  STATE      CPUS S:C:T  MEMORY  TMP_DISK WEIGHT AVAIL_FE REASON
bigmem001  1     bm    draining   48   4:12:1 1546000 1724000  1      (null)   Maintenance
bigmem002  1     bm    idle       48   4:12:1 1546000 1724000  1      (null)   none
gpu001     1     gpu   idle       48   4:12:1 192000  1724000  1      (null)   none
gpu001     1     gpul  idle       48   4:12:1 192000  1724000  1      (null)   none
gpu002     1     gpu   idle       48   4:12:1 192000  1724000  1      (null)   none
...
gpu012     1     gpu   down*      48   4:12:1 192000  1724000  1      (null)   Not responding
gpu012     1     gpul  down*      48   4:12:1 192000  1724000  1      (null)   Not responding
node001    1     long  allocated  48   4:12:1 192000  874000   1      (null)   none
node001    1     def*  allocated  48   4:12:1 192000  874000   1      (null)   none
node002    1     long  mixed      48   4:12:1 192000  874000   1      (null)   none
node002    1     def*  mixed      48   4:12:1 192000  874000   1      (null)   none
node003    1     long  mixed      48   4:12:1 192000  874000   1      (null)   none
node003    1     def*  mixed      48   4:12:1 192000  874000   1      (null)   none
node004    1     long  idle       48   4:12:1 192000  874000   1      (null)   none
node004    1     def*  idle       48   4:12:1 192000  874000   1      (null)   none
...
node074    1     long  idle       48   4:12:1 192000  874000   1      (null)   none
node074    1     def*  idle       48   4:12:1 192000  874000   1      (null)   none
```
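
For reference, sinfo's built-in node-oriented long format produces essentially the same columns. The exact format string used by the alias is defined in the slurm_util module, so treat this as an approximation:

``` bash
# Roughly equivalent to sinfo_detail (the alias's exact definition may differ):
sinfo --Node --long
```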

sacct_detail

This defines a number of parameters to display in the output of sacct.[2]

N.B. the output of sacct may be inaccurate while a job is still running; it is most useful after a job has concluded, successfully or not.

``` text
[juser@picotte001 ~]$ sacct_detail -j 12345
           JobID    JobName      User  Partition       NodeList    Elapsed      State ExitCode     MaxRSS  MaxVMSize                        AllocTRES   AllocGRE
           12345    program     juser        def         node003   02:58:35    RUNNING      0:0                               billing=12,cpu=12,node=1
     12345.batch      batch                              node003   02:58:35    RUNNING      0:0                                    cpu=12,mem=0,node=1
    12345.extern     extern                              node003   02:58:35    RUNNING      0:0                               billing=12,cpu=12,node=1

```
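
The same columns can be requested from sacct directly via its --format option. A minimal sketch, assuming this field list (the module's actual definition may differ):

``` bash
# Approximate equivalent of sacct_detail; the field list is an assumption:
sacct -j 12345 \
    --format=JobID,JobName,User,Partition,NodeList,Elapsed,State,ExitCode,MaxRSS,MaxVMSize,AllocTRES
```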

``` text
[juser@picotte001 ~]$ sacct_detail -j 80273
           JobID    JobName      User  Partition       NodeList    Elapsed      State ExitCode     MaxRSS  MaxVMSize                        AllocTRES   AllocGRE
         80273_1 hplcuda_8+     juser        gpu          gpu001   00:00:03  COMPLETED      0:0                       billing=172,cpu=4,gres/gpu=4,no+    gpu:4
   80273_1.batch      batch                               gpu001   00:00:03  COMPLETED      0:0      1048K    153376K               cpu=4,mem=0,node=1    gpu:4
  80273_1.extern     extern                               gpu001   00:00:03  COMPLETED      0:0       496K    153196K billing=172,cpu=4,gres/gpu=4,no+    gpu:4
         80273_2 hplcuda_8+     juser        gpu          gpu002   00:00:03  COMPLETED      0:0                       billing=172,cpu=4,gres/gpu=4,no+    gpu:4
   80273_2.batch      batch                               gpu002   00:00:03  COMPLETED      0:0      1049K    153376K               cpu=4,mem=0,node=1    gpu:4
  80273_2.extern     extern                               gpu002   00:00:03  COMPLETED      0:0       484K    153196K billing=172,cpu=4,gres/gpu=4,no+    gpu:4
         80273_3 hplcuda_8+     juser        gpu          gpu007   00:00:03  COMPLETED      0:0                       billing=172,cpu=4,gres/gpu=4,no+    gpu:4
   80273_3.batch      batch                               gpu007   00:00:03  COMPLETED      0:0      1048K    153376K               cpu=4,mem=0,node=1    gpu:4
  80273_3.extern     extern                               gpu007   00:00:03  COMPLETED      0:0       444K    153196K billing=172,cpu=4,gres/gpu=4,no+    gpu:4
         80273_4 hplcuda_8+     juser        gpu          gpu003   00:00:03  COMPLETED      0:0                       billing=172,cpu=4,gres/gpu=4,no+    gpu:4
   80273_4.batch      batch                               gpu003   00:00:03  COMPLETED      0:0      1049K    153376K               cpu=4,mem=0,node=1    gpu:4
  80273_4.extern     extern                               gpu003   00:00:03  COMPLETED      0:0         8K    153196K billing=172,cpu=4,gres/gpu=4,no+    gpu:4
         80273_5 hplcuda_8+     juser        gpu          gpu008   00:00:03  COMPLETED      0:0                       billing=172,cpu=4,gres/gpu=4,no+    gpu:4
   80273_5.batch      batch                               gpu008   00:00:03  COMPLETED      0:0      1048K    153376K               cpu=4,mem=0,node=1    gpu:4
  80273_5.extern     extern                               gpu008   00:00:03  COMPLETED      0:0       482K    153196K billing=172,cpu=4,gres/gpu=4,no+    gpu:4
         80273_6 hplcuda_8+     juser        gpu          gpu004   00:00:03  COMPLETED      0:0                       billing=172,cpu=4,gres/gpu=4,no+    gpu:4
   80273_6.batch      batch                               gpu004   00:00:03  COMPLETED      0:0      1048K    153376K               cpu=4,mem=0,node=1    gpu:4
  80273_6.extern     extern                               gpu004   00:00:03  COMPLETED      0:0         4K    153196K billing=172,cpu=4,gres/gpu=4,no+    gpu:4
         80273_7 hplcuda_8+     juser        gpu          gpu009   00:00:03  COMPLETED      0:0                       billing=172,cpu=4,gres/gpu=4,no+    gpu:4
   80273_7.batch      batch                               gpu009   00:00:03  COMPLETED      0:0      1049K    153376K               cpu=4,mem=0,node=1    gpu:4
  80273_7.extern     extern                               gpu009   00:00:03  COMPLETED      0:0       480K    153196K billing=172,cpu=4,gres/gpu=4,no+    gpu:4
         80273_8 hplcuda_8+     juser        gpu          gpu005   00:00:03  COMPLETED      0:0                       billing=172,cpu=4,gres/gpu=4,no+    gpu:4
   80273_8.batch      batch                               gpu005   00:00:03  COMPLETED      0:0      1049K    153376K               cpu=4,mem=0,node=1    gpu:4
  80273_8.extern     extern                               gpu005   00:00:03  COMPLETED      0:0         8K    153196K billing=172,cpu=4,gres/gpu=4,no+    gpu:4

```
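
Individual tasks of an array job can be queried by appending the task index to the job ID:

``` bash
# Show accounting data for a single array task rather than the whole array:
sacct_detail -j 80273_3
```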

NOTE

  • In the sacct output, look at the ".extern" line. An exit code of "0:0" means the job completed successfully. "OUT_OF_MEMORY" states on lines before the ".extern" line can be safely disregarded. For example, see the lines for array task 12345_3 in the output below:

``` text
[juser@picotte001 ~]$ module load slurm_util
[juser@picotte001 ~]$ sacct_detail -j 12345
           JobID    JobName      User  Partition       NodeList    Elapsed      State ExitCode     MaxRSS  MaxVMSize                        AllocTRES   AllocGRE
         12345_1     Run.sh     juser         bm       bigmem002   00:06:18  COMPLETED      0:0                                          cpu=12,node=1
   12345_1.batch      batch                            bigmem002   00:06:18  COMPLETED      0:0   4843835K   5222088K              cpu=12,mem=0,node=1
  12345_1.extern     extern                            bigmem002   00:06:18  COMPLETED      0:0       488K    153196K                    cpu=12,node=1
         12345_2     Run.sh     juser         bm       bigmem002   00:05:59  COMPLETED      0:0                                          cpu=12,node=1
   12345_2.batch      batch                            bigmem002   00:05:59  COMPLETED      0:0   4180691K   4558920K              cpu=12,mem=0,node=1
  12345_2.extern     extern                            bigmem002   00:05:59  COMPLETED      0:0       143K    153196K                    cpu=12,node=1
         12345_3     Run.sh     juser         bm       bigmem002   00:03:53 OUT_OF_ME+    0:125                                          cpu=12,node=1
   12345_3.batch      batch                            bigmem002   00:03:53 OUT_OF_ME+    0:125   1539337K   1917516K              cpu=12,mem=0,node=1
  12345_3.extern     extern                            bigmem002   00:03:53  COMPLETED      0:0       142K    153196K                    cpu=12,node=1
         12345_4     Run.sh     juser         bm       bigmem002   00:03:13  COMPLETED      0:0                                          cpu=12,node=1

```
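
If you want just the ".extern" line for a job, you can filter for it with plain sacct; a minimal sketch (the job ID here is only an example):

``` bash
# Show only the .extern step's state and exit code:
sacct -j 12345 --format=JobID,State,ExitCode | grep '\.extern'
```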

squeue_detail

squeue_detail shows a few more fields than the default squeue output.
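
For comparison, plain squeue accepts an explicit format string; something like the following approximates a more detailed listing (the fields squeue_detail actually adds are defined in the slurm_util module, so the field choice here is an assumption):

``` bash
# A more detailed squeue listing for your own jobs:
squeue -u $USER -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %.4C %R"
```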

squeue_long

squeue_long additionally shows MAX_CPUS and the job's allocated TRES (trackable resources), including billing.
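
A rough equivalent using squeue's long-format field names (the module's actual definition may differ):

``` bash
# maxcpus and tres-alloc are squeue --Format fields; this sketch assumes
# squeue_long selects something similar:
squeue -O jobid,username,maxcpus,tres-alloc
```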

seff_array

seff_array[3] is a utility created by the Yale Center for Research Computing. It runs seff on each task of a job array and displays histograms of memory usage and elapsed time in the terminal. (seff is undocumented; it takes a single argument, a job ID.)
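
Both are invoked with a single job ID; for a job array, pass the array's job ID to seff_array:

``` bash
# seff reports on a single job; seff_array summarizes every task in an array:
seff 123456
seff_array 123456
```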

The output also gives some advice on how to adjust your job's resource requests.

``` text
[juser@picotte ~]$ seff_array 123456
========== Max Memory Usage ==========

NumSamples = 90; Min = 896.29 MB; Max = 900.48 MB

Mean = 897.77 MB; Variance = 0.40 MB; SD = 0.63 MB; Median 897.78 MB

each ∎ represents a count of 1

806.6628 - 896.7108 MB [  2]: ∎∎
896.7108 - 897.1296 MB [  9]: ∎∎∎∎∎∎∎∎∎
897.1296 - 897.5484 MB [ 21]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
897.5484 - 897.9672 MB [ 34]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
897.9672 - 898.3860 MB [ 15]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
898.3860 - 898.8048 MB [  4]: ∎∎∎∎
898.8048 - 899.2236 MB [  1]: ∎
899.2236 - 899.6424 MB [  3]: ∎∎∎
899.6424 - 900.0612 MB [  0]:
900.0612 - 900.4800 MB [  1]: ∎

The requested memory was 2000MB.

========== Elapsed Time ==========

NumSamples = 90; Min = 00:03:25.0; Max = 00:07:24.0

Mean = 00:05:45.0; SD = 00:01:39.0; Median 00:06:44.0

each ∎ represents a count of 1

00:03:25.0 - 00:03:48.0 [ 30]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
00:03:48.0 - 00:04:11.0 [  0]:
00:04:11.0 - 00:04:34.0 [  0]:
00:04:34.0 - 00:04:57.0 [  0]:
00:04:57.0 - 00:05:20.0 [  0]:
00:05:20.0 - 00:05:43.0 [  0]:
00:05:43.0 - 00:06:06.0 [  0]:
00:06:06.0 - 00:06:29.0 [  0]:
00:06:29.0 - 00:06:52.0 [ 30]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
00:06:52.0 - 00:07:15.0 [ 28]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎


The requested runtime was 01:00:00. The average runtime was 00:05:45.0. Requesting less time would allow jobs to run more quickly.


```

References

[1] Slurm Documentation: sinfo. https://slurm.schedmd.com/sinfo.html

[2] Slurm Documentation: sacct. https://slurm.schedmd.com/sacct.html

[3] ycrc/seff-array repository on GitHub. https://github.com/ycrc/seff-array