Monitoring

There are several ways to monitor your jobs and the load on the cluster.

Job Monitoring on Picotte

squeue

squeue[1] shows information about jobs in the queue, both pending and running. To view your own jobs:

squeue --me
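
To keep an eye on the queue without re-running the command by hand, you can wrap it in the standard watch(1) utility (the 30-second refresh interval here is just an example):

watch -n 30 squeue --me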

scontrol

scontrol[2] shows or modifies various Slurm configurations, including jobs and job steps. To view detailed information about a job:

scontrol show job job_id

Example:

[juser@picotte001 ~]$ scontrol show job 567890
JobId=567890 JobName=something.sh
   UserId=juser(1234) GroupId=juser(1234) MCS_label=N/A
   Priority=2140 Nice=0 Account=someprj QOS=normal WCKey=*
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:14:46 TimeLimit=04:00:00 TimeMin=N/A
   SubmitTime=2021-05-24T13:19:08 EligibleTime=2021-05-24T13:19:08
   AccrueTime=2021-05-24T13:19:08
   StartTime=2021-05-24T13:19:09 EndTime=2021-05-24T17:19:09 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-05-24T13:19:09
   Partition=def AllocNode:Sid=picotte001:50418
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node031
   BatchHost=node031
   NumNodes=1 NumCPUs=16 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=16,node=1,billing=16
   Socks/Node=* NtasksPerN:B:S:C=16:0:*:* CoreSpec=*
   MinCPUsNode=16 MinMemoryCPU=3900M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/ifs/groups/someGrp/juser/submit.sh
   WorkDir=/ifs/groups/someGrp/juser/
   StdErr=/ifs/groups/someGrp/juser/log.out
   StdIn=/dev/null
   StdOut=/ifs/groups/someGrp/juser/log.out
   Power=
   MailUser=juser MailType=NONE
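
Since scontrol can also modify jobs, a sketch of changing a job's time limit (the job ID and new limit are illustrative; non-administrator users can typically only shorten a time limit, not extend it):

scontrol update JobId=567890 TimeLimit=02:00:00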

sacct

sacct[3] shows accounting information for completed jobs, and has many options.

For a detailed, preformatted summary, use sacct_detail from the slurm_util module (see Slurm Utility Commands). Example:

[juser@picotte001 ~]$ module load slurm_util
[juser@picotte001 ~]$ sacct_detail -j 12345
               JobID    JobName      User  Partition        NodeList    Elapsed      State ExitCode     MaxRSS  MaxVMSize                        AllocTRES AllocGRE
-------------------- ---------- --------- ---------- --------------- ---------- ---------- -------- ---------- ---------- -------------------------------- --------
             12345_1     Run.sh     juser         bm       bigmem002   00:06:18  COMPLETED      0:0                                          cpu=12,node=1
       12345_1.batch      batch                            bigmem002   00:06:18  COMPLETED      0:0   4843835K   5222088K              cpu=12,mem=0,node=1
      12345_1.extern     extern                            bigmem002   00:06:18  COMPLETED      0:0       488K    153196K                    cpu=12,node=1
             12345_2     Run.sh     juser         bm       bigmem002   00:05:59  COMPLETED      0:0                                          cpu=12,node=1
       12345_2.batch      batch                            bigmem002   00:05:59  COMPLETED      0:0   4180691K   4558920K              cpu=12,mem=0,node=1
      12345_2.extern     extern                            bigmem002   00:05:59  COMPLETED      0:0       143K    153196K                    cpu=12,node=1
             12345_3     Run.sh     juser         bm       bigmem002   00:03:53 OUT_OF_ME+    0:125                                          cpu=12,node=1
       12345_3.batch      batch                            bigmem002   00:03:53 OUT_OF_ME+    0:125   1539337K   1917516K              cpu=12,mem=0,node=1
      12345_3.extern     extern                            bigmem002   00:03:53  COMPLETED      0:0       142K    153196K                    cpu=12,node=1
             12345_4     Run.sh     juser         bm       bigmem002   00:03:13  COMPLETED      0:0                                          cpu=12,node=1
       12345_4.batch      batch                            bigmem002   00:03:13  COMPLETED      0:0   1531823K   1909808K              cpu=12,mem=0,node=1
      12345_4.extern     extern                            bigmem002   00:03:13  COMPLETED      0:0       143K    153196K                    cpu=12,node=1
             12345_5     Run.sh     juser         bm       bigmem002   00:02:33  COMPLETED      0:0                                          cpu=12,node=1
       12345_5.batch      batch                            bigmem002   00:02:33  COMPLETED      0:0   1757247K   2135724K              cpu=12,mem=0,node=1
      12345_5.extern     extern                            bigmem002   00:02:33  COMPLETED      0:0       143K    153196K                    cpu=12,node=1
             12345_6     Run.sh     juser         bm       bigmem002   00:02:14  COMPLETED      0:0                                          cpu=12,node=1
       12345_6.batch      batch                            bigmem002   00:02:14  COMPLETED      0:0    373194K   1467748K              cpu=12,mem=0,node=1
      12345_6.extern     extern                            bigmem002   00:02:14  COMPLETED      0:0       143K    153196K                    cpu=12,node=1
             12345_7     Run.sh     juser         bm       bigmem002   00:01:54  COMPLETED      0:0                                          cpu=12,node=1
       12345_7.batch      batch                            bigmem002   00:01:54  COMPLETED      0:0    137955K   1470872K              cpu=12,mem=0,node=1
      12345_7.extern     extern                            bigmem002   00:01:54  COMPLETED      0:0       494K    153196K                    cpu=12,node=1
             12345_8     Run.sh     juser         bm       bigmem002   00:01:32  COMPLETED      0:0                                          cpu=12,node=1
       12345_8.batch      batch                            bigmem002   00:01:32  COMPLETED      0:0    872741K   1451004K              cpu=12,mem=0,node=1
      12345_8.extern     extern                            bigmem002   00:01:32  COMPLETED      0:0       456K    153196K                    cpu=12,node=1
             12345_9     Run.sh     juser         bm       bigmem002   00:01:04  COMPLETED      0:0                                          cpu=12,node=1
       12345_9.batch      batch                            bigmem002   00:01:04  COMPLETED      0:0    185753K   1452356K              cpu=12,mem=0,node=1
      12345_9.extern     extern                            bigmem002   00:01:04  COMPLETED      0:0       143K    153196K                    cpu=12,node=1
            12345_10     Run.sh     juser         bm       bigmem002   00:00:59  COMPLETED      0:0                                          cpu=12,node=1
      12345_10.batch      batch                            bigmem002   00:00:59  COMPLETED      0:0    160845K   1490248K              cpu=12,mem=0,node=1
     12345_10.extern     extern                            bigmem002   00:00:59  COMPLETED      0:0       137K    153196K                    cpu=12,node=1

NOTE:

  • In the sacct output, look at the ".extern" line. An exit code of "0:0" means the job completed successfully. "OUT_OF_MEMORY" states in lines before the ".extern" line can be safely disregarded. For example, see the lines for job task 12345_3 in the output above.
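
If the slurm_util module is unavailable, plain sacct with an explicit field list gives similar detail (the field names below are standard sacct format fields; pick whichever you need):

sacct -j 12345 --format=JobID,JobName,User,Partition,NodeList,Elapsed,State,ExitCode,MaxRSS,MaxVMSize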

Direct ssh to node, or srun

You can start a shell on a node where you have a job running, either by direct ssh, or by using srun.

  • ssh - on the login node picotte001

ssh nodename

  • srun - on the login node picotte001

srun --jobid=JOBID --pty /bin/bash

Once on the node, you can use top(1) or ps(1) to view information about the job. See the appropriate man pages for these standard Linux programs.
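
For example, once on the node, commands like the following (the username juser is illustrative) narrow the view to your own processes:

top -u juser
ps -u juser -o pid,pcpu,pmem,etime,cmd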

sreport

sreport can provide utilization metrics, including SUs (service units) consumed.
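
As a sketch, the following would report hours consumed per user under the someprj account from the earlier examples (the account name and date range are illustrative):

sreport -t hours cluster AccountUtilizationByUser Accounts=someprj Start=2021-05-01 End=2021-06-01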

Post-job Analysis

The Bright Cluster Manager User Portal has a tool that can query various metrics about past jobs:

https://picottemgmt.urcf.drexel.edu/userportal/#/accounting

Log in with your URCF account.

See Also

  • Slurm Utility Commands

References

[1] Slurm Documentation - squeue: https://slurm.schedmd.com/squeue.html

[2] Slurm Documentation - scontrol: https://slurm.schedmd.com/scontrol.html

[3] Slurm Documentation - sacct: https://slurm.schedmd.com/sacct.html