Monitoring

There are several ways to monitor your jobs and the load on the cluster.

Job Monitoring on Picotte

squeue

squeue[1] shows information about jobs in the queue, both pending and running. To view your own jobs:

squeue --me
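
To keep an eye on the queue without re-running the command by hand, you can wrap it in the standard watch(1) utility (the 30-second refresh interval here is just an example):

watch -n 30 squeue --me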

scontrol

scontrol[2] shows or modifies various Slurm configurations, including jobs and job steps. To view detailed information about a job:

scontrol show job job_id

Example:

[juser@picotte001 ~]$ scontrol show job 567890
JobId=567890 JobName=something.sh
   UserId=juser(1234) GroupId=juser(1234) MCS_label=N/A
   Priority=2140 Nice=0 Account=someprj QOS=normal WCKey=*
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:14:46 TimeLimit=04:00:00 TimeMin=N/A
   SubmitTime=2021-05-24T13:19:08 EligibleTime=2021-05-24T13:19:08
   AccrueTime=2021-05-24T13:19:08
   StartTime=2021-05-24T13:19:09 EndTime=2021-05-24T17:19:09 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-05-24T13:19:09
   Partition=def AllocNode:Sid=picotte001:50418
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node031
   BatchHost=node031
   NumNodes=1 NumCPUs=16 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=16,node=1,billing=16
   Socks/Node=* NtasksPerN:B:S:C=16:0:*:* CoreSpec=*
   MinCPUsNode=16 MinMemoryCPU=3900M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/ifs/groups/someGrp/juser/submit.sh
   WorkDir=/ifs/groups/someGrp/juser/
   StdErr=/ifs/groups/someGrp/juser/log.out
   StdIn=/dev/null
   StdOut=/ifs/groups/someGrp/juser/log.out
   Power=
   MailUser=juser MailType=NONE
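
Since scontrol can also modify jobs, a sketch of changing a job's time limit (the job ID and new limit are illustrative; non-administrator users can typically only shorten a time limit, not extend it):

scontrol update JobId=567890 TimeLimit=02:00:00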

sacct

sacct[3] shows accounting information for completed jobs, and has many options.

For a detailed, preformatted summary, use sacct_detail from the slurm_util module (see Slurm Utility Commands). Example:

[juser@picotte001 ~]$ module load slurm_util
[juser@picotte001 ~]$ sacct_detail -j 12345
               JobID    JobName      User  Partition        NodeList    Elapsed      State ExitCode     MaxRSS  MaxVMSize                        AllocTRES AllocGRE
-------------------- ---------- --------- ---------- --------------- ---------- ---------- -------- ---------- ---------- -------------------------------- --------
             12345_1     Run.sh     juser         bm       bigmem002   00:06:18  COMPLETED      0:0                                          cpu=12,node=1
       12345_1.batch      batch                            bigmem002   00:06:18  COMPLETED      0:0   4843835K   5222088K              cpu=12,mem=0,node=1
      12345_1.extern     extern                            bigmem002   00:06:18  COMPLETED      0:0       488K    153196K                    cpu=12,node=1
             12345_2     Run.sh     juser         bm       bigmem002   00:05:59  COMPLETED      0:0                                          cpu=12,node=1
       12345_2.batch      batch                            bigmem002   00:05:59  COMPLETED      0:0   4180691K   4558920K              cpu=12,mem=0,node=1
      12345_2.extern     extern                            bigmem002   00:05:59  COMPLETED      0:0       143K    153196K                    cpu=12,node=1
             12345_3     Run.sh     juser         bm       bigmem002   00:03:53 OUT_OF_ME+    0:125                                          cpu=12,node=1
       12345_3.batch      batch                            bigmem002   00:03:53 OUT_OF_ME+    0:125   1539337K   1917516K              cpu=12,mem=0,node=1
      12345_3.extern     extern                            bigmem002   00:03:53  COMPLETED      0:0       142K    153196K                    cpu=12,node=1
             12345_4     Run.sh     juser         bm       bigmem002   00:03:13  COMPLETED      0:0                                          cpu=12,node=1
       12345_4.batch      batch                            bigmem002   00:03:13  COMPLETED      0:0   1531823K   1909808K              cpu=12,mem=0,node=1
      12345_4.extern     extern                            bigmem002   00:03:13  COMPLETED      0:0       143K    153196K                    cpu=12,node=1
             12345_5     Run.sh     juser         bm       bigmem002   00:02:33  COMPLETED      0:0                                          cpu=12,node=1
       12345_5.batch      batch                            bigmem002   00:02:33  COMPLETED      0:0   1757247K   2135724K              cpu=12,mem=0,node=1
      12345_5.extern     extern                            bigmem002   00:02:33  COMPLETED      0:0       143K    153196K                    cpu=12,node=1
             12345_6     Run.sh     juser         bm       bigmem002   00:02:14  COMPLETED      0:0                                          cpu=12,node=1
       12345_6.batch      batch                            bigmem002   00:02:14  COMPLETED      0:0    373194K   1467748K              cpu=12,mem=0,node=1
      12345_6.extern     extern                            bigmem002   00:02:14  COMPLETED      0:0       143K    153196K                    cpu=12,node=1
             12345_7     Run.sh     juser         bm       bigmem002   00:01:54  COMPLETED      0:0                                          cpu=12,node=1
       12345_7.batch      batch                            bigmem002   00:01:54  COMPLETED      0:0    137955K   1470872K              cpu=12,mem=0,node=1
      12345_7.extern     extern                            bigmem002   00:01:54  COMPLETED      0:0       494K    153196K                    cpu=12,node=1
             12345_8     Run.sh     juser         bm       bigmem002   00:01:32  COMPLETED      0:0                                          cpu=12,node=1
       12345_8.batch      batch                            bigmem002   00:01:32  COMPLETED      0:0    872741K   1451004K              cpu=12,mem=0,node=1
      12345_8.extern     extern                            bigmem002   00:01:32  COMPLETED      0:0       456K    153196K                    cpu=12,node=1
             12345_9     Run.sh     juser         bm       bigmem002   00:01:04  COMPLETED      0:0                                          cpu=12,node=1
       12345_9.batch      batch                            bigmem002   00:01:04  COMPLETED      0:0    185753K   1452356K              cpu=12,mem=0,node=1
      12345_9.extern     extern                            bigmem002   00:01:04  COMPLETED      0:0       143K    153196K                    cpu=12,node=1
            12345_10     Run.sh     juser         bm       bigmem002   00:00:59  COMPLETED      0:0                                          cpu=12,node=1
      12345_10.batch      batch                            bigmem002   00:00:59  COMPLETED      0:0    160845K   1490248K              cpu=12,mem=0,node=1
     12345_10.extern     extern                            bigmem002   00:00:59  COMPLETED      0:0       137K    153196K                    cpu=12,node=1

NOTE:

  • In the sacct output, look at the ".extern" line. An exit code of "0:0" means the job completed successfully. "OUT_OF_MEMORY" states in lines before the ".extern" line can be safely disregarded. For example, see the lines for job task 12345_3 in the output above.
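
If the slurm_util module is unavailable, plain sacct with an explicit field list gives similar detail (the field names below are standard sacct format fields; pick whichever you need):

sacct -j 12345 --format=JobID,JobName,User,Partition,NodeList,Elapsed,State,ExitCode,MaxRSS,MaxVMSize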

Direct ssh to node, or srun

You can start a shell on a node where you have a job running, either by direct ssh, or by using srun.

  • ssh - on the login node picotte001

ssh nodename

  • srun - on the login node picotte001

srun --jobid=JOBID --pty /bin/bash

Once on the node, you can use top(1) or ps(1) to view information about the job. See the appropriate man pages for these standard Linux programs.
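
For example, once on the node, commands like the following (the username juser is illustrative) narrow the view to your own processes:

top -u juser
ps -u juser -o pid,pcpu,pmem,etime,cmd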

sreport

sreport can provide utilization metrics, including SUs (service units) consumed.
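
As a sketch, the following would report hours consumed per user under the someprj account from the earlier examples (the account name and date range are illustrative):

sreport -t hours cluster AccountUtilizationByUser Accounts=someprj Start=2021-05-01 End=2021-06-01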

Post-job Analysis

The Bright Cluster Manager User Portal has a tool that can query various metrics about past jobs:

https://picottemgmt.urcf.drexel.edu/userportal/#/accounting

Log in with your URCF account.

See Also

  • Slurm Utility Commands

References

[1] Slurm Documentation - squeue: https://slurm.schedmd.com/squeue.html

[2] Slurm Documentation - scontrol: https://slurm.schedmd.com/scontrol.html

[3] Slurm Documentation - sacct: https://slurm.schedmd.com/sacct.html