Monitoring
There are several ways to monitor your jobs and the load on the cluster.
Job Monitoring on Picotte
squeue
squeue [1] shows information about running jobs. To view your own jobs:
squeue --me
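squeue also accepts filters and a custom output format. For instance, the following commands (the partition name and format string are only illustrations) restrict the listing to your jobs in one partition and select specific columns:
[juser@picotte001 ~]$ squeue --me --partition=def
[juser@picotte001 ~]$ squeue --me --format="%.10i %.9P %.20j %.8T %.10M %R"
Here %i is the job ID, %P the partition, %j the job name, %T the state, %M the elapsed time, and %R the reason or node list; see the squeue man page for the full list of format specifiers.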
scontrol
scontrol [2] shows or modifies various Slurm configurations, including jobs and job steps. To view detailed information about a job:
scontrol show job job_id
Example:
[juser@picotte001 ~]$ scontrol show job 567890
JobId=567890 JobName=something.sh
UserId=juser(1234) GroupId=juser(1234) MCS_label=N/A
Priority=2140 Nice=0 Account=someprj QOS=normal WCKey=*
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:14:46 TimeLimit=04:00:00 TimeMin=N/A
SubmitTime=2021-05-24T13:19:08 EligibleTime=2021-05-24T13:19:08
AccrueTime=2021-05-24T13:19:08
StartTime=2021-05-24T13:19:09 EndTime=2021-05-24T17:19:09 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-05-24T13:19:09
Partition=def AllocNode:Sid=picotte001:50418
ReqNodeList=(null) ExcNodeList=(null)
NodeList=node031
BatchHost=node031
NumNodes=1 NumCPUs=16 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=16,node=1,billing=16
Socks/Node=* NtasksPerN:B:S:C=16:0:*:* CoreSpec=*
MinCPUsNode=16 MinMemoryCPU=3900M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/ifs/groups/someGrp/juser/submit.sh
WorkDir=/ifs/groups/someGrp/juser/
StdErr=/ifs/groups/someGrp/juser/log.out
StdIn=/dev/null
StdOut=/ifs/groups/someGrp/juser/log.out
Power=
MailUser=juser MailType=NONE
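Because scontrol can also modify jobs, it can be used, for example, to reduce the time limit of a pending or running job (the job ID and new limit below are illustrative):
[juser@picotte001 ~]$ scontrol update JobId=567890 TimeLimit=02:00:00
Note that regular users can generally only decrease a job's time limit; increasing it requires an administrator.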
sacct
sacct [3] shows information for completed jobs. It has many options. Use sacct_detail from the Slurm Utility Commands module. Example:
[juser@picotte001 ~]$ module load slurm_util
[juser@picotte001 ~]$ sacct_detail -j 12345
JobID JobName User Partition NodeList Elapsed State ExitCode MaxRSS MaxVMSize AllocTRES AllocGRE
-------------------- ---------- --------- ---------- --------------- ---------- ---------- -------- ---------- ---------- -------------------------------- --------
12345_1 Run.sh juser bm bigmem002 00:06:18 COMPLETED 0:0 cpu=12,node=1
12345_1.batch batch bigmem002 00:06:18 COMPLETED 0:0 4843835K 5222088K cpu=12,mem=0,node=1
12345_1.extern extern bigmem002 00:06:18 COMPLETED 0:0 488K 153196K cpu=12,node=1
12345_2 Run.sh juser bm bigmem002 00:05:59 COMPLETED 0:0 cpu=12,node=1
12345_2.batch batch bigmem002 00:05:59 COMPLETED 0:0 4180691K 4558920K cpu=12,mem=0,node=1
12345_2.extern extern bigmem002 00:05:59 COMPLETED 0:0 143K 153196K cpu=12,node=1
12345_3 Run.sh juser bm bigmem002 00:03:53 OUT_OF_ME+ 0:125 cpu=12,node=1
12345_3.batch batch bigmem002 00:03:53 OUT_OF_ME+ 0:125 1539337K 1917516K cpu=12,mem=0,node=1
12345_3.extern extern bigmem002 00:03:53 COMPLETED 0:0 142K 153196K cpu=12,node=1
12345_4 Run.sh juser bm bigmem002 00:03:13 COMPLETED 0:0 cpu=12,node=1
12345_4.batch batch bigmem002 00:03:13 COMPLETED 0:0 1531823K 1909808K cpu=12,mem=0,node=1
12345_4.extern extern bigmem002 00:03:13 COMPLETED 0:0 143K 153196K cpu=12,node=1
12345_5 Run.sh juser bm bigmem002 00:02:33 COMPLETED 0:0 cpu=12,node=1
12345_5.batch batch bigmem002 00:02:33 COMPLETED 0:0 1757247K 2135724K cpu=12,mem=0,node=1
12345_5.extern extern bigmem002 00:02:33 COMPLETED 0:0 143K 153196K cpu=12,node=1
12345_6 Run.sh juser bm bigmem002 00:02:14 COMPLETED 0:0 cpu=12,node=1
12345_6.batch batch bigmem002 00:02:14 COMPLETED 0:0 373194K 1467748K cpu=12,mem=0,node=1
12345_6.extern extern bigmem002 00:02:14 COMPLETED 0:0 143K 153196K cpu=12,node=1
12345_7 Run.sh juser bm bigmem002 00:01:54 COMPLETED 0:0 cpu=12,node=1
12345_7.batch batch bigmem002 00:01:54 COMPLETED 0:0 137955K 1470872K cpu=12,mem=0,node=1
12345_7.extern extern bigmem002 00:01:54 COMPLETED 0:0 494K 153196K cpu=12,node=1
12345_8 Run.sh juser bm bigmem002 00:01:32 COMPLETED 0:0 cpu=12,node=1
12345_8.batch batch bigmem002 00:01:32 COMPLETED 0:0 872741K 1451004K cpu=12,mem=0,node=1
12345_8.extern extern bigmem002 00:01:32 COMPLETED 0:0 456K 153196K cpu=12,node=1
12345_9 Run.sh juser bm bigmem002 00:01:04 COMPLETED 0:0 cpu=12,node=1
12345_9.batch batch bigmem002 00:01:04 COMPLETED 0:0 185753K 1452356K cpu=12,mem=0,node=1
12345_9.extern extern bigmem002 00:01:04 COMPLETED 0:0 143K 153196K cpu=12,node=1
12345_10 Run.sh juser bm bigmem002 00:00:59 COMPLETED 0:0 cpu=12,node=1
12345_10.batch batch bigmem002 00:00:59 COMPLETED 0:0 160845K 1490248K cpu=12,mem=0,node=1
12345_10.extern extern bigmem002 00:00:59 COMPLETED 0:0 137K 153196K cpu=12,node=1
NOTE:
- In the sacct output, look at the ".extern" line. An exit code of "0:0" means the job completed successfully. "OUT_OF_MEMORY" states in lines before the ".extern" line can be safely disregarded. For example, see the lines for job task 12345_3 in the output above.
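If you prefer plain sacct, a similar summary can be requested with the --format option (the field list below is just one possible selection):
[juser@picotte001 ~]$ sacct -j 12345 --format=JobID,JobName,Partition,NodeList,Elapsed,State,ExitCode,MaxRSS,ReqMem
Run sacct --helpformat to see all available fields.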
Direct ssh to node, or srun
You can start a shell on a node where you have a job running, either by direct ssh or by using srun.
- ssh - on the login node picotte001:
  ssh nodename
- srun - on the login node picotte001:
  srun --jobid JOBID --pty /bin/bash
Once on the node, you can use top(1) or ps(1) to view information about the job. See the appropriate man pages for these standard Linux programs.
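For example, to attach to a running job and inspect its processes (the job ID is illustrative, and the node prompt assumes the job from the scontrol example above):
[juser@picotte001 ~]$ srun --jobid 567890 --pty /bin/bash
[juser@node031 ~]$ top -u juser
[juser@node031 ~]$ ps -u juser -o pid,pcpu,pmem,etime,args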
sreport
sreport can provide utilization metrics, including SUs consumed.
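For example, a per-user utilization report for a project account over a date range might be requested as follows (the account name and dates are placeholders):
[juser@picotte001 ~]$ sreport cluster AccountUtilizationByUser Accounts=someprj Start=2021-05-01 End=2021-06-01 -t Hours
The report shows CPU time charged to each user under the account, which relates to SUs consumed; see the sreport man page for other report types.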
Post-job Analysis
The Bright Cluster Manager User Portal has a tool which can query various metrics about past jobs:
https://picottemgmt.urcf.drexel.edu/userportal/#/accounting
Log in with your URCF account.
See Also
- Slurm Utility Commands
References
[1] Slurm Documentation - squeue
[2] Slurm Documentation - scontrol
[3] Slurm Documentation - sacct