Slurm Reference
This article gives a quick overview of the most commonly used SLURM commands. The official SLURM documentation also covers these commands comprehensively.
Commands♯
sbatch♯
sbatch is used to submit job scripts to Slurm for later
execution. The script may contain multiple srun commands in order to
launch parallel tasks.
Options have a short (single-letter) and a long form. The short form takes its value without an "=" sign, while the long form is written with one. E.g. "-N 4" is equivalent to "--nodes=4".
Some common options are listed below. This summary is incomplete and not definitive; consult the official documentation, either the man page for sbatch or the web version.
| Option | Meaning | Example | 
|---|---|---|
| -A, --account=account | Identifies which account the job should be charged to. | -A somethingPrj | 
| -D, --chdir=dir | Set the working directory to the specified directory before executing the job script. If unspecified, the directory where the sbatch command is issued is used. | -D /ifs/groups/somethGrp/myname/ | 
| -p, --partition=part | Specifies the partition the job will run on. If unspecified, the "def" partition will be used. | -p def | 
| -N, --nodes=numNodes | Specify the number of nodes to be allocated for the job. | -N 16 | 
| -n, --ntasks=count | sbatch does not launch tasks, it requests an allocation of resources and submits a batch script. This option tells Slurm that job steps within the allocation will launch a maximum of count tasks, and to provide sufficient resources. The default is one task per node. | --ntasks=4 | 
| -c, --cpus-per-task=count | Advise Slurm that the job will require count number of CPU cores per task. (The default value is 1.) | --cpus-per-task=12 | 
| -t, --time=hh:mm:ss | Specify the amount of time to allocate for the job. | -t 24:00:00 | 
| --mem=size[units] | Specify the real memory required per node. Default units are megabytes. Different units can be specified using one of the suffixes "K", "M", "G", or "T". NOTE: The "--mem", "--mem-per-cpu", and "--mem-per-gpu" options are mutually exclusive. | --mem=8GB | 
| --mem-per-cpu=size[units] | Minimum memory required per allocated CPU. Default units are megabytes. The default value is 3900 MB. | |
| --mem-per-gpu=size[units] | Minimum (node) memory required per allocated GPU. Default units are megabytes. | |
| --mail-type=type | Notify the user by mail when an event of the given type occurs. Acceptable types are: NONE, BEGIN, END, FAIL, REQUEUE, STAGE_OUT, ALL, and ARRAY_TASKS. | --mail-type=BEGIN,END,FAIL | 
| --mail-user=user@host | Set email address to receive job status. N.B. Drexel's Outlook mail servers will drop Picotte emails, so use an external email address. | --mail-user=juser@gmail.com | 
| -i, --input=input_file | input_file will be used as input for the job | -i /ifs/groups/myGrp/data/awesomedata.csv | 
| -o, --output=output_file | job output will be written to output_file | -o /ifs/groups/myGrp/juser/job_outputs/awesomejob.out | 
| -e, --error=error_file | job error will be written to error_file | -e /ifs/groups/myGrp/juser/job_outputs/sadjob.err | 
| --gres=gpu:N | Request N GPU devices (cards). N can be up to 4. | --gres=gpu:2 | 
Options may be passed on the command line:
[juser@picotte001 ~]$ sbatch -N 4 -t 12:00:00 --mem=2GB myjob.sh
or set in the job script:
#SBATCH -N 4
#SBATCH -t 12:00:00
#SBATCH --mem=2GB
Options passed on the command line override the settings embedded in the job script.
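Putting these pieces together, a complete job script might look like the minimal sketch below. The account name somethingPrj is the placeholder from the table above, and the job name and the final echo are purely illustrative. SLURM_JOB_ID and SLURM_JOB_NODELIST are environment variables that Slurm sets inside a job; the fallbacks are an assumption added here so the script can also be run by hand.

```shell
#!/bin/bash
#SBATCH --account=somethingPrj   # placeholder account from the table above
#SBATCH --partition=def
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:30:00
#SBATCH --mem=2G
#SBATCH --job-name=example       # hypothetical job name

# Slurm sets these variables inside a job; the fallbacks only apply
# when the script is run by hand for testing.
job_id="${SLURM_JOB_ID:-none}"
node_list="${SLURM_JOB_NODELIST:-${HOSTNAME:-localhost}}"

summary="job ${job_id} running on ${node_list}"
echo "$summary"
```

If this were saved as example.sh (a hypothetical file name), submitting it with "sbatch -t 01:00:00 example.sh" would apply a one-hour limit instead of the 30 minutes set in the script.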
srun♯
srun is used to submit a job for execution or initiate job
steps in real-time. Example: get a shell on a compute node:
[juser@picotte001 ~]$ srun -N 1 --mem=32G --pty /bin/bash
[juser@node001 ~]$
(Or you can use "/bin/bash -l" which gives a new login shell; useful
if you have separated settings into .bash_profile for environment
variables, and other things into .bashrc.)
Example: run Matlab (terminal UI) using 48 cores on a node:
[juser@picotte001 ~]$ module load matlab
[juser@picotte001 ~]$ srun --nodes=1 --ntasks=1 --cpus-per-task=48 --mem=32G --time=00:15:00 --pty matlab -nodisplay -nodesktop -nosplash -noFigureWindows

Unfortunately, there are unresolved issues with running MPI programs
with srun, which means one cannot do:
[juser@picotte001 ~]$ srun ... some_mpi_program
To run MPI programs, a batch script must be used, executing the program
with "mpirun".
squeue♯
squeue is used to report the state of jobs or job steps. The
option "-j" will show the status of a given job.
[juser@picotte001 ~]$ squeue -j 127
JOBID   PARTITION   NAME      USER   ST      TIME  NODES NODELIST(REASON)
127       all         test.sh  juser  R      1:49:21     1       r569
The option --me will show all your own jobs.
The option -u will show all jobs for a given user.
[juser@picotte001 ~]$ squeue -u juser
JOBID   PARTITION   NAME      USER   ST      TIME  NODES NODELIST(REASON)
127       all         test.sh  juser  R      1:49:21     1       r569
128       all         test.sh  juser  R      1:50:08     1       r569
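When many jobs are queued, a per-state tally can be easier to read than the full listing. The sketch below assumes the default column layout shown above, where the fifth column is the state (ST), and job names without spaces. count_states is a hypothetical helper name, not a Slurm command.

```shell
# Tally jobs by state from squeue's default output (read from stdin).
# Assumes the default column order, where column 5 is ST, and that
# job names contain no spaces.
count_states() {
    awk 'NR > 1 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# Typical use on the cluster (only runs where squeue is available):
if command -v squeue >/dev/null 2>&1; then
    squeue --me | count_states
fi
```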
scancel♯
scancel is used to cancel a pending or running job or job step.
[juser@picotte001 ~]$ sbatch testJob.sh
Submitted batch job 127
[juser@picotte001 ~]$ squeue -u juser
                JOBID PARTITION     NAME      USER ST        TIME  NODES NODELIST(REASON)
              127             RM testJob.    cwf25 PD        0:00       1 (None)
[juser@picotte001 ~]$ scancel 127
[juser@picotte001 ~]$ squeue -u juser
                JOBID PARTITION     NAME      USER ST        TIME  NODES NODELIST(REASON)
[juser@picotte001 ~]$
scontrol♯
scontrol is used to view or modify
Slurm configuration and job state, among other things. It has a wide variety
of uses, some of which are demonstrated below.
scontrol can be used to get information on the nodes Slurm manages. For an
individual node, use the command scontrol show node NODE_ID. The command scontrol
show nodes shows all nodes in the cluster, with each node's information
displayed in the format below.
[juser@picotte001 ~]$ scontrol show node node013
NodeName=node013 Arch=x86_64 CoresPerSocket=12
    CPUAlloc=48 CPUTot=48 CPULoad=7.76
    AvailableFeatures=(null)
    ActiveFeatures=(null)
    Gres=(null)
    NodeAddr=node013 NodeHostName=node013 Version=20.02.6
    OS=Linux 4.18.0-147.el8.x86_64 #1 SMP Thu Sep 26 15:52:44 UTC 2019
    RealMemory=192000 AllocMem=0 FreeMem=178621 Sockets=4 Boards=1
    State=ALLOCATED ThreadsPerCore=1 TmpDisk=174864 Weight=1 Owner=N/A MCS_label=N/A
    Partitions=def,long
    BootTime=2021-01-27T14:42:33 SlurmdStartTime=2021-01-27T14:44:20
    CfgTRES=cpu=48,mem=187.50G,billing=48
    AllocTRES=cpu=48
    CapWatts=n/a
    CurrentWatts=0 AveWatts=0
    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
To get information about a job, use the command scontrol show job
JOB_ID. You can use scontrol show jobs to show info about all jobs in the queue.
Here is an example of a job running on a GPU node:
[juser@picotte001 ~]$ scontrol show job 127
JobId=127 JobName=somejob
    UserId=juser(1002) GroupId=dwc62(1002) MCS_label=N/A
    Priority=2012 Nice=0 Account=someprj QOS=normal WCKey=*
    JobState=RUNNING Reason=None Dependency=(null)
    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
    RunTime=00:00:46 TimeLimit=1-00:00:00 TimeMin=N/A
    SubmitTime=2021-04-29T11:48:35 EligibleTime=2021-04-29T11:48:35
    AccrueTime=2021-04-29T11:48:35
    StartTime=2021-04-29T11:48:35 EndTime=2021-04-30T11:48:35 Deadline=N/A
    SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-04-29T11:48:35
    Partition=gpu AllocNode:Sid=picotte001:27308
    ReqNodeList=(null) ExcNodeList=(null)
    NodeList=gpu001
    BatchHost=gpu001
    NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
    TRES=cpu=1,node=1,billing=43,gres/gpu=1
    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
    MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
    Features=(null) DelayBoot=00:00:00
    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
    Command=/home/juser/somejob.sh
    WorkDir=/home/juser
    StdErr=/home/juser/slurm-127.err
    StdIn=/dev/null
    StdOut=/home/juser/slurm-127.out
    Power=
    MemPerTres=gpu:40960
    TresPerNode=gpu:1
    MailUser=juser MailType=NONE
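Because this output is a series of Key=Value pairs, a single field can be picked out in a script. The following is a minimal sketch; job_field is a hypothetical helper, not part of Slurm, and it assumes the value itself contains no spaces.

```shell
# Print the value of one Key=Value field from "scontrol show job" output
# read on stdin. Assumes the value contains no spaces.
job_field() {
    tr -s ' ' '\n' |
        awk -v key="$1" 'index($0, key "=") == 1 { print substr($0, length(key) + 2); exit }'
}

# Typical use from inside a running job (only runs where scontrol is
# available and SLURM_JOB_ID is set):
if command -v scontrol >/dev/null 2>&1 && [ -n "${SLURM_JOB_ID:-}" ]; then
    scontrol show job "$SLURM_JOB_ID" | job_field JobState
fi
```

For the job shown above, "scontrol show job 127 | job_field JobState" would print RUNNING.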
scontrol can also be used to hold and release jobs.
[juser@picotte001 ~]$ sbatch testJob.sh
Submitted batch job 128
[juser@picotte001 ~]$ scontrol hold 128
[juser@picotte001 ~]$ squeue -u juser
                    JOBID PARTITION     NAME      USER ST        TIME  NODES NODELIST(REASON)
            128             def testJob.    juser PD         0:00       1 (JobHeldUser)
[juser@picotte001 ~]$ scontrol release 128
[juser@picotte001 ~]$ squeue -u juser
                    JOBID PARTITION     NAME      USER ST        TIME  NODES NODELIST(REASON)
            128             def testJob.    juser   R         0:06       1 node023
scontrol can also modify aspects of a job, like run time, and the task throttle of an array job. To modify the throttle on an array job:
scontrol update JobId=<jobid> ArrayTaskThrottle=<count>
e.g.
scontrol update JobID=12345678 ArrayTaskThrottle=50
This changes the throttle but does not affect tasks that are already running; it only limits how many tasks may run simultaneously from then on.
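For context, an initial throttle can be set at submission time with a "%" suffix on the --array option. A minimal sketch of an array job script follows; the input file naming scheme is hypothetical, and the fallback value for SLURM_ARRAY_TASK_ID is an assumption added so the script can be run by hand.

```shell
#!/bin/bash
#SBATCH --array=1-100%10     # 100 tasks, at most 10 running at once
#SBATCH --time=00:30:00
#SBATCH --mem=2G

# Slurm sets SLURM_ARRAY_TASK_ID for each array task; the fallback of 1
# only applies when the script is run by hand.
task_id="${SLURM_ARRAY_TASK_ID:-1}"

# Hypothetical naming scheme: one input file per array task.
input_file="input_${task_id}.dat"
echo "task ${task_id} would process ${input_file}"
```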
One other thing that scontrol can do is display information about the partition. For a specific partition use "scontrol show partition part", where part is the name of the partition you want information on. Not specifying a partition will return information on all partitions managed by Slurm.
[juser@picotte001 ~]$ scontrol show partition def
PartitionName=def
    AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
    AllocNodes=ALL Default=YES QoS=N/A
    DefaultTime=00:30:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
    MaxNodes=50 MaxTime=2-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
    Nodes=node[001-074]
    PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
    OverTimeLimit=NONE PreemptMode=OFF
    State=UP TotalCPUs=3552 TotalNodes=74 SelectTypeParameters=NONE
    JobDefaults=(null)
    DefMemPerCPU=3900 MaxMemPerNode=UNLIMITED
    TRESBillingWeights=CPU=1.0,GRES/gpu=0,Mem=0
sinfo♯
sinfo is used to report on the state of the partitions and nodes
managed by Slurm.
[juser@picotte001 ~]$ sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
bm             up 21-00:00:0       2    idle bigmem[001-002]
def*           up 2-00:00:00       1     mix node002
def*           up 2-00:00:00      12   alloc node[001,014-024]
def*           up 2-00:00:00      61    idle node[003-013,025-074]
gpu            up 1-00:00:00       9     mix gpu[001-005,009-012]
gpu            up 1-00:00:00       3     idle gpu[006-008]
If you load the slurm_util module, you will have the
sinfo_detail alias, which will produce output like:
[juser@picotte001 ~]$ sinfo_detail
NODELIST      NODES PART       STATE CPUS    S:C:T   MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
bigmem001         1 bm         mixed   48   4:12:1  1546000   174864      1   (null) none
bigmem002         1 bm          idle   48   4:12:1  1546000   174864      1   (null) none
gpu001            1 gpu         idle   48   4:12:1   192000   174864      1   (null) none
gpu002            1 gpu         idle   48   4:12:1   192000   174864      1   (null) none
gpu003            1 gpu         idle   48   4:12:1   192000   174864      1   (null) none
gpu004            1 gpu         idle   48   4:12:1   192000   174864      1   (null) none
gpu005            1 gpu         idle   48   4:12:1   192000   174864      1   (null) none
gpu006            1 gpu         idle   48   4:12:1   192000   174864      1   (null) none
gpu007            1 gpu         idle   48   4:12:1   192000   174864      1   (null) none
gpu008            1 gpu         idle   48   4:12:1   192000   174864      1   (null) none
gpu009            1 gpu         idle   48   4:12:1   192000   174864      1   (null) none
gpu010            1 gpu         idle   48   4:12:1   192000   174864      1   (null) none
gpu011            1 gpu         idle   48   4:12:1   192000   174864      1   (null) none
gpu012            1 gpu         idle   48   4:12:1   192000   174864      1   (null) none
node001           1 def*   allocated   48   4:12:1   192000   174864      1   (null) none
node002           1 def*   allocated   48   4:12:1   192000   174864      1   (null) none
node003           1 def*   allocated   48   4:12:1   192000   174864      1   (null) none
node004           1 def*   allocated   48   4:12:1   192000   174864      1   (null) none
node005           1 def*   allocated   48   4:12:1   192000   174864      1   (null) none
node006           1 def*   allocated   48   4:12:1   192000   174864      1   (null) none
node007           1 def*   allocated   48   4:12:1   192000   174864      1   (null) none
node008           1 def*        idle   48   4:12:1   192000   174864      1   (null) none
node009           1 def*        idle   48   4:12:1   192000   174864      1   (null) none
node010           1 def*        idle   48   4:12:1   192000   174864      1   (null) none
...
node072           1 def*        idle   48   4:12:1   192000   174864      1   (null) none
node073           1 def*        idle   48   4:12:1   192000   174864      1   (null) none
node074           1 def*        idle   48   4:12:1   192000   174864      1   (null) none
sacct♯
sacct is used to report accounting information on active or
completed jobs.
Active job:
[juser@picotte001 ~]$ sbatch testJob.sh
Submitted batch job 127
[juser@picotte001 ~]$ sacct -j 127
           JobID    JobName  Partition    Account  AllocCPUS       State ExitCode
    ------------ ---------- ---------- ---------- ---------- ---------- --------
    127             testJob.sh         RM      myGrp           28    RUNNING      0:0
Completed job:
[juser@picotte001 python]$ sacct -j 127
           JobID    JobName  Partition    Account  AllocCPUS       State ExitCode
    ------------ ---------- ---------- ---------- ---------- ---------- --------
    127             testJob.sh         RM      myGrp           28  COMPLETED      0:0
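The columns can be chosen with --format, and --parsable2 together with --noheader produces pipe-delimited output that is easier to process in scripts. The sketch below lists jobs that are in any state other than COMPLETED (which includes jobs still running). not_completed is a hypothetical helper, not part of Slurm.

```shell
# Print "JobID State" for jobs whose state is not COMPLETED, reading
# pipe-delimited "JobID|State|ExitCode" lines on stdin.
not_completed() {
    awk -F'|' '$2 != "COMPLETED" { print $1, $2 }'
}

# Typical use on the cluster (only runs where sacct is available).
# -X restricts the report to whole jobs rather than individual job steps.
if command -v sacct >/dev/null 2>&1; then
    sacct -X --noheader --parsable2 --format=JobID,State,ExitCode | not_completed
fi
```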
sstat♯
sstat shows status information on running jobs.
[juser@picotte001 ~]$ sstat --jobs 123456
... [very long output omitted] ...
sjstat♯
sjstat displays statistics of jobs. Documentation is via -h for a brief help
message, and the man page.
Basic usage:
[juser@picotte001 ~]$ sjstat
Scheduling pool data:
-------------------------------------------------------------
Pool        Memory  Cpus  Total Usable   Free  Other Traits
-------------------------------------------------------------
bm        1546000Mb    48      2      2      2
def*       192000Mb    48     74     74     69
gpu        192000Mb    48     12     12     12
long       192000Mb    48     74     74     69
gpulong    192000Mb    48     12     12     12
Running job data:
----------------------------------------------------------------------
JobID    User      Nodes Pool      Status        Used  Master/Other
----------------------------------------------------------------------
841713   mb3544        1 def       R         13:05:12  node001
841766   wc492         1 def       R             1:16  node001
841756   wc492         1 def       R          1:22:22  node001
841758   rs3597        1 def       R            58:51  node001
841551   aag99         1 def       R         19:41:48  node002
841550   aag99         1 def       R         19:42:05  node041
841513   aag99         1 def       R       1-00:44:17  node005
841514   aag99         1 def       R       1-00:44:17  node040
seff♯
seff is undocumented. It reports "efficiency" statistics, i.e. how much of each requested resource was actually used, expressed as a percentage of the request. It works on completed jobs. If the reported efficiency is low, the resource requests (amount of memory, number of CPUs) should be reduced in future runs.
Example - a job which requested 16 cores (slots) and 128 GB of memory:
[juser@picotte001 ~]$ seff 12345
Job ID: 12345
Cluster: picotte
User/Group: juser/juser
State: OUT_OF_MEMORY (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 06:50:30
CPU Efficiency: 11.96% of 2-09:10:56 core-walltime
Job Wall-clock time: 03:34:26
Memory Utilized: 1.54 GB
Memory Efficiency: 1.21% of 128.00 GB
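The CPU efficiency figure can be reproduced by hand: it is the CPU time used divided by the core-walltime, which is the number of cores multiplied by the wall-clock time. The sketch below redoes that arithmetic for the example above; to_seconds is a hypothetical helper written for this illustration.

```shell
# Convert a [D-]HH:MM:SS string to seconds.
to_seconds() {
    echo "$1" | awk -F'[-:]' '{
        if (NF == 4) print $1 * 86400 + $2 * 3600 + $3 * 60 + $4
        else         print $1 * 3600 + $2 * 60 + $3
    }'
}

cores=16
cpu_used=$(to_seconds 06:50:30)      # "CPU Utilized"
wall=$(to_seconds 03:34:26)          # "Job Wall-clock time"
core_walltime=$(( cores * wall ))    # same number of seconds as 2-09:10:56

efficiency=$(awk -v u="$cpu_used" -v w="$core_walltime" 'BEGIN { printf "%.2f", 100 * u / w }')
echo "CPU efficiency: ${efficiency}%"
```

This prints "CPU efficiency: 11.96%", matching the seff report.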
salloc♯
salloc is used to allocate resources
for a job in real-time. It will spawn a new shell with the appropriate Slurm
environment variables set. Once the resources have been allocated, you may ssh
directly to the allocated nodes.
sreport♯
sreport is used to generate reports of job usage and cluster
utilization for Slurm jobs saved to the Slurm Database, slurmdbd.
Utility Scripts and aliases♯
We have also installed a few Slurm helper scripts, which make it easier to get more comprehensive output than the defaults.
To use these local aliases, load the modulefile:
module load slurm_util
For details, see: Slurm Utility Commands.
Recommendations♯
Single-node, single-threaded jobs♯
Do not specify any sbatch options related to nodes, tasks, or CPUs. The default behavior for a job is that it is single-node and single-threaded.
Single-node, multi-threaded jobs♯
Rather than specify some combination of values for
"ntasks/ntasks-per-node" and "cpus-per-task", just use
"cpus-per-task" ("ntasks/ntasks-per-node" will take on their default
value of 1).
For example, to run a multithreaded process on a single node using 48 threads (1 thread per core):
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=48
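A fuller sketch of such a job script, assuming an OpenMP program, might look like the following. The program name ./my_threaded_program is hypothetical and is only echoed here, not run. SLURM_CPUS_PER_TASK is set by Slurm when --cpus-per-task is given; the fallback of 1 is an assumption added so the script can be run by hand.

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=48

# Match the thread count to the number of cores Slurm allocated.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"

# Hypothetical program; replace the echo with the real command line.
echo "would run ./my_threaded_program with ${OMP_NUM_THREADS} threads"
```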
Multi-node (MPI) jobs♯
For MPI jobs that will run on multiple nodes, it may be useful to map
"ntasks-per-node" to the number of MPI ranks per node, and
"cpus-per-task" to the number of threads per rank.
E.g. a 4 node job which will run 12 ranks per node, and 4 threads per rank:
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=12
#SBATCH --cpus-per-task=4
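Since MPI programs are to be launched with mpirun from a batch script, the body of such a script might be sketched as below. It assumes a suitable MPI module has already been loaded (module names are site-specific and not shown), and some_mpi_program is a placeholder, so the mpirun command is only printed rather than executed. The fallback values are assumptions that reproduce the request above when the script is run outside Slurm.

```shell
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=12
#SBATCH --cpus-per-task=4

# Total ranks = nodes x ranks per node. Slurm normally exports
# SLURM_NTASKS inside the job; the fallback repeats the arithmetic
# for the request above.
nodes=4
ranks_per_node=12
total_ranks="${SLURM_NTASKS:-$(( nodes * ranks_per_node ))}"

# One thread per allocated core within each rank.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-4}"

# some_mpi_program is a placeholder, so the command is only printed.
echo "mpirun -np ${total_ranks} ./some_mpi_program"
```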
GPU jobs♯
GPU-enabled programs always perform some part of their computation on CPUs. Since each GPU node has 4 GPU devices and 48 cores (12 cores per GPU), a sensible default is:
#SBATCH --partition=gpu
#SBATCH --nodes=4
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-gpu=12
N.B. "cpus-per-gpu" and "cpus-per-task" are mutually exclusive: if
one is set, the other must not be set.
Partitions♯
In Slurm, a partition is a grouping of nodes. In Picotte, partitions are defined for node types (standard compute, big memory, GPU), and also for run time (the "long" partitions are for jobs which run for longer than 48 hours).
| Partition | Type of nodes | Notes | 
|---|---|---|
| def | Standard compute (48-core, 187 GB RAM) | This is the default partition. Jobs which do not specify a partition are placed in "def". Up to 48 hours wallclock. | 
| bm | Big memory (48-core, 1.5 TB RAM) | Up to 504 hours wallclock. Must request at least 200 GiB memory per node. | 
| gpu | GPU compute (48-core, 187 GB RAM, 4x Nvidia Tesla V100) | GPU devices have to be requested with "--gres=gpu:N". Up to 36 hours wallclock. | 
| long | Standard compute | Long job version of "def". Up to 192 hours of wallclock. | 
| gpulong | GPU compute | Long job version of "gpu". Up to 192 hours of wallclock. | 
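The wallclock limits above are given in hours, while sinfo and scontrol report time limits as days-hh:mm:ss, a form that --time also accepts. A small sketch for converting whole hours into that form follows; hours_to_slurm_time is a hypothetical helper, not a Slurm command.

```shell
# Convert a whole number of hours to Slurm's days-hh:mm:ss notation.
hours_to_slurm_time() {
    awk -v h="$1" 'BEGIN { printf "%d-%02d:00:00\n", int(h / 24), h % 24 }'
}

hours_to_slurm_time 48    # prints 2-00:00:00
hours_to_slurm_time 192   # prints 8-00:00:00
hours_to_slurm_time 504   # prints 21-00:00:00
```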
Debugging Problems with Jobs♯
If your job is not running, try resubmitting it with the option "--test-only". This validates your job script and provides an estimate of when the job would run, without actually running the job.