Recommendations for Jobs
This article presumes familiarity with the information in Writing Job Scripts. HPC is like a race car with a manual transmission: there are no one-size-fits-all solutions.
All Jobs
These two limits and one resource should always be requested:
- h_rt -- max. amount of wall clock time the job should run
- m_mem_free -- min. amount of free memory per slot that a node must have in order to begin running the job
- h_vmem -- max. amount of memory (technically, address space) per slot that can be consumed by a job
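For example, a job script might contain the following requests (the values shown are illustrative; size them to your job's actual needs):

#$ -l h_rt=12:00:00
#$ -l m_mem_free=2G
#$ -l h_vmem=3G

Since address space usage typically exceeds resident memory, h_vmem is usually set somewhat higher than m_mem_free.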
Bear in mind the installed hardware when making resource requests. See below.
Filesystems
The filesystem which your job uses can also affect how fast your job runs. Some jobs are disk i/o (input/output) intensive, some are not. There are three types of filesystems in use:
- local hard drive (/local/scratch) - the fastest, but local to each node. Its path is set in the environment variables TMPDIR and TMP. Many programs will look for one of these environment variables, but many do not; you must check your software.
- NFS filesystem - this is where your home and group directories are hosted. Avoid doing a lot of i/o to this filesystem when running jobs, unless absolutely necessary. Some software, e.g. Abaqus, does not support running on a Lustre filesystem, and so must use NFS instead.
- Lustre filesystem (/scratch = /lustre/scratch) - the fast parallel scratch filesystem. If you have many processes which need to write different files in the same directory structure, use this. Limitation: there is no file locking, i.e. it does not support multiple processes trying to write to the same file.
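A common pattern is to stage files to node-local scratch, run there, and copy results back before the job ends. A minimal sketch (my_program and the file names are placeholders for your application):

#!/bin/bash
#$ -cwd

# Stage input to node-local scratch; $TMPDIR is set per job by Grid Engine
cp input.dat "$TMPDIR/"
cd "$TMPDIR"

# Run from local scratch to avoid i/o on the NFS filesystem
"$SGE_O_WORKDIR/my_program" input.dat > output.dat

# Copy results back to the submission directory before the job ends
cp output.dat "$SGE_O_WORKDIR/"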
Serial Jobs
- Do not use a parallel environment (PE) because serial jobs are not parallel
- If you have many (say, more than 25) serial jobs to run, use a job array.
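If each input file is named by task number, a job array script might look like this sketch (the program name and input naming scheme are placeholders):

# Run 100 tasks, numbered 1 through 100; $SGE_TASK_ID identifies each task
#$ -t 1-100
./my_serial_program input.${SGE_TASK_ID}.dat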
Parallel Jobs
Processor cores are typically grouped in multiples of 4, so the most efficient layouts involve splitting the job such that no fewer than 4 processes run on each node.[1]
Inter-node communication is very slow compared to intra-node communication. So, the fewer nodes used in a job, the better. That may mean waiting in the pending list for enough resources to accumulate, but the savings in run time may greatly outweigh the extra wait time.
If you have a multi-node job, it is best to completely use all slots on all nodes, e.g. a 256-slot job split over 16 Intel nodes (16 slots per node) is better than being split over 256 nodes (1 slot per node).
Host Vendor/Microarchitecture
- Run on exclusively Intel, or exclusively AMD nodes. There are two ways to do this:
  - specify the host group: all.q@@intelhosts or all.q@@amdhosts
  - specify the vendor resource: #$ -l vendor=intel, or #$ -l vendor=amd
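Either form goes in the job script. For example, to restrict a job to Intel nodes, use one of:

# Option 1: request the Intel host group via the queue specifier
#$ -q all.q@@intelhosts

# Option 2: request the vendor resource
#$ -l vendor=intel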
MPI Implementations
- Use one of the locally compiled MPI implementations, which include Grid Engine integration. These are provided by modules named "proteus-*".
- Use Intel MPI with Intel compilers -- this MPI implementation also includes Grid Engine integration.
- Grid Engine integration means that the mpirun command line need not specify the number of processes: mpirun reads the environment to determine which hosts and how many processes per host are involved.
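With a Grid Engine-integrated MPI, the job script can be as simple as the following sketch (the module name and program are placeholders; use "module avail" to see the actual proteus-* modules, and see the next section for the fixed16 PE):

#$ -pe fixed16 32

# Load a Grid Engine-integrated MPI; the exact module name is illustrative
module load proteus-openmpi

# No -np or hostfile needed: mpirun determines hosts and process counts
# from the Grid Engine allocation
mpirun ./my_mpi_program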
Parallel Environments
Use the "fixedNN
" parallel environments for multi-node jobs:
- To run on Intel nodes, use the "
fixed16
" PE. - To run on AMD nodes, use the "
fixed64
" PE.
Use the "shm
" parallel environment for single-node jobs. This may
also be called "shared-memory parallel".
Currently, all the Intel nodes have 16 slots, and all the AMD nodes have
64 slots. If your job requires less than or equal to those numbers of
slots, use the shm
parallel environment. E.g.:
#$ -pe shm 8
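For a multi-node job, request a fixedNN PE with a total slot count that is a multiple of the per-node slot count, so that every node is fully used. E.g., 32 slots on two full Intel nodes:

#$ -pe fixed16 32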
There may be specific PEs for certain software which have non-standard ways of specifying the environment.
Resource Reservation
If your job uses 64 or more slots, request resource reservation:
#$ -R y
This directs UGE to accumulate resources as they become idle in order to fulfill the requirements of the job.
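E.g., combined with a large PE request:

# Reserve slots for a 64-slot job spanning 4 full Intel nodes
#$ -pe fixed16 64
#$ -R y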
References
[1] Colorado School of Mines CSCI 341 - Yong Joseph Bakos (PDF file)