
Recommendations for Jobs

This article presumes familiarity with the information in Writing Job Scripts. HPC is like a manual transmission race car, and there are no one-size-fits-all solutions.

All Jobs

These two limits and one resource should always be requested:

  • h_rt -- max. amount of wall clock time the job should run
  • m_mem_free -- min. amount of free memory per slot that a node must have in order to begin running the job
  • h_vmem -- max. amount of memory (technically, address space) per slot that can be consumed by a job

Bear in mind the installed hardware when making resource requests. See below.
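
As a sketch, a job script might request these with lines like the following (the time and memory values are placeholders; choose values appropriate for your job and the installed hardware):

# placeholder values -- adjust to your job's actual needs
#$ -l h_rt=24:00:00
#$ -l m_mem_free=3G
#$ -l h_vmem=4G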

Filesystems

The filesystem which your job uses can also affect how fast your job runs. Some jobs are disk i/o (input/output) intensive, some are not. There are three types of filesystems in use:

  • local hard drive (/local/scratch) - the fastest, but local to each node. Its path is set in the environment variables TMPDIR and TMP. Many programs will look for one of these environment variables, but many do not; you must check your software.
  • NFS filesystem - this is where your home and group directories are hosted. Avoid doing a lot of i/o to this filesystem when running jobs unless absolutely necessary. Some software does not support running on a Lustre filesystem (e.g. Abaqus), in which case the NFS filesystem must be used instead.
  • Lustre filesystem (/scratch = /lustre/scratch) - this is the fast parallel scratch filesystem. If you have many processes which need to write different files in the same directory structure, use this. Limitation: there is no file locking, i.e. it does not support multiple processes writing to the same file.
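
For an i/o-intensive job, one common pattern (sketched here with hypothetical file and program names) is to stage data through the node-local scratch directory:

# stage input to node-local scratch, run there, then copy results back
cp $HOME/mydata/input.dat $TMPDIR/
cd $TMPDIR
$HOME/bin/myprogram input.dat > output.dat
cp output.dat $HOME/mydata/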

Serial Jobs

  • Do not use a parallel environment (PE), because serial jobs are not parallel.
  • If you have many (say, more than 25) serial jobs to run, use a job array.
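
A minimal job array sketch, assuming 100 input files named input_1.dat through input_100.dat and a hypothetical program myprogram:

# run 100 tasks; Grid Engine sets SGE_TASK_ID to 1..100, one value per task
#$ -t 1-100
$HOME/bin/myprogram input_${SGE_TASK_ID}.dat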

Parallel Jobs

Processor cores are typically grouped in multiples of 4, so the most efficient layouts split the job such that no fewer than 4 processes run on each node.[1]

Inter-node communication is very slow compared to intra-node communication. So, the fewer nodes used in a job, the better. That may mean waiting in the pending list for enough resources to accumulate, but the savings in run time may greatly outweigh the extra wait time.

If you have a multi-node job, it is best to completely fill all slots on the nodes used, e.g. a 256-slot job split over 16 Intel nodes (16 slots per node) is better than one split over 256 nodes (1 slot per node).

Host Vendor/Microarchitecture

  • Run exclusively on Intel nodes or exclusively on AMD nodes. There are two ways to do this:
    • specify the host group: all.q@@intelhosts or all.q@@amdhosts
    • specify the vendor resource: #$ -l vendor=intel, or #$ -l vendor=amd
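
For example, to restrict a job to Intel nodes, either one of the following lines (not both) could appear in the job script:

# pick ONE of these two lines
#$ -q all.q@@intelhosts
#$ -l vendor=intel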

MPI Implementations

  • Use one of the locally compiled MPI implementations, which include Grid Engine integration. These are provided by modules named "proteus-*".
  • Use Intel MPI with Intel compilers -- this MPI implementation also includes Grid Engine integration.
  • Grid Engine integration means that the mpirun command line need not specify the number of processes: the environment will be read by mpirun in order to determine which hosts and how many processes per host are involved.
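
A minimal sketch, with a hypothetical module name and program -- note that mpirun takes no "-np" argument because Grid Engine supplies the host list and process counts:

# load a locally compiled, Grid Engine-integrated MPI (exact module name will differ)
module load proteus-openmpi/gcc
# no -np needed: slot counts come from the Grid Engine environment
mpirun $HOME/bin/my_mpi_program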

Parallel Environments

Use the "fixedNN" parallel environments for multi-node jobs:

  • To run on Intel nodes, use the "fixed16" PE.
  • To run on AMD nodes, use the "fixed64" PE.
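
For example, a 64-slot job on Intel nodes would be allocated 4 whole nodes (16 slots each):

#$ -pe fixed16 64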

Use the "shm" parallel environment for single-node jobs. This may also be called "shared-memory parallel".

Currently, all the Intel nodes have 16 slots, and all the AMD nodes have 64 slots. If your job requires no more slots than a single node provides, use the shm parallel environment. E.g.:

#$ -pe shm 8

There may be specific PEs for certain software which have non-standard ways of specifying the environment.

Resource Reservation

If your job uses >= 64 slots:

#$ -R y

This directs UGE to accumulate resources as they become idle in order to fulfill the requirements of the job.

References

[1] Colorado School of Mines CSCI 341 - Yong Joseph Bakos (PDF file)