Skip to content

Job Scheduling Policy

Share Tree a.k.a. Fair Share

Job scheduling ​is a complex problem. In Grid Engine on Proteus, when a job is submitted by a user, the job is placed in a pending list. Periodically, the scheduler will attempt to assign the highest priority jobs in the pending list to the most appropriate available resources. A similar process happens in Slurm on Picotte. The process of scheduling a job has two distinct stages:

  1. Job Selection - every job in the pending job list is assigned a priority (a scalar value), and the entire list is sorted in order of priority, highest priority first.
  2. Job Scheduling - this is where a job is assigned to a set of free resources. The system attempts to find suitable resources for the jobs in priority sequence.

The diagram below shows all the parameters which go into the calculation of a job's priority.

800px

The primary​​ scheduling algorithm used on Proteus is ShareTree (known also as Share, or Fair Share in other implementations) based on work by Kay and Lauder[1]​[2]. The algorithm maintains a balance between ensuring that users/groups get their entitlement of resources, and ensuring that compute resources are not unnecessarily idle. The algorithm allows groups to temporarily exceed their target entitlement if there are excess resources available, and scales the allocation back as resources become scarcer. It aims to ensure that resource allocation ratios between groups are as close to predefined targets over a "long" period. In order to do this, Share Tree maintains a memory of past resource usage, and this memory "decays" with time. The decay rate is specified by a "halftime" (i.e. half life) parameter.

The amount of short term excess allocation is specified by the "compensation factor", CF. The short-term excess allocation may not exceed CF times the target allocation. This prevents any one group from monopolizing the system in the short term, ensuring some free resources are available for new jobs to start.

Each group is assigned a target share, specified by a number of "tickets". (NB These are not "tickets" in the common sense, which are consumed. These "tickets" merely serve as a method of assigning relative resource allocations.) These tickets do not specify particular proportions of resources to which a group is entitled. The relative importance may vary during operation as system load varies. For a specific example, please see the Beginner's Guide to Sun Grid Engine 6.2.[3]

800px

Share Policy Parameters

  • Usage is based solely on CPU time.
  • Halftime (i.e. usage half life) = 168 hours ~ 7 days. After 30 days, the usage weight is down to ~5%
  • Compensation factor = 2.5

Override Policy

More than one scheduling algorithm may be at play at any time. One common application is if an external deadline or crunch time is known ahead of time: we may reserve some amount of resources in order to ensure timely completion of jobs under deadline. In such a case, an Override Ticket Policy may be put in place. Please contact URCF Support if you require increased priority to meet a deadline.

Backfilling

There may be times when a job waits for enough free resources to accumulate before it can run. This may hold up the queue, which may include some number of smaller jobs, where smaller means fewer required resources as well as shorter run time. In such a case, "backfilling" may come into play: these lower priority small jobs may start running without interfering with the projected completion time of the larger job. Doing this increases the total job throughput.

Proteus Scheduler

The graph below shows the scheduler parameters for Proteus as of 2014-04-01: center|800px

Picotte Scheduler

Current Limits

The following global limits are in place 2019-08-23:

  • a maximum of 1024 slots concurrently per project. (This limit may be adjusted without notice based on total workload on Picotte.)
  • jobs in the def partition may not run longer than 48 hours
    • for longer-running jobs, use either the long or gpulong partitions

Changes may be made without notification.

Exceptions can be made to accommodate deadlines (publications, conferences, etc): please contact URCF Support

Picotte Scheduler

Picotte's Slurm job scheduler uses a similar algorithm, which they call Fair Tree.

References

[1] Kay, J. and Lauder, P., Commun. ACM, 31(1), Jan. 1988

[2] Mangalam, H., Pay for Priority

[3] Beginner's Guide to Sun Grid Engine 6.2: Installation and Configuration (Sep. 2008)