Thread Affinity and NUMA
WORK IN PROGRESS -- not at all complete
Thread affinity, also known as CPU affinity, core affinity, core binding, thread binding, or thread pinning, is the assignment of individual threads of execution to specific processor cores, which may give better performance.
NUMA (Non-Uniform Memory Access) is a memory design which allows faster access from processor cores located closer to the memory device.[1]
Explicit Binding
Grid Engine is able to explicitly specify the processor cores which are assigned to a job. It keeps track of which cores have been assigned to jobs, and only assigns those cores which are free for use.
However, different software specifies the available processor cores in different formats: OpenMP has one format, MPI implementations have another, and there may also be vendor-specific formats. For example, the standard OpenMP syntax for explicit binding is:
export OMP_PLACES="{0,4,8,12}"
and the equivalent for Intel Composer:
export KMP_AFFINITY="granularity=fine,proclist=[0,4,8,12],explicit"
Meanwhile, the PE_HOSTFILE contains:
ic03n01.cm.cluster 4 all.q@ic03n01.cm.cluster 0,0:0,4:1,0:1,4
i.e. four slots on ic03n01, bound to socket 0 core 0, socket 0 core 4, socket 1 core 0, and socket 1 core 4 (on a node with 8 cores per socket, these are OS processors 0, 4, 8, and 12).
The trick is to translate Grid Engine's processor list to the appropriate format. This is done within the Parallel Environments, so that users do not have to handle the format transformation.
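As an illustration only, here is a minimal sketch of such a translation, assuming 8 cores per socket (as on the Intel nodes described below) and sequential OS processor numbering; the variable names and the awk commands are hypothetical, not the cluster's actual PE scripts:
# Take the binding field (4th column) of this host's line in $PE_HOSTFILE,
# e.g. "0,0:0,4:1,0:1,4", and convert each socket,core pair into an OS
# processor number as socket * CORES_PER_SOCKET + core.
CORES_PER_SOCKET=8
BINDING=$(awk -v h="$(hostname)" '$1 ~ h {print $4}' $PE_HOSTFILE)
PROCS=$(echo "$BINDING" | tr ':' '\n' | \
        awk -F, -v n=$CORES_PER_SOCKET '{printf "%s%d", sep, $1*n+$2; sep=","}')
export OMP_PLACES="{$PROCS}"
With the example PE_HOSTFILE entry above, this yields OMP_PLACES="{0,4,8,12}", matching the explicit binding shown earlier.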
Overriding Defaults
Your job script may override the binding specifications by setting values for the appropriate environment variables. In doing so, you must be careful to use only those cores available to your job, as given by the contents of $PE_HOSTFILE.
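For example, before overriding anything you can inspect which cores Grid Engine has listed for the current host (matching on the hostname is an assumption; adjust to your site's naming):
# Show this host's slot count and binding as assigned by Grid Engine
grep "^$(hostname)" $PE_HOSTFILE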
Automatic Binding
cgroups limit each job to accessing only those cores which have been assigned to it.
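You can verify, from within a running job, which cores the job is actually allowed to use:
# Report the CPU affinity of the current shell and its children
taskset -cp $$
grep Cpus_allowed_list /proc/self/status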
OpenMP
OpenMP is an application programming interface (API) for shared-memory parallelism, which usually means multi-threaded processing in commodity servers. OpenMP is merely an API specification.[2] The latest published API specification is version 4.0. Compiler vendors may decide to support some earlier versions of the API. A list of compilers and the OpenMP versions they support is maintained on the OpenMP website.
N.B. OpenMP is different from MPI (Message-Passing Interface): MPI is sometimes described as "coarse-grained parallelism", while OpenMP may be termed "fine-grained parallelism". However, code may be written as hybrid OpenMP-MPI programs, with node-local parallelism using OpenMP and inter-node parallelism using MPI. See below.
GCC 4.8 with OpenMP
The GCC project's implementation of OpenMP is called GOMP.[3] GCC 4.8 (available using the module gcc/4.8.1) supports OpenMP 3.1. See the GNU libgomp documentation for details.
To enable thread affinity, request the "shm" parallel environment and set two environment variables in your job script:[4]
#$ -pe shm 8
#$ -binding pe linear:8
...
export OMP_NUM_THREADS=$NSLOTS
export OMP_PROC_BIND=true
./myprogram
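For reference, the program itself would be compiled with GCC's OpenMP support enabled (the source file name is illustrative):
# Build with OpenMP (GOMP) enabled
gcc -fopenmp -O2 -o myprogram myprogram.c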
Intel Composer XE with OpenMP
Intel provides support for three different sets of OpenMP environment variables: the standard OpenMP OMP_* variables, the GNU OpenMP GOMP_* variables, and its own KMP_* variables.
Information about thread layout will be printed out by setting:
export KMP_AFFINITY="verbose,none"
However, setting KMP_AFFINITY will override all binding-related OMP_* environment variables, so OMP_PROC_BIND and OMP_PLACES will be ignored. Similarly, GOMP_CPU_AFFINITY will be ignored.
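A minimal job-script sketch for Intel-compiled OpenMP code, assuming the same "shm" parallel environment as in the GCC example above; the KMP_AFFINITY value shown is just one reasonable choice:
#$ -pe shm 8
#$ -binding pe linear:8
export OMP_NUM_THREADS=$NSLOTS
# Bind each thread to a single hardware thread, packing threads onto
# adjacent cores, and print the resulting layout
export KMP_AFFINITY="verbose,granularity=fine,compact"
./myprogram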
$PE_HOSTFILE from Grid Engine
See the section on "-binding" in the qsub(1) man page.
Job:
- -pe intelmpi 62
- -binding pe striding:2:8
Env. vars.
- NHOSTS = 4
- NSLOTS = 62
- PE_HOSTFILE contents:
ic19n01.cm.cluster 16 debug.q@ic19n01.cm.cluster 0,0:1,0
ic19n04.cm.cluster 16 debug.q@ic19n04.cm.cluster 0,0:1,0
ic19n02.cm.cluster 16 debug.q@ic19n02.cm.cluster 0,0:1,0
ic10n04.cm.cluster 14 debug.q@ic10n04.cm.cluster 0,0:1,0
Format:
hostname n_slots queue_name socket,core:socket,core
OpenMP expects cores to be identified by OS processor numbers, which are assigned sequentially at the leaf (PU) level of the topology, rather than by socket,core pairs.
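As a worked example (assuming 2 sockets of 8 cores each, as on the Intel nodes shown below), the binding "0,0:1,0" from the job above maps to OS processors 0 and 8; the helper function is hypothetical:
# Convert a socket,core pair to an OS processor number, assuming
# sequential numbering and 8 cores per socket
socket_core_to_proc() { echo $(( $1 * 8 + $2 )); }
socket_core_to_proc 0 0    # socket 0, core 0 -> processor 0
socket_core_to_proc 1 0    # socket 1, core 0 -> processor 8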
Intel Nodes
hwloc-ls:
Machine (64GB)
NUMANode L#0 (P#0 32GB)
Socket L#0 + L3 L#0 (20MB)
L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
HostBridge L#0
PCIBridge
PCI 8086:1521
Net L#0 "eth0"
PCI 8086:1521
Net L#1 "eth1"
PCIBridge
PCIBridge
PCI 1a03:2000
PCI 8086:1d02
Block L#2 "sda"
NUMANode L#1 (P#1 32GB)
Socket L#1 + L3 L#1 (20MB)
L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8)
L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9)
L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#10)
L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11)
L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#12)
L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#13)
L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#14)
L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#15)
HostBridge L#4
PCIBridge
PCI 15b3:1003
Net L#3 "ib0"
OpenFabrics L#4 "mlx4_0"
AMD Nodes
hwloc-ls:
Machine (256GB)
Socket L#0 (64GB)
NUMANode L#0 (P#0 32GB) + L3 L#0 (6144KB)
L2 L#0 (2048KB) + L1i L#0 (64KB)
L1d L#0 (16KB) + Core L#0 + PU L#0 (P#0)
L1d L#1 (16KB) + Core L#1 + PU L#1 (P#1)
L2 L#1 (2048KB) + L1i L#1 (64KB)
L1d L#2 (16KB) + Core L#2 + PU L#2 (P#2)
L1d L#3 (16KB) + Core L#3 + PU L#3 (P#3)
L2 L#2 (2048KB) + L1i L#2 (64KB)
L1d L#4 (16KB) + Core L#4 + PU L#4 (P#4)
L1d L#5 (16KB) + Core L#5 + PU L#5 (P#5)
L2 L#3 (2048KB) + L1i L#3 (64KB)
L1d L#6 (16KB) + Core L#6 + PU L#6 (P#6)
L1d L#7 (16KB) + Core L#7 + PU L#7 (P#7)
NUMANode L#1 (P#1 32GB) + L3 L#1 (6144KB)
L2 L#4 (2048KB) + L1i L#4 (64KB)
L1d L#8 (16KB) + Core L#8 + PU L#8 (P#8)
L1d L#9 (16KB) + Core L#9 + PU L#9 (P#9)
L2 L#5 (2048KB) + L1i L#5 (64KB)
L1d L#10 (16KB) + Core L#10 + PU L#10 (P#10)
L1d L#11 (16KB) + Core L#11 + PU L#11 (P#11)
L2 L#6 (2048KB) + L1i L#6 (64KB)
L1d L#12 (16KB) + Core L#12 + PU L#12 (P#12)
L1d L#13 (16KB) + Core L#13 + PU L#13 (P#13)
L2 L#7 (2048KB) + L1i L#7 (64KB)
L1d L#14 (16KB) + Core L#14 + PU L#14 (P#14)
L1d L#15 (16KB) + Core L#15 + PU L#15 (P#15)
Socket L#1 (64GB)
NUMANode L#2 (P#2 32GB) + L3 L#2 (6144KB)
L2 L#8 (2048KB) + L1i L#8 (64KB)
L1d L#16 (16KB) + Core L#16 + PU L#16 (P#16)
L1d L#17 (16KB) + Core L#17 + PU L#17 (P#17)
L2 L#9 (2048KB) + L1i L#9 (64KB)
L1d L#18 (16KB) + Core L#18 + PU L#18 (P#18)
L1d L#19 (16KB) + Core L#19 + PU L#19 (P#19)
L2 L#10 (2048KB) + L1i L#10 (64KB)
L1d L#20 (16KB) + Core L#20 + PU L#20 (P#20)
L1d L#21 (16KB) + Core L#21 + PU L#21 (P#21)
L2 L#11 (2048KB) + L1i L#11 (64KB)
L1d L#22 (16KB) + Core L#22 + PU L#22 (P#22)
L1d L#23 (16KB) + Core L#23 + PU L#23 (P#23)
NUMANode L#3 (P#3 32GB) + L3 L#3 (6144KB)
L2 L#12 (2048KB) + L1i L#12 (64KB)
L1d L#24 (16KB) + Core L#24 + PU L#24 (P#24)
L1d L#25 (16KB) + Core L#25 + PU L#25 (P#25)
L2 L#13 (2048KB) + L1i L#13 (64KB)
L1d L#26 (16KB) + Core L#26 + PU L#26 (P#26)
L1d L#27 (16KB) + Core L#27 + PU L#27 (P#27)
L2 L#14 (2048KB) + L1i L#14 (64KB)
L1d L#28 (16KB) + Core L#28 + PU L#28 (P#28)
L1d L#29 (16KB) + Core L#29 + PU L#29 (P#29)
L2 L#15 (2048KB) + L1i L#15 (64KB)
L1d L#30 (16KB) + Core L#30 + PU L#30 (P#30)
L1d L#31 (16KB) + Core L#31 + PU L#31 (P#31)
Socket L#2 (64GB)
NUMANode L#4 (P#4 32GB) + L3 L#4 (6144KB)
L2 L#16 (2048KB) + L1i L#16 (64KB)
L1d L#32 (16KB) + Core L#32 + PU L#32 (P#32)
L1d L#33 (16KB) + Core L#33 + PU L#33 (P#33)
L2 L#17 (2048KB) + L1i L#17 (64KB)
L1d L#34 (16KB) + Core L#34 + PU L#34 (P#34)
L1d L#35 (16KB) + Core L#35 + PU L#35 (P#35)
L2 L#18 (2048KB) + L1i L#18 (64KB)
L1d L#36 (16KB) + Core L#36 + PU L#36 (P#36)
L1d L#37 (16KB) + Core L#37 + PU L#37 (P#37)
L2 L#19 (2048KB) + L1i L#19 (64KB)
L1d L#38 (16KB) + Core L#38 + PU L#38 (P#38)
L1d L#39 (16KB) + Core L#39 + PU L#39 (P#39)
NUMANode L#5 (P#5 32GB) + L3 L#5 (6144KB)
L2 L#20 (2048KB) + L1i L#20 (64KB)
L1d L#40 (16KB) + Core L#40 + PU L#40 (P#40)
L1d L#41 (16KB) + Core L#41 + PU L#41 (P#41)
L2 L#21 (2048KB) + L1i L#21 (64KB)
L1d L#42 (16KB) + Core L#42 + PU L#42 (P#42)
L1d L#43 (16KB) + Core L#43 + PU L#43 (P#43)
L2 L#22 (2048KB) + L1i L#22 (64KB)
L1d L#44 (16KB) + Core L#44 + PU L#44 (P#44)
L1d L#45 (16KB) + Core L#45 + PU L#45 (P#45)
L2 L#23 (2048KB) + L1i L#23 (64KB)
L1d L#46 (16KB) + Core L#46 + PU L#46 (P#46)
L1d L#47 (16KB) + Core L#47 + PU L#47 (P#47)
Socket L#3 (64GB)
NUMANode L#6 (P#6 32GB) + L3 L#6 (6144KB)
L2 L#24 (2048KB) + L1i L#24 (64KB)
L1d L#48 (16KB) + Core L#48 + PU L#48 (P#48)
L1d L#49 (16KB) + Core L#49 + PU L#49 (P#49)
L2 L#25 (2048KB) + L1i L#25 (64KB)
L1d L#50 (16KB) + Core L#50 + PU L#50 (P#50)
L1d L#51 (16KB) + Core L#51 + PU L#51 (P#51)
L2 L#26 (2048KB) + L1i L#26 (64KB)
L1d L#52 (16KB) + Core L#52 + PU L#52 (P#52)
L1d L#53 (16KB) + Core L#53 + PU L#53 (P#53)
L2 L#27 (2048KB) + L1i L#27 (64KB)
L1d L#54 (16KB) + Core L#54 + PU L#54 (P#54)
L1d L#55 (16KB) + Core L#55 + PU L#55 (P#55)
NUMANode L#7 (P#7 32GB) + L3 L#7 (6144KB)
L2 L#28 (2048KB) + L1i L#28 (64KB)
L1d L#56 (16KB) + Core L#56 + PU L#56 (P#56)
L1d L#57 (16KB) + Core L#57 + PU L#57 (P#57)
L2 L#29 (2048KB) + L1i L#29 (64KB)
L1d L#58 (16KB) + Core L#58 + PU L#58 (P#58)
L1d L#59 (16KB) + Core L#59 + PU L#59 (P#59)
L2 L#30 (2048KB) + L1i L#30 (64KB)
L1d L#60 (16KB) + Core L#60 + PU L#60 (P#60)
L1d L#61 (16KB) + Core L#61 + PU L#61 (P#61)
L2 L#31 (2048KB) + L1i L#31 (64KB)
L1d L#62 (16KB) + Core L#62 + PU L#62 (P#62)
L1d L#63 (16KB) + Core L#63 + PU L#63 (P#63)
HostBridge L#0
PCIBridge
PCI 8086:10c9
Net L#0 "eth0"
PCI 8086:10c9
Net L#1 "eth1"
PCI 1002:4390
Block L#2 "sda"
PCI 1002:439c
PCIBridge
PCI 1a03:2000
HostBridge L#3
PCIBridge
PCI 15b3:1003
Net L#3 "ib0"
OpenFabrics L#4 "mlx4_0"
Keywords
NUMA, thread binding, core binding, thread pinning