Thread Affinity and NUMA
WORK IN PROGRESS -- not at all complete
Thread affinity, also known as CPU affinity, core affinity, core binding, thread binding, or thread pinning, is the assignment of individual threads of execution to specific processor cores, which may give better performance.
NUMA (Non-Uniform Memory Access) is a memory design which allows faster access from processor cores located closer to the memory device.[1]
Explicit Binding
Grid Engine is able to explicitly specify the processor cores which are assigned to a job. It keeps track of which cores have been assigned to jobs, and only assigns those cores which are free for use.
However, different software specifies the available processor cores in different formats: OpenMP has one format, MPI implementations have another, and there may also be vendor-specific formats. For example, the standard OpenMP syntax for explicit binding is:
export OMP_PLACES="{0,4,8,12}"
and the equivalent for Intel Composer:
export KMP_AFFINITY="granularity=fine,proclist=[0,4,8,12],explicit"
Meanwhile, the PE_HOSTFILE contains:
ic03n01.cm.cluster 4 all.q@ic03n01.cm.cluster 0,0:0,4:1,0:1,4
i.e. four slots on ic03n01, bound to socket 0 core 0, socket 0 core 4, socket 1 core 0, and socket 1 core 4 (on a node with 8 cores per socket, these are OS processors 0, 4, 8, and 12).
The trick is to translate Grid Engine's processor list to the appropriate format. This is done within the Parallel Environments, so that users do not have to handle the format transformation.
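As an illustration only, here is a minimal sketch of such a translation, assuming 8 cores per socket (as on the Intel nodes described below) and sequential OS processor numbering; the variable names and the awk commands are hypothetical, not the cluster's actual PE scripts:
# Take the binding field (4th column) of this host's line in $PE_HOSTFILE,
# e.g. "0,0:0,4:1,0:1,4", and convert each socket,core pair into an OS
# processor number as socket * CORES_PER_SOCKET + core.
CORES_PER_SOCKET=8
BINDING=$(awk -v h="$(hostname)" '$1 ~ h {print $4}' $PE_HOSTFILE)
PROCS=$(echo "$BINDING" | tr ':' '\n' | \
        awk -F, -v n=$CORES_PER_SOCKET '{printf "%s%d", sep, $1*n+$2; sep=","}')
export OMP_PLACES="{$PROCS}"
With the example PE_HOSTFILE entry above, this yields OMP_PLACES="{0,4,8,12}", matching the explicit binding shown earlier.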
Overriding Defaults
Your job script may override the binding specifications by setting values for the appropriate environment variables. In doing so, you must be careful to use only those cores available to your job, as given by the contents of $PE_HOSTFILE.
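For example, before overriding anything you can inspect which cores Grid Engine has listed for the current host (matching on the hostname is an assumption; adjust to your site's naming):
# Show this host's slot count and binding as assigned by Grid Engine
grep "^$(hostname)" $PE_HOSTFILE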
Automatic Binding
cgroups limit each job to accessing only those cores which have been assigned to it.
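You can verify, from within a running job, which cores the job is actually allowed to use:
# Report the CPU affinity of the current shell and its children
taskset -cp $$
grep Cpus_allowed_list /proc/self/status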
OpenMP
OpenMP is an application programming interface (API) for shared-memory parallelism, which usually means multi-threaded processing in commodity servers. OpenMP is merely an API specification.[2] The latest published API specification is version 4.0. Compiler vendors may decide to support some earlier versions of the API. A list of compilers and the OpenMP versions they support is maintained on the OpenMP website.
N.B. OpenMP is different from MPI (Message-Passing Interface): MPI is sometimes described as "coarse-grained parallelism", while OpenMP may be termed "fine-grained parallelism". However, code may be written as hybrid OpenMP-MPI programs, with node-local parallelism using OpenMP and inter-node parallelism using MPI. See below.
GCC 4.8 with OpenMP
The GCC project's implementation of OpenMP is called GOMP.[3] GCC 4.8 (available using the module gcc/4.8.1) supports OpenMP 3.1. See the GNU libgomp documentation for details.
To enable thread affinity, request the "shm" parallel environment and set two environment variables in your job script:[4]
#$ -pe shm 8
#$ -binding pe linear:8
...
export OMP_NUM_THREADS=$NSLOTS
export OMP_PROC_BIND=true
./myprogram
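For reference, the program itself would be compiled with GCC's OpenMP support enabled (the source file name is illustrative):
# Build with OpenMP (GOMP) enabled
gcc -fopenmp -O2 -o myprogram myprogram.c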
Intel Composer XE with OpenMP
Intel provides support for three different sets of OpenMP environment variables: the standard OpenMP OMP_* variables, the GNU OpenMP GOMP_* variables, and its own KMP_* variables.
Information about thread layout will be printed out by setting:
export KMP_AFFINITY="verbose,none"
However, setting KMP_AFFINITY will override all binding-related OMP_* environment variables, so OMP_PROC_BIND and OMP_PLACES will be ignored. Similarly, GOMP_CPU_AFFINITY will be ignored.
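A minimal job-script sketch for Intel-compiled OpenMP code, assuming the same "shm" parallel environment as in the GCC example above; the KMP_AFFINITY value shown is just one reasonable choice:
#$ -pe shm 8
#$ -binding pe linear:8
export OMP_NUM_THREADS=$NSLOTS
# Bind each thread to a single hardware thread, packing threads onto
# adjacent cores, and print the resulting layout
export KMP_AFFINITY="verbose,granularity=fine,compact"
./myprogram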
$PE_HOSTFILE from Grid Engine
See the section on "-binding" in the qsub(1) man page.
Job:
- -pe intelmpi 62
- -binding pe striding:2:8
Env. vars.
- NHOSTS = 4
- NSLOTS = 62
- PE_HOSTFILE contents:
ic19n01.cm.cluster 16 debug.q@ic19n01.cm.cluster 0,0:1,0
ic19n04.cm.cluster 16 debug.q@ic19n04.cm.cluster 0,0:1,0
ic19n02.cm.cluster 16 debug.q@ic19n02.cm.cluster 0,0:1,0
ic10n04.cm.cluster 14 debug.q@ic10n04.cm.cluster 0,0:1,0
Format:
hostname n_slots queue_name socket,core:socket,core
OpenMP expects cores to be identified by OS processor numbers, which are assigned sequentially at the leaf (PU) level of the topology, rather than by socket,core pairs.
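As a worked example (assuming 2 sockets of 8 cores each, as on the Intel nodes shown below), the binding "0,0:1,0" from the job above maps to OS processors 0 and 8; the helper function is hypothetical:
# Convert a socket,core pair to an OS processor number, assuming
# sequential numbering and 8 cores per socket
socket_core_to_proc() { echo $(( $1 * 8 + $2 )); }
socket_core_to_proc 0 0    # socket 0, core 0 -> processor 0
socket_core_to_proc 1 0    # socket 1, core 0 -> processor 8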
Intel Nodes
hwloc-ls:
Machine (64GB)
NUMANode L#0 (P#0 32GB)
Socket L#0 + L3 L#0 (20MB)
L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
HostBridge L#0
PCIBridge
PCI 8086:1521
Net L#0 "eth0"
PCI 8086:1521
Net L#1 "eth1"
PCIBridge
PCIBridge
PCI 1a03:2000
PCI 8086:1d02
Block L#2 "sda"
NUMANode L#1 (P#1 32GB)
Socket L#1 + L3 L#1 (20MB)
L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8)
L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9)
L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#10)
L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11)
L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#12)
L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#13)
L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#14)
L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#15)
HostBridge L#4
PCIBridge
PCI 15b3:1003
Net L#3 "ib0"
OpenFabrics L#4 "mlx4_0"
AMD Nodes
hwloc-ls:
Machine (256GB)
Socket L#0 (64GB)
NUMANode L#0 (P#0 32GB) + L3 L#0 (6144KB)
L2 L#0 (2048KB) + L1i L#0 (64KB)
L1d L#0 (16KB) + Core L#0 + PU L#0 (P#0)
L1d L#1 (16KB) + Core L#1 + PU L#1 (P#1)
L2 L#1 (2048KB) + L1i L#1 (64KB)
L1d L#2 (16KB) + Core L#2 + PU L#2 (P#2)
L1d L#3 (16KB) + Core L#3 + PU L#3 (P#3)
L2 L#2 (2048KB) + L1i L#2 (64KB)
L1d L#4 (16KB) + Core L#4 + PU L#4 (P#4)
L1d L#5 (16KB) + Core L#5 + PU L#5 (P#5)
L2 L#3 (2048KB) + L1i L#3 (64KB)
L1d L#6 (16KB) + Core L#6 + PU L#6 (P#6)
L1d L#7 (16KB) + Core L#7 + PU L#7 (P#7)
NUMANode L#1 (P#1 32GB) + L3 L#1 (6144KB)
L2 L#4 (2048KB) + L1i L#4 (64KB)
L1d L#8 (16KB) + Core L#8 + PU L#8 (P#8)
L1d L#9 (16KB) + Core L#9 + PU L#9 (P#9)
L2 L#5 (2048KB) + L1i L#5 (64KB)
L1d L#10 (16KB) + Core L#10 + PU L#10 (P#10)
L1d L#11 (16KB) + Core L#11 + PU L#11 (P#11)
L2 L#6 (2048KB) + L1i L#6 (64KB)
L1d L#12 (16KB) + Core L#12 + PU L#12 (P#12)
L1d L#13 (16KB) + Core L#13 + PU L#13 (P#13)
L2 L#7 (2048KB) + L1i L#7 (64KB)
L1d L#14 (16KB) + Core L#14 + PU L#14 (P#14)
L1d L#15 (16KB) + Core L#15 + PU L#15 (P#15)
Socket L#1 (64GB)
NUMANode L#2 (P#2 32GB) + L3 L#2 (6144KB)
L2 L#8 (2048KB) + L1i L#8 (64KB)
L1d L#16 (16KB) + Core L#16 + PU L#16 (P#16)
L1d L#17 (16KB) + Core L#17 + PU L#17 (P#17)
L2 L#9 (2048KB) + L1i L#9 (64KB)
L1d L#18 (16KB) + Core L#18 + PU L#18 (P#18)
L1d L#19 (16KB) + Core L#19 + PU L#19 (P#19)
L2 L#10 (2048KB) + L1i L#10 (64KB)
L1d L#20 (16KB) + Core L#20 + PU L#20 (P#20)
L1d L#21 (16KB) + Core L#21 + PU L#21 (P#21)
L2 L#11 (2048KB) + L1i L#11 (64KB)
L1d L#22 (16KB) + Core L#22 + PU L#22 (P#22)
L1d L#23 (16KB) + Core L#23 + PU L#23 (P#23)
NUMANode L#3 (P#3 32GB) + L3 L#3 (6144KB)
L2 L#12 (2048KB) + L1i L#12 (64KB)
L1d L#24 (16KB) + Core L#24 + PU L#24 (P#24)
L1d L#25 (16KB) + Core L#25 + PU L#25 (P#25)
L2 L#13 (2048KB) + L1i L#13 (64KB)
L1d L#26 (16KB) + Core L#26 + PU L#26 (P#26)
L1d L#27 (16KB) + Core L#27 + PU L#27 (P#27)
L2 L#14 (2048KB) + L1i L#14 (64KB)
L1d L#28 (16KB) + Core L#28 + PU L#28 (P#28)
L1d L#29 (16KB) + Core L#29 + PU L#29 (P#29)
L2 L#15 (2048KB) + L1i L#15 (64KB)
L1d L#30 (16KB) + Core L#30 + PU L#30 (P#30)
L1d L#31 (16KB) + Core L#31 + PU L#31 (P#31)
Socket L#2 (64GB)
NUMANode L#4 (P#4 32GB) + L3 L#4 (6144KB)
L2 L#16 (2048KB) + L1i L#16 (64KB)
L1d L#32 (16KB) + Core L#32 + PU L#32 (P#32)
L1d L#33 (16KB) + Core L#33 + PU L#33 (P#33)
L2 L#17 (2048KB) + L1i L#17 (64KB)
L1d L#34 (16KB) + Core L#34 + PU L#34 (P#34)
L1d L#35 (16KB) + Core L#35 + PU L#35 (P#35)
L2 L#18 (2048KB) + L1i L#18 (64KB)
L1d L#36 (16KB) + Core L#36 + PU L#36 (P#36)
L1d L#37 (16KB) + Core L#37 + PU L#37 (P#37)
L2 L#19 (2048KB) + L1i L#19 (64KB)
L1d L#38 (16KB) + Core L#38 + PU L#38 (P#38)
L1d L#39 (16KB) + Core L#39 + PU L#39 (P#39)
NUMANode L#5 (P#5 32GB) + L3 L#5 (6144KB)
L2 L#20 (2048KB) + L1i L#20 (64KB)
L1d L#40 (16KB) + Core L#40 + PU L#40 (P#40)
L1d L#41 (16KB) + Core L#41 + PU L#41 (P#41)
L2 L#21 (2048KB) + L1i L#21 (64KB)
L1d L#42 (16KB) + Core L#42 + PU L#42 (P#42)
L1d L#43 (16KB) + Core L#43 + PU L#43 (P#43)
L2 L#22 (2048KB) + L1i L#22 (64KB)
L1d L#44 (16KB) + Core L#44 + PU L#44 (P#44)
L1d L#45 (16KB) + Core L#45 + PU L#45 (P#45)
L2 L#23 (2048KB) + L1i L#23 (64KB)
L1d L#46 (16KB) + Core L#46 + PU L#46 (P#46)
L1d L#47 (16KB) + Core L#47 + PU L#47 (P#47)
Socket L#3 (64GB)
NUMANode L#6 (P#6 32GB) + L3 L#6 (6144KB)
L2 L#24 (2048KB) + L1i L#24 (64KB)
L1d L#48 (16KB) + Core L#48 + PU L#48 (P#48)
L1d L#49 (16KB) + Core L#49 + PU L#49 (P#49)
L2 L#25 (2048KB) + L1i L#25 (64KB)
L1d L#50 (16KB) + Core L#50 + PU L#50 (P#50)
L1d L#51 (16KB) + Core L#51 + PU L#51 (P#51)
L2 L#26 (2048KB) + L1i L#26 (64KB)
L1d L#52 (16KB) + Core L#52 + PU L#52 (P#52)
L1d L#53 (16KB) + Core L#53 + PU L#53 (P#53)
L2 L#27 (2048KB) + L1i L#27 (64KB)
L1d L#54 (16KB) + Core L#54 + PU L#54 (P#54)
L1d L#55 (16KB) + Core L#55 + PU L#55 (P#55)
NUMANode L#7 (P#7 32GB) + L3 L#7 (6144KB)
L2 L#28 (2048KB) + L1i L#28 (64KB)
L1d L#56 (16KB) + Core L#56 + PU L#56 (P#56)
L1d L#57 (16KB) + Core L#57 + PU L#57 (P#57)
L2 L#29 (2048KB) + L1i L#29 (64KB)
L1d L#58 (16KB) + Core L#58 + PU L#58 (P#58)
L1d L#59 (16KB) + Core L#59 + PU L#59 (P#59)
L2 L#30 (2048KB) + L1i L#30 (64KB)
L1d L#60 (16KB) + Core L#60 + PU L#60 (P#60)
L1d L#61 (16KB) + Core L#61 + PU L#61 (P#61)
L2 L#31 (2048KB) + L1i L#31 (64KB)
L1d L#62 (16KB) + Core L#62 + PU L#62 (P#62)
L1d L#63 (16KB) + Core L#63 + PU L#63 (P#63)
HostBridge L#0
PCIBridge
PCI 8086:10c9
Net L#0 "eth0"
PCI 8086:10c9
Net L#1 "eth1"
PCI 1002:4390
Block L#2 "sda"
PCI 1002:439c
PCIBridge
PCI 1a03:2000
HostBridge L#3
PCIBridge
PCI 15b3:1003
Net L#3 "ib0"
OpenFabrics L#4 "mlx4_0"
Keywords
NUMA, thread binding, core binding, thread pinning