Commit 8d66ec0

Add the XHC (XPMEM-based Hierarchical Collectives) component

XHC implements hierarchical and topology-aware intra-node collectives, using
XPMEM for inter-process communication. See the README for more information.

Signed-off-by: George Katevenis <[email protected]>

1 parent bebfc41 · commit 8d66ec0

13 files changed: +5885 −0 lines

ompi/mca/coll/xhc/Makefile.am

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
#
# Copyright (c) 2021-2023 Computer Architecture and VLSI Systems (CARV)
#                         Laboratory, ICS Forth. All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#

dist_opaldata_DATA = help-coll-xhc.txt

sources = \
    coll_xhc.h \
    coll_xhc_atomic.h \
    coll_xhc.c \
    coll_xhc_component.c \
    coll_xhc_module.c \
    coll_xhc_bcast.c \
    coll_xhc_barrier.c \
    coll_xhc_reduce.c \
    coll_xhc_allreduce.c

# Make the output library in this directory, and name it either
# mca_<type>_<name>.la (for DSO builds) or libmca_<type>_<name>.la
# (for static builds).

component_noinst =
component_install =
if MCA_BUILD_ompi_coll_xhc_DSO
component_install += mca_coll_xhc.la
else
component_noinst += libmca_coll_xhc.la
endif

mcacomponentdir = $(ompilibdir)
mcacomponent_LTLIBRARIES = $(component_install)
mca_coll_xhc_la_SOURCES = $(sources)
mca_coll_xhc_la_LDFLAGS = -module -avoid-version
mca_coll_xhc_la_LIBADD = $(top_builddir)/ompi/lib@OMPI_LIBMPI_NAME@.la

noinst_LTLIBRARIES = $(component_noinst)
libmca_coll_xhc_la_SOURCES = $(sources)
libmca_coll_xhc_la_LDFLAGS = -module -avoid-version

ompi/mca/coll/xhc/README.md

Lines changed: 282 additions & 0 deletions
@@ -0,0 +1,282 @@
# XHC: XPMEM-based Hierarchical Collectives

The XHC component implements hierarchical & topology-aware intra-node MPI
collectives, utilizing XPMEM in order to achieve efficient shared address space
memory access between processes.

## Main features

* Constructs an **n-level hierarchy** (i.e. no algorithmic limitation on level
count), following the system's hardware topology. Ranks/processes are grouped
together according to their relative locations; this information is known
thanks to Hwloc, and is obtained via Open MPI's integrated book-keeping.

    Topological features that can currently be defined (configurable via MCA
    params):

    - NUMA node
    - CPU Socket
    - L1, L2, L3 cache
    - Hwthread, core
    - Node/flat (no hierarchy)

    Example of a 3-level XHC hierarchy (sensitivity to numa & socket locality):

    ![Example of 3-level XHC hierarchy](resources/xhc-hierarchy.svg)

    Furthermore, support for custom virtual user-defined hierarchies is
    available, to aid when fine-grained control over the communication pattern
    is necessary.

* Support for both **zero-copy** and **copy-in-copy-out (CICO)** data
transport.

    - Switchover at a static but configurable message size.

    - CICO buffers are permanently attached at module initialization.

    - Application buffers are attached on the fly the first time they appear,
    and are saved in and recovered from the registration cache on subsequent
    appearances. (assuming smsc/xpmem)

* Integration with Open MPI's `opal/smsc` (shared-memory-single-copy)
framework. Selection of `smsc/xpmem` is highly recommended.

    - Bcast support: XPMEM, CMA, KNEM
    - Allreduce support: XPMEM
    - Barrier support: *(all, irrelevant)*

* Data-wise **pipelining** across all levels of the hierarchy helps lower
hierarchy-induced start-up overheads. Pipelining also allows for interleaving
of operations in some collectives (e.g. reduce+bcast in allreduce).

* **Lock-free** single-writer synchronization, with cache-line separation where
necessary/beneficial. Consistency is ensured via lightweight memory barriers.

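As a purely conceptual illustration of this last point (not XHC's actual code;
see `coll_xhc_atomic.h` in this commit for the real primitives), a minimal
single-writer flag in plain C11 atomics might look as follows. The names
`xhc_flag_sketch_t`, `flag_publish` and `flag_wait` are hypothetical.

```c
/* Conceptual sketch only -- not XHC code. A single writer publishes its data
 * and then advances a sequence flag; readers wait on the flag. The flag is
 * padded to its own cache line to avoid false sharing with its neighbours. */
#include <stdatomic.h>
#include <stdint.h>

#define CACHE_LINE 64

typedef struct {
    _Alignas(CACHE_LINE) atomic_uint_fast64_t seq;
    char pad[CACHE_LINE - sizeof(atomic_uint_fast64_t)];
} xhc_flag_sketch_t;

/* Writer: the release store makes all of the writer's prior (payload) writes
 * visible to any reader that observes the new sequence value. */
static inline void flag_publish(xhc_flag_sketch_t *f, uint_fast64_t seq) {
    atomic_store_explicit(&f->seq, seq, memory_order_release);
}

/* Reader: spin until the writer has published at least `want`; the acquire
 * load pairs with the writer's release store above. */
static inline void flag_wait(xhc_flag_sketch_t *f, uint_fast64_t want) {
    while (atomic_load_explicit(&f->seq, memory_order_acquire) < want)
        ; /* busy-wait; a real implementation may back off or yield */
}
```
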
## Configuration options -- MCA params

XHC can be customized via a number of standard Open MPI MCA parameters. The
defaults in place should satisfy a wide range of systems.

The available parameters:

#### *(prepend with "coll_xhc_")*
*(list may be outdated, please also check `ompi_info` and `coll_xhc_component.c`)*

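For the authoritative list on a given installation, along with defaults and
descriptions, `ompi_info` can be queried directly, e.g.:

`$ ompi_info --param coll xhc --level 9`
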
* **priority** (default `0`): The priority of the coll/xhc component, used
during the component selection process.

* **print_info** (default `false`): Print information about XHC's generated
hierarchy and its configuration.

* **shmem_backing** (default `/dev/shm`): Backing directory for the shmem files
used for XHC's synchronization fields and CICO buffers.

* **dynamic_leader** (default `false`): Enables the feature that dynamically
elects an XHC group leader at each collective (currently only applicable to
bcast).

* **dynamic_reduce** (default `1` = `non-float`): Controls the feature that
allows for out-of-order reduction. XHC ranks reduce chunks directly from
multiple peers' buffers; dynamic reduction allows them to temporarily skip a
peer when the expected data is not yet prepared, instead of stalling. Setting
this to `2` (`all`) may harm the reproducibility of floating-point reductions.

* **lb_reduce_leader_assist** (default `top,first`): Controls the
leader-to-member load-balancing mode in reductions. If set to none/empty
(`""`), only non-leader group members perform reductions. With `top` in the
list, the leader of the top-most level also performs reductions in its group.
With `first` in the list, leaders help with the reduction workload for just one
chunk at the beginning of the operation. If `all` is specified, all group
members, including the leaders, perform reductions indiscriminately.

* **force_reduce** (default `false`): Force-enable the "special" Reduce
implementation for all calls to MPI_Reduce. This implementation assumes that
the `rbuf` parameter to MPI_Reduce is valid and appropriately sized for all
ranks, not just the root -- you have to make sure that this is indeed the case
for the application at hand. Only works with `root = 0`.

* **hierarchy** (default `"numa,socket"`): A comma-separated list of
topological features to which XHC's hierarchy-building algorithm should be
sensitive. `ompi_info` reports the possible values for the parameter.

    - In some ways, this is "just" a suggestion. The resulting hierarchy may
    not exactly match the requested one. Reasons this may occur:

        - A requested topological feature does not effectively segment the set
        of ranks. (e.g. `numa` was specified, but all ranks reside in the same
        NUMA node)

        - No feature that all ranks have in common was provided. This is a more
        intrinsic detail that you probably don't need to be aware of, but that
        you might come across if, e.g., you investigate the output of
        `print_info`. An additional level will automatically be added in this
        case; no need to worry about it.

          For all intents and purposes, a hierarchy of `numa,socket` is
          interpreted as "segment the ranks according to NUMA node locality,
          and then further segment them according to CPU socket locality".
          Three groups will be created: the intra-NUMA one, the intra-socket
          one, and an intra-node one.

    - The provided features will automatically be re-ordered when their order
    does not match their order in the physical system. (unless a virtual
    hierarchy feature is present in the list)

    - *Virtual Hierarchies*: The string may alternatively also contain "rank
    lists", which specify exactly which ranks to group together, as well as
    some other special modifiers. See
    `coll_xhc_component.c:xhc_component_parse_hierarchy()` for further
    explanation as well as syntax information.

* **chunk_size** (default `16K`): The chunk size for the pipelining process.
Data is processed (e.g. broadcast, reduced) in chunks of this size.

    - It's possible to have a different chunk size for each level of the
    hierarchy, by providing a comma-separated list of sizes (e.g.
    `"16K,16K,128K"`) instead of a single one. The sizes in this list *DO NOT*
    correspond one-to-one to the items in the hierarchy list; the hierarchy
    keys might be re-ordered or reduced to match the system, but the chunk
    sizes will be consumed in the order they are given, left-to-right ->
    bottom-to-top.

* **uniform_chunks** (default `true`): Automatically optimize the chunk size
in reduction collectives, according to the message size, so that all members
will perform equal work.

* **uniform_chunks_min** (default `1K`): The lowest allowed value for the chunk
size when uniform chunks are enabled. Each worker will reduce at least this
much data, or the workload is not split up at all.

* **cico_max** (default `1K`): Copy-in-copy-out, instead of single-copy, will
be used for messages of *cico_max* or fewer bytes.

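As an illustration of the syntax (the values below are arbitrary and chosen
only for demonstration), several of the above parameters can be combined on a
single command line:

`$ mpirun --mca coll_xhc_priority 100 --mca coll_xhc_cico_max 4K --mca coll_xhc_dynamic_reduce 2 --mca coll_xhc_lb_reduce_leader_assist all <application>`
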
*(Removed Parameters)*

* **rcache_max**, **rcache_max_global** *(REMOVED with the shift to
opal/smsc)*: Limit on the number of attachments that the registration cache
should hold.

    - A case can be made for their usefulness. If desired, they should be
    re-implemented at the smsc level.

## Limitations

- *Intra-node support only*
    - Usage in multi-node scenarios is possible via Open MPI's HAN (see the
    example command line after this list).

- **Heterogeneity**: XHC does not support nodes with non-uniform (rank-wise)
datatype representations. (determined according to the `proc_arch` field)

- **Non-commutative** operators are not supported by XHC's reduction
collectives. In past versions, they were supported, but only with the flat
hierarchy configuration; this could make a return at some point.

- XHC's Reduce is not fully complete. Instead, it is a "special" implementation
of MPI_Reduce, realized as a sub-case of XHC's Allreduce.

    - If the caller guarantees that the `rbuf` parameter is valid for all ranks
    (not just the root), as in Allreduce, this special Reduce can be invoked by
    specifying `root=-1`, which will trigger a Reduce to rank `0` (the only one
    currently supported).

        - Current prime use-case: HAN's Allreduce

    - Furthermore, if it is guaranteed that all Reduce calls in an application
    satisfy the above criteria, see the `force_reduce` MCA parameter.

    - XHC's Reduce is not yet fully optimized for small messages.

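For instance, a multi-node run combining HAN with XHC could look along these
lines (a sketch only; the HAN-related parameters shown, such as
`coll_han_priority`, depend on your Open MPI installation):

`$ mpirun --mca coll han,xhc,libnbc,basic --mca coll_han_priority 100 --mca coll_xhc_priority 90 <application>`
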
## Building

XHC is built as a standard mca/coll component.

To reap its full benefits, XPMEM support in Open MPI is required. XHC will
build and work without it, but the reduction operations will be disabled and
broadcast will fall back to less efficient mechanisms (CMA, KNEM).

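For reference, XPMEM support is typically enabled when configuring Open MPI,
e.g. via the `--with-xpmem` option (the install path below is only an example):

`$ ./configure --with-xpmem=/opt/xpmem ...`
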
## Running

In order for the XHC component to be chosen, make sure that its priority is
higher than that of other components that provide the collectives of interest;
use the `coll_xhc_priority` MCA parameter. If a list of collective modules is
included via the `coll` MCA parameter, make sure XHC is in the list.

* You may also want to add the `--bind-to core` option. Otherwise, the reported
process localities might be too general, preventing XHC from correctly
segmenting the system. (`coll_xhc_print_info` will report the generated
hierarchy)

### Tuning

* Optional: You might wish to manually specify the topological features that
XHC's hierarchy should conform to. The default is `numa,socket`, which will
group the processes according to NUMA locality and then further group them
according to socket locality. See the `coll_xhc_hierarchy` param.

    - Example: `--mca coll_xhc_hierarchy numa,socket`
    - Example: `--mca coll_xhc_hierarchy numa`
    - Example: `--mca coll_xhc_hierarchy flat`

    On some systems, small-message Broadcast or the Barrier operation might
    perform better with a flat tree instead of a hierarchical one. Currently,
    manual benchmarking is required to accurately determine this.

* Optional: You might wish to tune XHC's chunk size (default `16K`). Use the
`coll_xhc_chunk_size` param, try values close to the default, and see if
improvements are observed. You may even try specifying different chunk sizes
for each hierarchy level -- use the same process, starting from the same chunk
size for all levels and decreasing/increasing from there.

    - Example: `--mca coll_xhc_chunk_size 16K`
    - Example: `--mca coll_xhc_chunk_size 16K,32K,128K`

* Optional: If you wish to focus on the latencies of small messages, you can
try altering the cico-to-zcopy switchover point (`coll_xhc_cico_max`, default
`1K`).

    - Example: `--mca coll_xhc_cico_max 1K`

* Optional: If your application is heavy in Broadcast calls and you suspect
that specific ranks might be joining the collective late, causing others to
stall waiting for them, you could try enabling dynamic leadership
(`coll_xhc_dynamic_leader`) and see whether it yields an improvement.

    - Example: `--mca coll_xhc_dynamic_leader 1`

### Example command lines

*Assuming `PATH` and `LD_LIBRARY_PATH` have been set appropriately.*

Default XHC configuration:
`$ mpirun --mca coll libnbc,basic,xhc --mca coll_xhc_priority 100 --bind-to core <application>`

XHC with a NUMA-sensitive hierarchy and a chunk size of 16K:
`$ mpirun --mca coll libnbc,basic,xhc --mca coll_xhc_priority 100 --mca coll_xhc_hierarchy numa --mca coll_xhc_chunk_size 16K --bind-to core <application>`

XHC with a flat hierarchy (i.e. none at all):
`$ mpirun --mca coll libnbc,basic,xhc --mca coll_xhc_priority 100 --mca coll_xhc_hierarchy node [--bind-to core] <application>`

## Publications

1. **A framework for hierarchical single-copy MPI collectives on multicore nodes**,
*George Katevenis, Manolis Ploumidis, Manolis Marazakis*,
IEEE Cluster 2022, Heidelberg, Germany.
https://ieeexplore.ieee.org/document/9912729

## Contact

- George Katevenis ([email protected])
- Manolis Ploumidis ([email protected])

Computer Architecture and VLSI Systems (CARV) Laboratory, ICS Forth

## Acknowledgments

We thankfully acknowledge the support of the European Commission and the Greek
General Secretariat for Research and Innovation under the EuroHPC Programme
through the **DEEP-SEA** project (GA 955606). National contributions from the
involved member states (including the Greek General Secretariat for Research
and Innovation) match the EuroHPC funding.

This work is partly supported by project **EUPEX**, which has received funding
from the European High-Performance Computing Joint Undertaking (JU) under grant
agreement No 101033975. The JU receives support from the European Union's
Horizon 2020 research and innovation programme and France, Germany, Italy,
Greece, United Kingdom, Czech Republic, Croatia.
