# XHC: XPMEM-based Hierarchical Collectives

The XHC component implements hierarchical & topology-aware intra-node MPI
collectives, utilizing XPMEM to achieve efficient shared-address-space memory
access between processes.

## Main features

* Constructs an **n-level hierarchy** (i.e. no algorithmic limitation on level
count), following the system's hardware topology. Ranks/processes are grouped
together according to their relative locations; this information is known
thanks to hwloc, and is obtained via Open MPI's integrated book-keeping.

    Topological features that can currently be defined (configurable via MCA
    params):

    - NUMA node
    - CPU socket
    - L1, L2, L3 cache
    - Hwthread, core
    - Node/flat (no hierarchy)

    Example of a 3-level XHC hierarchy (sensitive to NUMA & socket locality):

    Furthermore, support for custom virtual user-defined hierarchies is
    available, to aid when fine-grained control over the communication pattern
    is necessary.

* Support for both **zero-copy** and **copy-in-copy-out** (CICO) data
transportation.

    - Switchover at a static but configurable message size.

    - CICO buffers are permanently attached at module initialization.

    - Application buffers are attached on the fly the first time they appear,
    and are saved in and recovered from the registration cache on subsequent
    appearances (assuming smsc/xpmem).

* Integration with Open MPI's `opal/smsc` (shared-memory single-copy)
framework. Selection of `smsc/xpmem` is highly recommended; see the example
command after this list.

    - Bcast support: XPMEM, CMA, KNEM
    - Allreduce support: XPMEM
    - Barrier support: *(all; the data transport mechanism is irrelevant)*

* Data-wise **pipelining** across all levels of the hierarchy lowers
hierarchy-induced start-up overheads. Pipelining also allows for interleaving
of operations in some collectives (reduce+bcast in allreduce).

* **Lock-free** single-writer synchronization, with cache-line separation where
necessary/beneficial. Consistency is ensured via lightweight memory barriers.
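
Forcing the XPMEM-based single-copy backend together with XHC can be done via
standard MCA component selection. This is only an illustration: `smsc/xpmem` is
normally auto-selected when XPMEM support is built in, and the `--mca smsc
xpmem` switch below is the generic MCA framework-selection mechanism, not an
XHC-specific option.

`$ mpirun --mca coll libnbc,basic,xhc --mca coll_xhc_priority 100 --mca smsc xpmem --bind-to core <application>`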

## Configuration options -- MCA params

XHC can be customized via a number of standard Open MPI MCA parameters, though
the defaults in place should satisfy a wide range of systems. A combined usage
example is given right after the parameter list.

The available parameters:

#### *(prepend with "coll_xhc_")*
*(list may be outdated, please also check `ompi_info` and `coll_xhc_component.c`)*
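
To see the up-to-date list of XHC's parameters, along with their descriptions
and current values, `ompi_info` can be queried. For example (assuming a
standard Open MPI installation; the maximum parameter level is requested so
that all variables are shown):

`$ ompi_info --param coll xhc --level 9`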

* **priority** (default `0`): The priority of the coll/xhc component, used
during the component selection process.

* **print_info** (default `false`): Print information about XHC's generated
hierarchy and its configuration.

* **shmem_backing** (default `/dev/shm`): Backing directory for the shmem files
used for XHC's synchronization fields and CICO buffers.

* **dynamic_leader** (default `false`): Enables the feature that dynamically
elects an XHC group leader at each collective (currently only applicable
to bcast).

* **dynamic_reduce** (default `1`=`non-float`): Controls the feature that
allows for out-of-order reduction. XHC ranks reduce chunks directly from
multiple peers' buffers; dynamic reduction allows them to temporarily skip a
peer when the expected data is not yet prepared, instead of stalling. Setting
it to `2`=`all` may harm the reproducibility of float-based reductions.

* **lb_reduce_leader_assist** (default `top,first`): Controls the
leader-to-member load balancing mode in reductions. If set to none/empty (`""`),
only non-leader group members perform reductions. With `top` in the list, the
leader of the top-most level also performs reductions in its group. With
`first` in the list, leaders help with the reduction workload for just one
chunk at the beginning of the operation. If `all` is specified, all group
members, including the leaders, perform reductions indiscriminately.

* **force_reduce** (default `false`): Force-enable the "special" Reduce
implementation for all calls to MPI_Reduce. This implementation assumes that
the `rbuf` parameter to MPI_Reduce is valid and appropriately sized for all
ranks, not just the root -- you have to make sure that this is indeed the case
for the application at hand. Only works with `root = 0`.

* **hierarchy** (default `"numa,socket"`): A comma-separated list of
topological features to which XHC's hierarchy-building algorithm should be
sensitive. `ompi_info` reports the possible values for the parameter.

    - In some ways, this is "just" a suggestion. The resulting hierarchy may
    not exactly match the requested one. Reasons this may occur:

        - A requested topological feature does not effectively segment the set
        of ranks. (E.g. `numa` was specified, but all ranks reside in the same
        NUMA node.)

        - No feature that all ranks have in common was provided. This is a
        more intrinsic detail that you probably don't need to be aware of, but
        that you might come across if, e.g., you investigate the output of
        `print_info`. An additional level will automatically be added in this
        case, so there is no need to worry about it.

            For all intents and purposes, a hierarchy of `numa,socket` is
            interpreted as "segment the ranks according to NUMA node locality,
            and then further segment them according to CPU socket locality".
            Three groups will be created: the intra-NUMA one, the intra-socket
            one, and an intra-node one.

        - The provided features will automatically be re-ordered when their
        order does not match their order in the physical system (unless a
        virtual hierarchy feature is present in the list).

    - *Virtual Hierarchies*: The string may alternatively also contain "rank
    lists" which specify exactly which ranks to group together, as well as
    some other special modifiers. See
    `coll_xhc_component.c:xhc_component_parse_hierarchy()` for further
    explanation as well as syntax information.

* **chunk_size** (default `16K`): The chunk size for the pipelining process.
Data is processed (e.g. broadcast, reduced) in pieces of this size at a time.

    - It's possible to have a different chunk size for each level of the
    hierarchy, achieved by providing a comma-separated list of sizes (e.g.
    `"16K,16K,128K"`) instead of a single one. The sizes in this list *DO NOT*
    correspond one-to-one to the items on the hierarchy list; the hierarchy
    keys might be re-ordered or reduced to match the system, but the chunk
    sizes will be consumed in the order they are given, left-to-right ->
    bottom-to-top.

* **uniform_chunks** (default `true`): Automatically optimize the chunk size
in reduction collectives, according to the message size, so that all members
will perform equal work.

* **uniform_chunks_min** (default `1K`): The lowest allowed value for the chunk
size when uniform chunks are enabled. Each worker will reduce at least this
much data, or the workload is not split up at all.

* **cico_max** (default `1K`): Copy-in-copy-out, instead of single-copy, will
be used for messages of *cico_max* or fewer bytes.

*(Removed Parameters)*

* **rcache_max**, **rcache_max_global** *(REMOVED with the shift to
opal/smsc)*: Limit on the number of attachments that the registration cache
should hold.

    - A case can be made about their usefulness. If desired, they should be
    re-implemented at the smsc level.
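
For illustration, a combined command line that explicitly sets several of the
above parameters (the values here are the documented defaults, with a raised
priority as in the example command lines further below):

`$ mpirun --mca coll libnbc,basic,xhc --mca coll_xhc_priority 100 --mca coll_xhc_hierarchy numa,socket --mca coll_xhc_chunk_size 16K --mca coll_xhc_cico_max 1K --mca coll_xhc_lb_reduce_leader_assist top,first --bind-to core <application>`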

## Limitations

- *Intra-node support only*
    - Usage in multi-node scenarios is possible via Open MPI's HAN.

- **Heterogeneity**: XHC does not support nodes with non-uniform (rank-wise)
datatype representations. (Determined according to the `proc_arch` field.)

- **Non-commutative** operators are not supported by XHC's reduction
collectives. In past versions, they were supported, but only with the flat
hierarchy configuration; this could make a return at some point.

- XHC's Reduce is not a complete stand-alone implementation. Instead, it is a
"special" implementation of MPI_Reduce, realized as a sub-case of XHC's
Allreduce.

    - If the caller guarantees that the `rbuf` parameter is valid for all
    ranks (not just the root), as in Allreduce, this special Reduce can be
    invoked by specifying `root=-1`, which will trigger a Reduce to rank `0`
    (the only root currently supported).

    - Current prime use-case: HAN's Allreduce.

    - Furthermore, if it is guaranteed that all Reduce calls in an application
    satisfy the above criteria, see the `force_reduce` MCA parameter (an
    example invocation follows this list).

    - XHC's Reduce is not yet fully optimized for small messages.
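
If the above criteria are known to hold for the whole application (i.e. a
valid, appropriately sized `rbuf` is passed to MPI_Reduce on every rank, and
`root` is always `0`), the special implementation can be force-enabled. A
sketch, not a generally safe setting:

`$ mpirun --mca coll libnbc,basic,xhc --mca coll_xhc_priority 100 --mca coll_xhc_force_reduce true --bind-to core <application>`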

## Building

XHC is built as a standard mca/coll component.

To reap its full benefits, XPMEM support in Open MPI is required. XHC will
build and work without it, but the reduction operations will be disabled and
broadcast will fall back to less efficient mechanisms (CMA, KNEM).
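
For reference, a minimal configure sketch with XPMEM enabled; the
`--with-xpmem` path is an assumption and should point to wherever XPMEM is
installed on the system:

`$ ./configure --prefix=<ompi-install-dir> --with-xpmem=/opt/xpmem ...`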

## Running

In order for the XHC component to be chosen, make sure that its priority is
higher than that of other components that provide the collectives of interest;
use the `coll_xhc_priority` MCA parameter. If a list of collective modules is
included via the `coll` MCA parameter, make sure XHC is in the list.

* You may also want to add the `--bind-to core` param. Otherwise, the reported
process localities might be too general, preventing XHC from correctly
segmenting the system. (`coll_xhc_print_info` will report the generated
hierarchy, as in the example below.)
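
For instance, to inspect the hierarchy that XHC generates under a given
binding policy (all parameters shown are documented above):

`$ mpirun --mca coll libnbc,basic,xhc --mca coll_xhc_priority 100 --mca coll_xhc_print_info true --bind-to core <application>`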

### Tuning

* Optional: You might wish to manually specify the topological features that
XHC's hierarchy should conform to. The default is `numa,socket`, which will
group the processes according to NUMA locality and then further group them
according to socket locality. See the `coll_xhc_hierarchy` param.

    - Example: `--mca coll_xhc_hierarchy numa,socket`
    - Example: `--mca coll_xhc_hierarchy numa`
    - Example: `--mca coll_xhc_hierarchy flat`

    In some systems, small-message Broadcast or the Barrier operation might
    perform better with a flat tree instead of a hierarchical one. Currently,
    manual benchmarking is required to accurately determine this (see the
    sketch after this list).

* Optional: You might wish to tune XHC's chunk size (default `16K`). Use the
`coll_xhc_chunk_size` param: try values close to the default and see if
improvements are observed. You may even try specifying different chunk sizes
for each hierarchy level -- use the same process, starting from the same chunk
size for all levels and decreasing/increasing from there.

    - Example: `--mca coll_xhc_chunk_size 16K`
    - Example: `--mca coll_xhc_chunk_size 16K,32K,128K`

* Optional: If you wish to focus on the latencies of small messages, you can
try altering the cico-to-zero-copy switchover point (`coll_xhc_cico_max`,
default `1K`).

    - Example: `--mca coll_xhc_cico_max 1K`

* Optional: If your application is heavy in Broadcast calls and you suspect
that specific ranks might be joining the collective with a delay, causing
others to stall waiting for them, you could try enabling dynamic leadership
(`coll_xhc_dynamic_leader`) and see if it yields an improvement.

    - Example: `--mca coll_xhc_dynamic_leader 1`
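
One way to carry out such benchmarking is with a standard micro-benchmark
suite. The sketch below assumes the OSU Micro-Benchmarks (`osu_bcast`) are
available in the PATH; it merely illustrates the flat-vs-hierarchical
comparison and is not a definitive methodology:

`$ mpirun --mca coll libnbc,basic,xhc --mca coll_xhc_priority 100 --bind-to core osu_bcast`

`$ mpirun --mca coll libnbc,basic,xhc --mca coll_xhc_priority 100 --mca coll_xhc_hierarchy flat --bind-to core osu_bcast`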

### Example command lines

*Assuming `PATH` and `LD_LIBRARY_PATH` have been set appropriately.*

Default XHC configuration:
`$ mpirun --mca coll libnbc,basic,xhc --mca coll_xhc_priority 100 --bind-to core <application>`

XHC w/ NUMA-sensitive hierarchy, chunk size @ 16K:
`$ mpirun --mca coll libnbc,basic,xhc --mca coll_xhc_priority 100 --mca coll_xhc_hierarchy numa --mca coll_xhc_chunk_size 16K --bind-to core <application>`

XHC with flat hierarchy (i.e. none at all):
`$ mpirun --mca coll libnbc,basic,xhc --mca coll_xhc_priority 100 --mca coll_xhc_hierarchy node [--bind-to core] <application>`

## Publications

1. **A framework for hierarchical single-copy MPI collectives on multicore nodes**,
*George Katevenis, Manolis Ploumidis, Manolis Marazakis*,
IEEE Cluster 2022, Heidelberg, Germany.
https://ieeexplore.ieee.org/document/9912729

## Contact

- George Katevenis ([email protected])
- Manolis Ploumidis ([email protected])

Computer Architecture and VLSI Systems (CARV) Laboratory, ICS Forth

## Acknowledgments

We thankfully acknowledge the support of the European Commission and the Greek
General Secretariat for Research and Innovation under the EuroHPC Programme
through the **DEEP-SEA** project (GA 955606). National contributions from the
involved state members (including the Greek General Secretariat for Research
and Innovation) match the EuroHPC funding.

This work is partly supported by project **EUPEX**, which has received funding
from the European High-Performance Computing Joint Undertaking (JU) under grant
agreement No 101033975. The JU receives support from the European Union's
Horizon 2020 research and innovation programme and France, Germany, Italy,
Greece, United Kingdom, Czech Republic, Croatia.