Description
Hi, while testing the new HAN+XHC integration method, I came upon a bug/issue in HAN's reduce, in this part of the code:
ompi/ompi/mca/coll/han/coll_han_reduce.c
Lines 176 to 178 in 9216ad4
This condition will be true for non-root ranks on the same node as the root. But for these ranks, rbuf has previously been initialized to NULL. Thus the segment offset gets added to NULL, and t->rbuf ends up non-NULL without pointing to valid memory. Later on, this influences the rbuf parameter passed to coll_reduce:
ompi/ompi/mca/coll/han/coll_han_reduce.c
Lines 248 to 256 in 9216ad4
This is a problem when trying to detect (in XHC) if rbuf is valid or not, as this is done by checking if the pointer is NULL.
Related: #11552 and the discussion in #11418
I'm posting this as an issue rather than a PR, in the hope that someone will take it through the last mile: I'm not fully sure what the desired fix would be, or whether there are similar occurrences in other HAN collectives that should also be adjusted. Something like this does fix it:
diff --git a/ompi/mca/coll/han/coll_han_reduce.c b/ompi/mca/coll/han/coll_han_reduce.c
index aae17a21fc..cce681ae2e 100644
--- a/ompi/mca/coll/han/coll_han_reduce.c
+++ b/ompi/mca/coll/han/coll_han_reduce.c
@@ -173,7 +173,7 @@ mca_coll_han_reduce_intra(const void *sbuf,
         /* Setup up t_next_seg task arguments */
         t->cur_task = t_next_seg;
         t->sbuf = (char *) t->sbuf + extent * t->seg_count;
-        if (up_rank == root_up_rank) {
+        if (up_rank == root_up_rank && NULL != t->rbuf) {
             t->rbuf = (char *) t->rbuf + extent * t->seg_count;
         }
         t->cur_seg = t->cur_seg + 1;
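
To illustrate the idea outside of HAN, here is a minimal standalone sketch of the guarded advance. The names advance_rbuf and rbuf_is_valid are hypothetical and exist only for this example; they are not HAN or XHC functions, and the real fix may well look different. Note that the unguarded version performs pointer arithmetic on NULL, which is undefined behavior in C and in practice just yields a small non-NULL address.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of HAN): advance a receive buffer to the
 * next segment, but leave NULL untouched so that later NULL checks still
 * identify ranks that have no receive buffer. */
static void *advance_rbuf(void *rbuf, ptrdiff_t extent, int seg_count)
{
    assert(seg_count >= 0);
    if (NULL == rbuf) {
        return NULL; /* no receive buffer on this rank; keep it NULL */
    }
    return (char *) rbuf + extent * seg_count;
}

/* Hypothetical stand-in for the kind of check described above for XHC:
 * a buffer is treated as valid only if the pointer is non-NULL. */
static int rbuf_is_valid(const void *rbuf)
{
    return NULL != rbuf;
}
```

With this shape, a rank that started with rbuf == NULL still has a NULL rbuf after any number of segments, so a NULL check downstream keeps working, while the root's buffer advances by extent * seg_count bytes per segment as before.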