Problem with rbuf pointer in HAN's reduce #11650

Closed
@gkatev

Description

Hi, while testing the new HAN+XHC integration method, I came across an issue in HAN's reduce, in this part of the code:

if (up_rank == root_up_rank) {
    t->rbuf = (char *) t->rbuf + extent * t->seg_count;
}

This condition is true for non-root ranks on the same node as the root. For these ranks, however, rbuf was previously initialized to NULL, so the increment turns t->rbuf into a non-NULL pointer that does not point to valid memory. Later on, this influences the rbuf parameter passed to coll_reduce:

if (t->is_tmp_rbuf) {
    tmp_rbuf = (char*)t->rbuf + (next_seg % 2)*(extent * t->seg_count);
} else if (NULL != t->rbuf) {
    tmp_rbuf = (char*)t->rbuf + extent * t->seg_count;
}
t->low_comm->c_coll->coll_reduce((char *) t->sbuf + extent * t->seg_count,
                                 (char *) tmp_rbuf, tmp_count,
                                 t->dtype, t->op, t->root_low_rank, t->low_comm,
                                 t->low_comm->c_coll->coll_reduce_module);

This is a problem when XHC tries to detect whether rbuf is valid, because it does so by checking whether the pointer is NULL.
Related: #11552 and the discussion in #11418
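
To make the failure mode concrete, here is a minimal standalone sketch of how a NULL-based validity check gets fooled once a NULL buffer has been advanced. Both function names are hypothetical and are not taken from HAN or XHC. The advance is emulated with integer arithmetic on uintptr_t, because pointer arithmetic on a null pointer is undefined behavior in C. The sketch also assumes a mainstream platform where NULL converts to the integer 0.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the validity test described above:
 * a receive buffer counts as "valid" iff the pointer is non-NULL. */
static bool rbuf_looks_valid(const void *rbuf)
{
    return NULL != rbuf;
}

/* Emulates the unguarded per-segment advance. The offset is applied
 * through uintptr_t so that this sketch does not itself perform
 * pointer arithmetic on NULL. Assumes NULL converts to integer 0. */
static void *advance_unguarded(void *rbuf, size_t extent, size_t seg_count)
{
    return (void *) ((uintptr_t) rbuf + extent * seg_count);
}
```

With these definitions, rbuf_looks_valid(NULL) is false, but rbuf_looks_valid(advance_unguarded(NULL, 4, 1024)) is true on such platforms, even though the resulting pointer refers to no allocated memory.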

I'm posting this as an issue rather than a PR in the hope that someone will take it the last mile: I'm not fully sure what the desired fix is, or whether there are similar occurrences in other HAN collectives that should also be adjusted. Something like this does fix it:

diff --git a/ompi/mca/coll/han/coll_han_reduce.c b/ompi/mca/coll/han/coll_han_reduce.c
index aae17a21fc..cce681ae2e 100644
--- a/ompi/mca/coll/han/coll_han_reduce.c
+++ b/ompi/mca/coll/han/coll_han_reduce.c
@@ -173,7 +173,7 @@ mca_coll_han_reduce_intra(const void *sbuf,
         /* Setup up t_next_seg task arguments */
         t->cur_task = t_next_seg;
         t->sbuf = (char *) t->sbuf + extent * t->seg_count;
-        if (up_rank == root_up_rank) {
+        if (up_rank == root_up_rank && NULL != t->rbuf) {
             t->rbuf = (char *) t->rbuf + extent * t->seg_count;
         }
         t->cur_seg = t->cur_seg + 1;
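
If similar occurrences do turn up in other HAN collectives, the guard from the diff above could be factored into a small helper. The following is only a sketch of that idea: advance_segment is a hypothetical name that does not exist in HAN, and the parameter types (ptrdiff_t extent, int seg_count) are assumptions.

```c
#include <stddef.h>

/* Hypothetical helper, not part of HAN: advance a segment buffer by
 * one segment, but leave a NULL buffer as NULL so that later
 * NULL checks on it still work. */
static void *advance_segment(void *buf, ptrdiff_t extent, int seg_count)
{
    if (NULL == buf) {
        return NULL;
    }
    return (char *) buf + extent * seg_count;
}
```

Under these assumptions, the guarded update would read t->rbuf = advance_segment(t->rbuf, extent, t->seg_count), which behaves like the added NULL != t->rbuf condition for ranks that have no receive buffer.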
