You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Implement split_and_shuffle, as a building block for sort-shuffling
Implementing a distributed sort has the following components
(more complex things possibly exists, but I think this should be OK):
LocalSort -> SelectEquidistant/SplitPointValues -> BroadcastToAll
(Split points are selected by the number of partitions.)
The BroadcastToAll result will then contain:
1. The original data (by which we sorted, no need for all rows)
2. Two additional rows with (partition_id, local_row_id) to establish
a global order.
The `BroadcastToAll` result can then be sorted again and used to
figure out how to move each split around.
And for that step one needs something like the `split_and_pack()` to
replace the `partition_and_pack()` currently used for hash partitioning.
So with the above one should be able to implement a
ShuffleForSort(LocalSort, BroadcastToAll).
The final sort result would then be another local sort after shuffling.
(The needed broadcast is small, so I assume there is no need to do it
with rapidsmpf initially.)
The above approach guarantees that the result partitions are at most
a factor of two less balanced than the input (I can look up the
reference).
Signed-off-by: Sebastian Berg <[email protected]>
0 commit comments