Basic sorting support with Dask #256


Merged · 20 commits into rapidsai:branch-25.06 on May 8, 2025

Conversation

@rjzamora (Member) commented May 7, 2025

Follow up to #249

  • Adds minimal changes to support sorting with a DaskIntegration protocol.
  • Does not bother to implement/demonstrate anything beyond an ascending sort by a single column.
  • Adds sort_boundaries and options arguments to DaskIntegration.insert_partition and adds an options argument to DaskIntegration.extract_partition
    • The sort_boundaries must be a separate positional argument so that it can be generated dynamically as the output of a separate task. All other sorting options can be passed through with options (e.g. ascending vs descending and null precedence).

This technically breaks existing cudf-polars code, so we will want to update the DaskIntegration protocol defined in cudf asap after this is finished/merged.

@rjzamora rjzamora self-assigned this May 7, 2025
@rjzamora rjzamora requested a review from a team as a code owner May 7, 2025 22:31
@rjzamora rjzamora added breaking Introduces a breaking change feature request New feature or request labels May 7, 2025
sort_boundaries
Output partition boundaries for sorting.
options
Optional key-work arguments.
Contributor
Suggested change
Optional key-work arguments.
Optional key-word arguments.

Contributor

And we use "Additional options." in a few places below. Let's pick one description and copy it through.

Member Author

Ah yeah, missed this one - Let's do the simple "Additional options." for now.

Contributor

Would it make sense to make it DaskIntegration(on=, sort_boundaries=)? Or would that obfuscate/impede the way we build dask graphs here?

More of a question, as it would side-step the immediate need for breaking things (not that it matters) and might avoid the options catch-all.

Contributor

I think it would be good to pass in the partition id (pid) of df here for sorting. That is needed to find the right splits if you want to balance the result partition sizes in degenerate cases (such as all values being equal).
And I believe we don't have another way to pass it in.

Member Author

@seberg - I updated/generalized the protocol a bit. I didn't include the input partition id as a required argument, but we can add that now that we are changing things. Can you explain how having the input partition id would help you handle degenerate values?

@seberg (Contributor) May 8, 2025

Basically, the idea is that the split_boundary values know which partition ID they came from (and ideally their local row).

For example, we split (1, 1, 1, 1), distributed as pid0=(1, 1) and pid1=(1, 1).
If you have the pid and row, then the split boundary will be (value, pid=1, row=0).

With that pid information, you can figure out here that pid=0 should send its data to 0 (split after the boundary) and pid=1 should send it all to 1 (split before the boundary here).

Without the additional information, there is no choice but for both pids to send all data to 0.
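A minimal pure-Python sketch of this idea (the function name and the tuple encoding are illustrative, not rapidsmpf APIs): tagging the boundary with its source pid and row gives every partition a total order to split on, even when all values are equal.

```python
def split_point(values, pid, boundary):
    """Count rows that sort strictly below the (value, pid, row) boundary key."""
    b_val, b_pid, b_row = boundary
    return sum(
        1 for row, v in enumerate(values) if (v, pid, row) < (b_val, b_pid, b_row)
    )

# All values equal: pid0=(1, 1), pid1=(1, 1); the boundary came from pid=1, row=0.
boundary = (1, 1, 0)
split_point([1, 1], pid=0, boundary=boundary)  # → 2: pid0 sends everything to output 0
split_point([1, 1], pid=1, boundary=boundary)  # → 0: pid1 sends everything to output 1
```

Comparing only values, both partitions would split at 0 and send all rows to output 0, which is the imbalance described above.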

Member Author

Okay, I see. This case definitely isn't a high priority yet (dask-dataframe still doesn't attempt to handle this at all), but it's a good enough reason to include partition_id as a required argument to insert_partition now that we are updating the protocol anyway.

@seberg (Contributor) left a comment

Looks good to me, although my Dask eye isn't keen. Two small comments that may or may not be relevant.

)
else:
    df = df.sort_values(on)
splits = df[on[0]].searchsorted(sort_boundaries, side="right")
@seberg (Contributor) May 8, 2025

N.B. (I assume you are aware, and at most worth a code comment): Good for an example but it only works if values in sort_boundaries are unique in df. Otherwise you need to adjust for where the boundary value came from. Thus the longer function I shared.

EDIT: Sorry, this is not as bad as I first recalled. As it is only needed to avoid large imbalances in the result partition sizes.
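As a concrete illustration of both the snippet and the caveat, here is a tiny pandas sketch with made-up values:

```python
import pandas as pd

on = ["a"]
sort_boundaries = [1, 2]  # upper boundary values for output partitions 0 and 1

df = pd.DataFrame({"a": [3, 1, 2, 1]}).sort_values(on)  # "a" -> [1, 1, 2, 3]
splits = df[on[0]].searchsorted(sort_boundaries, side="right")
# splits -> [2, 3]: rows [0:2] go to partition 0, [2:3] to 1, [3:] to 2.
# With side="right", every row equal to a duplicated boundary value lands in
# the earlier partition, which is the imbalance described above.
```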

Output partition boundaries for sorting. If None,
hashing will be used to calculate output partitions.
options
Additional options.
"""
Contributor

I can imagine that we might eventually want more all-to-all-like patterns. Would it make more sense to change this interface such that insert_partition just takes the list[PackedData] and the shuffler and we provide separate functions for hash and sort-based partitioning (and the user can bring their own).

So something like:

def insert_partition(
    shuffler: Shuffler,
    chunks: Sequence[PackedData], # Or whatever it is
) -> None:

And we provide two builtin functions

def hash_partition(df, partition_count, *, on) -> list[PackedData]:
    ...
def sort_partition(df, partition_count, *, by) -> list[PackedData]:
    ...

Member Author

I don't think that helps us generalize at all. We already have Shuffler.insert_chunks, which is essentially the insert_partition function you are proposing. The purpose of DaskIntegration.insert_partition is to avoid the need for various Dask shuffling applications to write their own task graph.

We want insert_partition/extract_partition to include the minimal necessary arguments to construct a "general" shuffling task graph. Since we are revising things, this may be:

    @staticmethod
    def insert_partition(
        df: DataFrameT,  # Partition to insert
        partition_count: int,   # Output partition count
        shuffler: Shuffler,   # Shuffler object
        options: dict[str, Any] | None,   # Arbitrary keyword arguments
        *other: Any,   # "Other" task-output data (e.g. sorting boundaries/quantiles)
    ) -> None:

    @staticmethod
    def extract_partition(
        partition_id: int,  # Partition ID to extract
        shuffler: Shuffler,   # Shuffler object
        options: dict[str, Any] | None,   # Arbitrary keyword arguments
    ) -> DataFrameT:

I think the options argument can be used to control most variations of a shuffle, and the *other positional argument could be used to pass in information that must be calculated dynamically at execution time.
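A toy round trip exercising that proposed signature (FakeShuffler and the list-of-dicts "partition" are stand-ins for illustration only; rapidsmpf's real Shuffler and PackedData operate on packed tables):

```python
from typing import Any

class FakeShuffler:
    """Stand-in for rapidsmpf's Shuffler, just to exercise the signature."""
    def __init__(self) -> None:
        self.chunks: dict[int, list] = {}

    def insert_chunks(self, chunks: dict[int, Any]) -> None:
        for pid, chunk in chunks.items():
            self.chunks.setdefault(pid, []).extend(chunk)

def insert_partition(df, partition_count, shuffler, options, *other):
    # Hash-partition rows on the column named in the options catch-all.
    on = (options or {}).get("on", "key")
    chunks: dict[int, list] = {}
    for row in df:
        chunks.setdefault(hash(row[on]) % partition_count, []).append(row)
    shuffler.insert_chunks(chunks)

shuffler = FakeShuffler()
insert_partition([{"key": 0}, {"key": 1}, {"key": 2}], 2, shuffler, {"on": "key"})
```

A sort-based variant would instead consume *other (the dynamically computed boundaries) to pick each row's output partition, with everything else unchanged.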

Contributor

Ah, ok, carry on then

@TomAugspurger (Contributor) left a comment

I think this should be good. I think our quickstart example will need to be updated for the new keyword.

@wence-'s comment in https://github.com/rapidsai/rapidsmpf/pull/256/files#r2079496979 is worth resolving one way or another. Aside from churn, I think it can be handled later, but I don't have a strong opinion on it.

@rjzamora (Member Author) commented May 8, 2025

Thanks @TomAugspurger

@wence-'s comment in https://github.com/rapidsai/rapidsmpf/pull/256/files#r2079496979 is worth resolving one way or another. Aside from churn, I think it can be handled later, but I don't have a strong opinion on it.

Could you summarize what still needs to be resolved, and maybe we can file an issue? As far as I can tell, the suggestions from that comment already exist in rapidsmpf (via DaskIntegration.insert_partition, split_and_pack, and partition_and_pack) - That said, I'm probably missing something.

@TomAugspurger (Contributor)

Yep that makes sense to me. Let's go ahead and merge this and open a followup issue if @wence- has anything to add.

@rjzamora (Member Author) commented May 8, 2025

/merge

@rapids-bot rapids-bot bot merged commit 6877cb6 into rapidsai:branch-25.06 May 8, 2025
23 checks passed
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this pull request May 8, 2025
Teeing up this "fix" for the proposed change in rapidsai/rapidsmpf#256

Once that PR is merged, we will want to get this in asap to keep `rapidsmpf` shuffling from breaking. We can update `Sort` in a follow-up PR.

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Tom Augspurger (https://github.com/TomAugspurger)

URL: #18720
@rjzamora rjzamora deleted the dask-sort branch May 8, 2025 22:01
Labels
breaking Introduces a breaking change feature request New feature or request
4 participants