Return valid for all-nulls in reduce() with nunique include-nulls aggregation #19196

Open: wants to merge 2 commits into base branch-25.08

Conversation

davidwendt
Contributor

Description

Adds specialized handling of the nunique aggregation with the include-nulls setting for cudf::reduce() when the input column is all nulls. The result is now consistent with the cudf::distinct_count() result.
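
For readers unfamiliar with the include-nulls setting, the counting rule at issue can be sketched in plain C++ without libcudf. This is only an illustration: `nullable_column`, `null_handling`, and `count_distinct` are hypothetical names rather than libcudf APIs, and the sketch assumes that all nulls together count as a single distinct value when nulls are included.

```cpp
#include <cstddef>
#include <optional>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a nullable column: std::nullopt marks a null row.
using nullable_column = std::vector<std::optional<int>>;

// Hypothetical stand-in for the include/exclude-nulls setting.
enum class null_handling { INCLUDE, EXCLUDE };

// Counts distinct values. Assumption: with INCLUDE, all nulls together
// contribute exactly one extra distinct value; with EXCLUDE they are ignored.
std::size_t count_distinct(nullable_column const& col, null_handling nulls)
{
  std::unordered_set<int> seen;
  bool has_null = false;
  for (auto const& row : col) {
    if (row.has_value()) {
      seen.insert(*row);
    } else {
      has_null = true;
    }
  }
  return seen.size() + ((nulls == null_handling::INCLUDE && has_null) ? 1 : 0);
}
```

Under that assumption, an all-null column with nulls included has a well-defined count rather than no answer, which is the case this PR makes valid in cudf::reduce().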

Closes #19184

The reductions.cpp code was reworked with a utility function that handles the many empty/all-null cases for the various aggregations that cudf::reduce() supports. The aggregate-dispatcher call was also removed, since all aggregation kinds were already executed by a single functor operator with a switch statement.
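
The shape of such a utility can be sketched as a single function with a switch over the aggregation kind. This is not the code from reductions.cpp: `agg_kind`, `column_info`, and `reduce_no_data` are hypothetical names, and the result is modelled as a `std::optional` where an empty optional stands for a null scalar. The any/all values follow the documentation quoted later in this conversation. The handling of the remaining nunique cases is not shown in this conversation, so the sketch leaves them null as an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical subset of aggregation kinds, for illustration only.
enum class agg_kind { SUM, MIN, MAX, ANY, ALL, NUNIQUE };

// Hypothetical summary of the input column.
struct column_info {
  std::size_t size;
  std::size_t null_count;
};

// Result for an input with no valid rows (precondition: size == null_count).
// An empty optional models a null (invalid) scalar.
std::optional<std::int64_t> reduce_no_data(agg_kind kind, column_info col, bool include_nulls)
{
  switch (kind) {
    case agg_kind::ANY: return 0;  // false: no row is true
    case agg_kind::ALL: return 1;  // true: vacuously, every row is true
    case agg_kind::NUNIQUE:
      // All-null (non-empty) input with nulls included: null counts as one value.
      if (include_nulls && col.size > 0) { return 1; }
      return std::nullopt;  // assumption: other nunique cases stay null here
    default: return std::nullopt;  // all other reductions yield a null scalar
  }
}
```

Keeping every special case in one switch makes it easy to see at a glance which aggregations deviate from the default null result.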

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt self-assigned this Jun 18, 2025
@davidwendt requested a review from a team as a code owner June 18, 2025 15:35
@davidwendt added the feature request, 3 - Ready for Review, libcudf, and non-breaking labels Jun 18, 2025
data_type output_dtype,
std::optional<std::reference_wrapper<scalar const>> init,
rmm::cuda_stream_view stream,
rmm::device_async_resource_ref mr)
Contributor Author

The logic in this code has not changed. The functor operator() was simply converted to a regular function,
so the diff only reflects the code moving to the left (one less level of indentation).

column_view col,
data_type output_dtype,
rmm::cuda_stream_view stream,
rmm::device_async_resource_ref mr)
Contributor Author

This function just consolidates the many cases that were in the if-empty-all-nulls statement in the reduce() function below.
The main new change is the case NUNIQUE:

@TomAugspurger
Contributor

Thanks @davidwendt. Do you think the docs corresponding to https://docs.rapids.ai/api/cudf/stable/libcudf_docs/api_docs/aggregation_reduction/ need to be updated? Specifically the bit you pointed out yesterday:

> If the column is empty or contains all null entries (col.size()==col.null_count()), the output scalar value will be false for reduction type any and true for reduction type all. For all other reductions, the output scalar returns with is_valid()==false.

@davidwendt
Contributor Author

> Thanks @davidwendt. Do you think the docs corresponding to https://docs.rapids.ai/api/cudf/stable/libcudf_docs/api_docs/aggregation_reduction/ need to be updated? Specifically the bit you pointed out yesterday:

Yes, I could use some advice on that. I don't know about listing all the special cases there.
Most of the time we still return None/null, so I'm thinking of adding something vague like:

> For empty or all-null input, the result is generally a null scalar except for certain specific aggregations.

I suppose I could list the aggregations without specifically mentioning the result. Or put in a table with the results though that could get complicated.

@TomAugspurger
Contributor

I like your suggestion. Maybe modified slightly:

> For empty or all-null input, the result is generally a null scalar except for specific aggregations where the aggregation has a well-defined output for an empty input.

@davidwendt added the breaking label and removed the non-breaking label Jun 18, 2025
@davidwendt
Contributor Author

Marking this as a breaking change since the returned result has changed for one specific combination: all-null input with the nunique include-nulls aggregation.

Contributor

@vyasr left a comment

I think this approach makes sense. As far as docs go, a table would be nice, but it would probably get out of date. The fancy solution I would use for a Python library that doesn't require GPUs to run would be to inject the values during the doc build by executing the necessary code, but since that's not something we can easily do, I am fine sticking with what you have written.

@vuule
Contributor

vuule commented Jun 24, 2025

How come a test for the affected case is not included in this PR? (ignore if I missed a discussion about this)

@davidwendt
Contributor Author

> How come a test for the affected case is not included in this PR? (ignore if I missed a discussion about this)

You are right. I should include a gtest for this.

Labels
3 - Ready for Review, breaking, feature request, libcudf
Development

Successfully merging this pull request may close these issues.

[BUG]: Bug in pylibcudf.aggregation.nunique with all-null array.
5 participants