Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-45269: [C++][Compute] Add "pivot_wider" and "hash_pivot_wider" functions #45562

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

pitrou
Copy link
Member

@pitrou pitrou commented Feb 18, 2025

Rationale for this change

Add "pivot wider" functionality such as in Pandas, through two dedicated functions:

  1. a "pivot_wider" scalar aggregate function returning a Struct scalar
  2. a "hash_pivot_wider" grouped aggregate function returning a Struct array

Both functions take two arguments (the column of pivot keys and the column of pivot values) and require passing a PivotWiderOptions structure with the expected pivot keys, so as to determine the output Struct type.

Are these changes tested?

Yes, by dedicated unit tests.

Are there any user-facing changes?

No, just new APIs.

@pitrou pitrou marked this pull request as ready for review February 18, 2025 12:52
@pitrou pitrou requested a review from westonpace as a code owner February 18, 2025 12:52
@pitrou pitrou requested review from zanmato1984 and removed request for westonpace February 18, 2025 12:52
@pitrou
Copy link
Member Author

pitrou commented Feb 18, 2025

@github-actions crossbow submit -g cpp

This comment was marked as outdated.

Comment on lines 297 to 303
// Move away the states and recreate them eagerly, to make sure that any error
// below does not leave us with empty states.
auto states = std::move(states_);
states_.resize(kernels_.size());
if (!is_last) {
RETURN_NOT_OK(ResetKernelStates());
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this change, any error in MergeAll below could lead to a crash.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand this - shouldn't error in MergeAll be bubbled up and stop execution? (Why does it lead to crash?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose it does not actually stop execution. @zanmato1984 might know more about this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know... why a new scalar agg function makes an existing acero node having to change the internal logic?

Is there any test failing w/o this part of change? Maybe I can take a look and then tell if this change is right.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any test where the merge step returns an error could be affected. I guess other hash-aggregate functions never fail in their merge step, so this was not exercised before.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried, no luck, sorry. And from the code, the case PivotDuplicateValues is using group by thus shouldn't be executing this code path. I can't speak for the validity of this change w/o truly understanding what the problem is.

I do have a suspect though: this might be race related. And if, a big if, this is the case, I'm afraid your change doesn't fix the race. However there are still many things unclear so please don't take this seriously until I have new findings.

Do you still hit the problem if you revert this part of the change? Or shall we try reverting it (maybe temporarily) to see if CI is good?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And from the code, the case PivotDuplicateValues is using group by thus shouldn't be executing this code path.

Update: I was wrong about this part - the test is indeed doing the scalar agg.

(But unfortunately this doesn't change rest of the my comment.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I can't seem to reproduce locally right now, though I did have to change this to have the tests pass at the time. I'm gonna revert temporarily so that we can check CI.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do have a suspect though: this might be race related. And if, a big if, this is the case, I'm afraid your change doesn't fix the race. However there are still many things unclear so please don't take this seriously until I have new findings.

Update: Turns out my suspect of race won't happen actually.

So I'm at my wits end by now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, even with Valgrind I can't reproduce anymore.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Feb 18, 2025
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Feb 25, 2025
@icexelloss
Copy link
Contributor

Should we also add python binding for the option and call via pyarrow.compute?

@pitrou
Copy link
Member Author

pitrou commented Feb 25, 2025

Should we also add python binding for the option and call via pyarrow.compute?

We can indeed, for the aggregate function that is.

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Feb 26, 2025
@zanmato1984
Copy link
Contributor

@github-actions crossbow submit -g cpp

This comment was marked as outdated.

@pitrou
Copy link
Member Author

pitrou commented Feb 26, 2025

@github-actions crossbow submit -g cpp

Copy link

Revision: d97a9a7

Submitted crossbow builds: ursacomputing/crossbow @ actions-e51f9aecb5

Task Status
example-cpp-minimal-build-static GitHub Actions
example-cpp-minimal-build-static-system-dependency GitHub Actions
example-cpp-tutorial GitHub Actions
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-meson GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-bundled GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-bundled-offline GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions
test-ubuntu-24.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-24.04-cpp-thread-sanitizer GitHub Actions

@pitrou
Copy link
Member Author

pitrou commented Feb 26, 2025

I've addressed your review comments and also added Python bindings. @icexelloss @zanmato1984

Copy link
Contributor

@zanmato1984 zanmato1984 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still looking at the implantation. Just a few questions about the function prototype so far.

Copy link
Contributor

@zanmato1984 zanmato1984 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still looking at the detailed code. Some comments so far.

Result<PivotWiderKeyIndex> KeyNotFound(std::string_view key_name) {
if (unexpected_key_behavior_ == PivotWiderOptions::kIgnore) {
return kNullPivotKey;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
}
}
DCHECK_EQ(unexpected_key_behavior_, PivotWiderOptions::kRaise);

"Each pivot key decides in which output field the corresponding pivot value\n"
"is emitted. If a pivot key doesn't appear, null is emitted.\n"
"If a pivot key appears twice, KeyError is raised.\n"
"Behavior of unexpected pivot keys is controlled by PivotWiderOptions."),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may need to add some description about "duplicate values".

/// 2 | null | 13
/// 3 | 14 | null
/// ```
class ARROW_EXPORT PivotWiderOptions : public FunctionOptions {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why this is named PivotWider? Will PivotWiden or just Pivot be more appropriate?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, because there is already a node named "pivot_longer".

// regarding copyright ownership. The ASF licenses this file

Besides, I find "pivot_wider" / "pivot_longer" more explanatory than "pivot" / "melt" or other alternatives.

//
// We are going to compute:
// - take_indices[key = 0] = [null, 0, null, null]
// - take_indices[key = 1] = [3, 2, null, 2]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// - take_indices[key = 1] = [3, 2, null, 2]
// - take_indices[key = 1] = [3, 2, null, 1]

// Populate the take_indices for each output column
for (int64_t i = 0; i < values.length; ++i) {
const PivotWiderKeyIndex key = keys[i];
const uint32_t group = groups[i];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be moved into the following if?

"All output struct fields have the same type as `pivot_values`.\n"
"Each pivot key decides in which output field the corresponding pivot value\n"
"is emitted. If a pivot key doesn't appear in a given group, null is emitted.\n"
"If a pivot key appears twice in a given group, KeyError is raised.\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto, add description about "duplicate value".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants