-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-45269: [C++][Compute] Add "pivot_wider" and "hash_pivot_wider" functions #45562
base: main
Are you sure you want to change the base?
Conversation
@github-actions crossbow submit -g cpp |
This comment was marked as outdated.
This comment was marked as outdated.
// Move away the states and recreate them eagerly, to make sure that any error | ||
// below does not leave us with empty states. | ||
auto states = std::move(states_); | ||
states_.resize(kernels_.size()); | ||
if (!is_last) { | ||
RETURN_NOT_OK(ResetKernelStates()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without this change, any error in MergeAll
below could lead to a crash.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand this - shouldn't error in MergeAll
be bubbled up and stop execution? (Why does it lead to crash?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose it does not actually stop execution. @zanmato1984 might know more about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know... why a new scalar agg function makes an existing acero node having to change the internal logic?
Is there any test failing w/o this part of change? Maybe I can take a look and then tell if this change is right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any test where the merge step returns an error could be affected. I guess other hash-aggregate functions never fail in their merge step, so this was not exercised before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried, no luck, sorry. And from the code, the case PivotDuplicateValues
is using group by thus shouldn't be executing this code path. I can't speak for the validity of this change w/o truly understanding what the problem is.
I do have a suspect though: this might be race related. And if, a big if, this is the case, I'm afraid your change doesn't fix the race. However there are still many things unclear so please don't take this seriously until I have new findings.
Do you still hit the problem if you revert this part of the change? Or shall we try reverting it (maybe temporarily) to see if CI is good?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And from the code, the case PivotDuplicateValues is using group by thus shouldn't be executing this code path.
Update: I was wrong about this part - the test is indeed doing the scalar agg.
(But unfortunately this doesn't change rest of the my comment.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I can't seem to reproduce locally right now, though I did have to change this to have the tests pass at the time. I'm gonna revert temporarily so that we can check CI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do have a suspect though: this might be race related. And if, a big if, this is the case, I'm afraid your change doesn't fix the race. However there are still many things unclear so please don't take this seriously until I have new findings.
Update: Turns out my suspect of race won't happen actually.
So I'm at my wits end by now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, even with Valgrind I can't reproduce anymore.
Should we also add python binding for the option and call via pyarrow.compute? |
We can indeed, for the aggregate function that is. |
@github-actions crossbow submit -g cpp |
This comment was marked as outdated.
This comment was marked as outdated.
@github-actions crossbow submit -g cpp |
Revision: d97a9a7 Submitted crossbow builds: ursacomputing/crossbow @ actions-e51f9aecb5 |
I've addressed your review comments and also added Python bindings. @icexelloss @zanmato1984 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still looking at the implantation. Just a few questions about the function prototype so far.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still looking at the detailed code. Some comments so far.
Result<PivotWiderKeyIndex> KeyNotFound(std::string_view key_name) { | ||
if (unexpected_key_behavior_ == PivotWiderOptions::kIgnore) { | ||
return kNullPivotKey; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | |
} | |
DCHECK_EQ(unexpected_key_behavior_, PivotWiderOptions::kRaise); |
"Each pivot key decides in which output field the corresponding pivot value\n" | ||
"is emitted. If a pivot key doesn't appear, null is emitted.\n" | ||
"If a pivot key appears twice, KeyError is raised.\n" | ||
"Behavior of unexpected pivot keys is controlled by PivotWiderOptions."), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may need to add some description about "duplicate values".
/// 2 | null | 13 | ||
/// 3 | 14 | null | ||
/// ``` | ||
class ARROW_EXPORT PivotWiderOptions : public FunctionOptions { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason why this is named PivotWider
? Will PivotWiden
or just Pivot
be more appropriate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, because there is already a node named "pivot_longer".
// regarding copyright ownership. The ASF licenses this file |
Besides, I find "pivot_wider" / "pivot_longer" more explanatory than "pivot" / "melt" or other alternatives.
// | ||
// We are going to compute: | ||
// - take_indices[key = 0] = [null, 0, null, null] | ||
// - take_indices[key = 1] = [3, 2, null, 2] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// - take_indices[key = 1] = [3, 2, null, 2] | |
// - take_indices[key = 1] = [3, 2, null, 1] |
// Populate the take_indices for each output column | ||
for (int64_t i = 0; i < values.length; ++i) { | ||
const PivotWiderKeyIndex key = keys[i]; | ||
const uint32_t group = groups[i]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can be moved into the following if
?
"All output struct fields have the same type as `pivot_values`.\n" | ||
"Each pivot key decides in which output field the corresponding pivot value\n" | ||
"is emitted. If a pivot key doesn't appear in a given group, null is emitted.\n" | ||
"If a pivot key appears twice in a given group, KeyError is raised.\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto, add description about "duplicate value".
Rationale for this change
Add "pivot wider" functionality such as in Pandas, through two dedicated functions:
Both functions take two arguments (the column of pivot keys and the column of pivot values) and require passing a
PivotWiderOptions
structure with the expected pivot keys, so as to determine the output Struct type.Are these changes tested?
Yes, by dedicated unit tests.
Are there any user-facing changes?
No, just new APIs.