-
Notifications
You must be signed in to change notification settings - Fork 3.7k
GH-46771: [Python][C++] Implement pa.arange function to generate array sequences #46778
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am unsure whether this is what you had in mind @pitrou but would appreciate some tips on possible improvements. I've been testing and it stills is slightly slower than the np.arange
original implementation. With the original np.arange
implementation:
In [2]: a = pa.array(np.arange(0, 2_000_000))
In [3]: %timeit a[slice(1, 1_000_000, 2)]
763 μs ± 4.28 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each
with the current PR one:
In [2]: a = pa.array(np.arange(0, 2_000_000))
In [3]: %timeit a[slice(1, 1_000_000, 2)]
1.08 ms ± 43.7 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
@@ -94,6 +95,23 @@ Result<uint16_t> PyFloat_AsHalf(PyObject* obj) { | |||
} | |||
} | |||
|
|||
Result<std::shared_ptr<Array>> Arange(int64_t start, int64_t stop, int64_t step) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this should go on src/arrow/python/helpers.cc
but I am unsure on where should it belong, maybe on src/arrow/python/pyarrow.cc
@pitrou what are your thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing obvious comes to mind. Perhaps a new file if it doesn't fit elsewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've moved it to src/arrow/python/util.{h,cc}
|
It seems to be on-pair now: In [2]: %timeit np.arange(1, 1_000_000, 2)
219 μs ± 10.6 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [3]: %timeit pa.arange(1, 1_000_000, 2)
212 μs ± 1.97 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each) |
python/pyarrow/tests/test_array.py
Outdated
(10, 0, -2), | ||
(0, 0), # Empty array | ||
(0, 10, -1), # Empty array | ||
(10, 0, 1), # Empty array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are all cases where step
is conveniently a divisor of the length, and where either start
or stop
is zero. Can you make these a bit more interesting and diverse?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to improve the tests scenarios with more interesting cases
…e interesting test cases
Rationale for this change
When slicing arrays with non-trivial steps we were using
numpy.arange
to generate the indices for take. As numpy is an optional dependency, implementing it via Python caused a performance penalty. Creating a pyarrow function to build our own ranges that mimics Python range or numpy arange is useful for that uses case and might also be useful for other use cases. Currently we only generateArray[Int64]
we could potentially generate more types.What changes are included in this PR?
provide a
pa.arange
function that allows us to generate indices when slicing arrays.Are these changes tested?
Yes new tests added.
Are there any user-facing changes?
No but a new pyarrow.arange function has been added.