Skip to content

GH-46771: [Python][C++] Implement pa.arange function to generate array sequences #46778

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

raulcd
Copy link
Member

@raulcd raulcd commented Jun 11, 2025

Rationale for this change

When slicing arrays with non-trivial steps we were using numpy.arange to generate the indices for take. As numpy is an optional dependency, implementing it via Python caused a performance penalty. Creating a pyarrow function to build our own ranges that mimics Python range or numpy arange is useful for that uses case and might also be useful for other use cases. Currently we only generate Array[Int64] we could potentially generate more types.

What changes are included in this PR?

provide a pa.arange function that allows us to generate indices when slicing arrays.

Are these changes tested?

Yes new tests added.

Are there any user-facing changes?

No but a new pyarrow.arange function has been added.

This comment was marked as outdated.

Copy link
Member Author

@raulcd raulcd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am unsure whether this is what you had in mind @pitrou but would appreciate some tips on possible improvements. I've been testing and it stills is slightly slower than the np.arange original implementation. With the original np.arange implementation:

In [2]: a = pa.array(np.arange(0, 2_000_000))

In [3]: %timeit a[slice(1, 1_000_000, 2)]
763 μs ± 4.28 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each

with the current PR one:

In [2]: a = pa.array(np.arange(0, 2_000_000))

In [3]: %timeit a[slice(1, 1_000_000, 2)]
1.08 ms ± 43.7 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

@@ -94,6 +95,23 @@ Result<uint16_t> PyFloat_AsHalf(PyObject* obj) {
}
}

Result<std::shared_ptr<Array>> Arange(int64_t start, int64_t stop, int64_t step) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this should go on src/arrow/python/helpers.cc but I am unsure on where should it belong, maybe on src/arrow/python/pyarrow.cc @pitrou what are your thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing obvious comes to mind. Perhaps a new file if it doesn't fit elsewhere?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've moved it to src/arrow/python/util.{h,cc}

@raulcd raulcd changed the title [Python][C++] Implement pa.arange function to generate array sequences GH-46771: [Python][C++] Implement pa.arange function to generate array sequences Jun 11, 2025
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Jun 11, 2025
Copy link

⚠️ GitHub issue #46771 has been automatically assigned in GitHub to PR creator.

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jun 11, 2025
@raulcd
Copy link
Member Author

raulcd commented Jun 11, 2025

It seems to be on-pair now:

In [2]: %timeit np.arange(1, 1_000_000, 2)
219 μs ± 10.6 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [3]: %timeit pa.arange(1, 1_000_000, 2)
212 μs ± 1.97 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

(10, 0, -2),
(0, 0), # Empty array
(0, 10, -1), # Empty array
(10, 0, 1), # Empty array
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are all cases where step is conveniently a divisor of the length, and where either start or stop is zero. Can you make these a bit more interesting and diverse?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to improve the tests scenarios with more interesting cases

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Jun 12, 2025
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Jun 12, 2025
@raulcd raulcd marked this pull request as ready for review June 13, 2025 07:30
@raulcd raulcd requested review from AlenkaF and rok as code owners June 13, 2025 07:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants