Make use of pyarrow iter_batches #661

maxispeicher · 2021-04-24T10:08:16Z

Issue #660:

Description of changes:
When available, make use of the new pyarrow function iter_batches, but keep the old mechanism as a fallback for pyarrow < 3.0.0.

I've opted to check explicitly for the availability of iter_batches instead of using a try/except block. The for loop over the batches would otherwise sit inside the try, and I want to make sure the two parsing mechanisms don't get mixed up if an unexpected exception is raised partway through.
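For readers skimming the thread, the explicit availability check described above can be sketched roughly as follows. This is not the code from the PR: the helper name iter_chunks is hypothetical, and the row-group fallback is only a stand-in for the old mechanism, whose details are not shown in this conversation. The sketch assumes an object shaped like pyarrow.parquet.ParquetFile (iter_batches, num_row_groups, read_row_group).

```python
from typing import Any, Iterator


def iter_chunks(parquet_file: Any, batch_size: int = 65_536) -> Iterator[Any]:
    """Yield chunks from a ParquetFile-like object (hypothetical helper).

    The mechanism is chosen once, by feature detection, rather than by
    wrapping the loop in try/except.
    """
    # The check runs before any data is read, so an exception raised while
    # iterating cannot cause a silent switch to the other mechanism.
    if hasattr(parquet_file, "iter_batches"):
        # Newer pyarrow: stream record batches of at most batch_size rows.
        yield from parquet_file.iter_batches(batch_size=batch_size)
    else:
        # Stand-in fallback for older pyarrow (an assumption, not the PR's
        # actual fallback): read one row group at a time.
        for index in range(parquet_file.num_row_groups):
            yield parquet_file.read_row_group(index)
```

With a try/except around the loop, an AttributeError raised inside the loop body for an unrelated reason could trigger the fallback after some batches had already been yielded; the up-front check avoids that.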

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

jaidisido · 2021-04-24T10:22:32Z

AWS CodeBuild CI Report

CodeBuild project: GitHubCodeBuild8756EF16-sDRE8Pq0duHT
Commit ID: 04111ea
Result: FAILED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

jaidisido · 2021-04-24T12:59:21Z

AWS CodeBuild CI Report

CodeBuild project: GitHubCodeBuild8756EF16-sDRE8Pq0duHT
Commit ID: e03bfc1
Result: SUCCEEDED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

jaidisido

LGTM!

jaidisido · 2021-04-26T19:02:03Z

awswrangler/s3/_read_parquet.py

+    use_threads_flag: bool,
+) -> Iterator[pa.RecordBatch]:
+    if chunked is True:
+        batch_size = 65_536


I assume this is the default from the iter_batches method 👍🏼

maxispeicher · Apr 27, 2021

Yes, it's the pyarrow default 👍
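As a rough illustration of the branch under review, the mapping from the chunked argument to a batch size might look like the sketch below. Only the chunked is True case (65_536) appears in the diff excerpt; the helper name resolve_batch_size, the constant name, and the handling of integer values are assumptions made for illustration.

```python
from typing import Union

# Default batch_size of pyarrow's ParquetFile.iter_batches, as confirmed in
# the review thread.
PYARROW_DEFAULT_BATCH_SIZE = 65_536


def resolve_batch_size(chunked: Union[bool, int]) -> int:
    """Map a chunked argument to a batch size (hypothetical helper)."""
    if chunked is True:
        # Shown in the diff: chunked=True uses the pyarrow default.
        return PYARROW_DEFAULT_BATCH_SIZE
    # Assumption: a positive integer is taken as the number of rows per chunk.
    # bool is a subclass of int, so False has to be excluded explicitly.
    if isinstance(chunked, int) and not isinstance(chunked, bool) and chunked > 0:
        return chunked
    raise ValueError(f"chunked must be True or a positive integer, got {chunked!r}")
```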

Make use of pyarrow iter_batches

04111ea

maxispeicher marked this pull request as draft April 24, 2021 11:09

Refactor new code

e03bfc1

maxispeicher marked this pull request as ready for review April 24, 2021 16:06

jaidisido self-requested a review April 26, 2021 19:00

jaidisido assigned maxispeicher Apr 26, 2021

jaidisido added the enhancement, minor release, and ready to release labels Apr 26, 2021

jaidisido added this to the 2.8.0 milestone Apr 26, 2021

jaidisido approved these changes Apr 26, 2021


jaidisido merged commit b1c79e0 into aws:main Apr 26, 2021

jaidisido linked an issue Apr 26, 2021 that may be closed by this pull request

read_parquet method in chunks is reading entire dataset into memory #660

Closed

maxispeicher deleted the revise_parquet_chunking branch April 27, 2021 07:20
