Make use of pyarrow iter_batches #661

Merged
merged 2 commits into aws:main from revise_parquet_chunking
Apr 26, 2021

maxispeicher
Contributor

Issue #660:

Description of changes:
When available, use the new pyarrow function iter_batches, but keep the old mechanism as a fallback for pyarrow < 3.0.0.

I opted for an explicit check for the availability of iter_batches instead of a try/except block, for two reasons: the batches are consumed in a for loop, and an explicit check ensures that the two parsing mechanisms can't get mixed up if an unexpected exception occurs.
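For readers following along, the explicit availability check described above might look roughly like the sketch below. It is not the code from this PR: the function name `iter_record_batches` and the row-group fallback are assumptions made for illustration, and `pq_file` is assumed to behave like a `pyarrow.parquet.ParquetFile` (only `iter_batches`, `num_row_groups`, `read_row_group` and `Table.to_batches` are used).

```python
from typing import Any, Iterator, List, Optional

# 65_536 is the batch size quoted in the review of this PR.
DEFAULT_BATCH_SIZE = 65_536


def iter_record_batches(
    pq_file: Any,  # assumed to behave like pyarrow.parquet.ParquetFile
    batch_size: int = DEFAULT_BATCH_SIZE,
    columns: Optional[List[str]] = None,
    use_threads: bool = True,
) -> Iterator[Any]:
    """Yield record batches, preferring iter_batches when it exists.

    Hypothetical helper: the name and the fallback strategy are
    illustrative assumptions, not the implementation merged here.
    """
    # The mechanism is chosen once, before any batch is yielded, so an
    # exception raised while iterating can never cause a silent switch
    # to the other mechanism halfway through the file.
    if hasattr(pq_file, "iter_batches"):
        # Streaming path (pyarrow >= 3.0.0 according to the description).
        yield from pq_file.iter_batches(
            batch_size=batch_size, columns=columns, use_threads=use_threads
        )
        return

    # Assumed fallback for older pyarrow: load one row group at a time
    # and split it into batches of at most batch_size rows.
    for index in range(pq_file.num_row_groups):
        table = pq_file.read_row_group(
            index, columns=columns, use_threads=use_threads
        )
        yield from table.to_batches(max_chunksize=batch_size)
```

With a try/except around the same loop, a failure after some batches had already been yielded would re-enter the fallback and could emit rows twice, which is the mix-up the explicit check avoids.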

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@jaidisido
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-sDRE8Pq0duHT
  • Commit ID: 04111ea
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@maxispeicher maxispeicher marked this pull request as draft April 24, 2021 11:09
@jaidisido
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-sDRE8Pq0duHT
  • Commit ID: e03bfc1
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@maxispeicher maxispeicher marked this pull request as ready for review April 24, 2021 16:06
@jaidisido jaidisido self-requested a review April 26, 2021 19:00
@jaidisido jaidisido added the enhancement (New feature or request), minor release (Will be addressed in the next minor release), and ready to release labels Apr 26, 2021
@jaidisido jaidisido added this to the 2.8.0 milestone Apr 26, 2021
Contributor

@jaidisido jaidisido left a comment

LGTM!

    use_threads_flag: bool,
) -> Iterator[pa.RecordBatch]:
    if chunked is True:
        batch_size = 65_536
Contributor

I assume this is the default from the iter_batches method 👍🏼

Contributor Author

Yes it's the pyarrow default 👍
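
As a rough illustration of the point discussed above, the batch size could be derived from the `chunked` argument as in the sketch below. Only the `chunked is True` branch and the value 65_536 appear in the reviewed snippet; the helper name `resolve_batch_size`, the integer branch and the `ValueError` are assumptions for illustration.

```python
from typing import Union

# Batch size confirmed above as the pyarrow default.
PYARROW_DEFAULT_BATCH_SIZE = 65_536


def resolve_batch_size(chunked: Union[bool, int]) -> int:
    """Map a chunked argument to a batch size (hypothetical helper)."""
    # bool is a subclass of int in Python, so True must be handled
    # before the generic integer check.
    if chunked is True:
        return PYARROW_DEFAULT_BATCH_SIZE
    # Assumed behavior: a positive integer is used as the batch size.
    if isinstance(chunked, int) and not isinstance(chunked, bool) and chunked > 0:
        return chunked
    # Assumed behavior: anything else (False, 0, negatives) is rejected.
    raise ValueError(f"chunked must be True or a positive integer, got {chunked!r}")
```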

@jaidisido jaidisido merged commit b1c79e0 into aws:main Apr 26, 2021
@jaidisido jaidisido linked an issue Apr 26, 2021 that may be closed by this pull request
@maxispeicher maxispeicher deleted the revise_parquet_chunking branch April 27, 2021 07:20
Labels
enhancement (New feature or request), minor release (Will be addressed in the next minor release), ready to release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

read_parquet method in chunks is reading entire dataset into memory
2 participants