Add `bulk_read` option for reading large amounts of Parquet files quickly #2033
Conversation
Looks great, just a couple of comments
awswrangler/distributed/ray/datasources/arrow_parquet_base_datasource.py
Nice!
Feature or Bugfix

- Feature

Detail

- Adds a `bulk_read_parquet` parameter which uses the newly implemented `ArrowParquetBaseDatasource` when reading Parquet files. This won't check for any schema compatibility, resulting in significantly faster reading times. However, in order for that to work, the Parquet files must be uniform.
- Changes `validate_schema` so that it doesn't go through the schemas unless `validate_schema=True` (the gating idea is sketched below).
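The trade-off being gated here is that schema validation costs roughly one metadata round-trip per file before any data is read. This is not the PR's actual code, but a standalone pyarrow sketch of what making that check opt-in looks like:

```python
import pyarrow.parquet as pq


def read_many(paths, validate_schema=False):
    """Read many Parquet files, optionally verifying they share one schema."""
    if validate_schema:
        # This is the cost being gated: one footer/metadata fetch per file
        # before any actual data is read.
        first = pq.read_schema(paths[0])
        for path in paths[1:]:
            if not pq.read_schema(path).equals(first):
                raise ValueError(f"Schema mismatch in {path}")
    # With validation off, go straight to the data and simply assume the
    # files are uniform -- the same trade-off bulk_read_parquet makes.
    return pq.ParquetDataset(paths).read()
```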
The changes above have resulted in the following differences in performance when reading 1111 objects from S3 (a timing sketch follows the list):

1. `validate_schema=True` and `bulk_read_parquet=False`: 76 seconds (this has been the default behavior thus far)
2. `validate_schema=False` and `bulk_read_parquet=False` (AKA the closest equivalent to `ray.data.read_parquet(path).to_modin()`): 50 seconds
3. `bulk_read_parquet=True`: 18 seconds
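For reference, numbers like these could be gathered with a simple wall-clock harness along the following lines; the S3 prefix is a placeholder, and a Ray/Modin-backed environment where these parameters are available is assumed:

```python
import time

import awswrangler as wr

# Placeholder: a prefix holding 1111 uniform Parquet objects.
PATH = "s3://my-bucket/my-dataset/"


def timed(label: str, **kwargs) -> None:
    """Time a single wr.s3.read_parquet call with the given options."""
    start = time.perf_counter()
    wr.s3.read_parquet(path=PATH, **kwargs)
    print(f"{label}: {time.perf_counter() - start:.1f}s")


timed("1. validate, no bulk", validate_schema=True, bulk_read_parquet=False)
timed("2. no validate, no bulk", validate_schema=False, bulk_read_parquet=False)
timed("3. bulk read", bulk_read_parquet=True)
```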
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.