Add `bulk_read` option for reading large amounts of Parquet files quickly #2033

LeonLuttenberger · 2023-02-17T18:04:17Z

Feature or Bugfix

Feature

Detail

Added a bulk_read_parquet parameter that uses the newly implemented ArrowParquetBaseDatasource when reading Parquet files. This datasource does not check schema compatibility across files, which makes reading significantly faster. The trade-off is that it only works when the Parquet files are uniform.
Changed the behavior of validate_schema so that the schemas are only inspected when validate_schema=True.
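
To illustrate the validate_schema change, here is a minimal sketch in plain Python. It is not the code from this PR: schemas are modeled as simple dicts, the names check_schemas_match and plan_read are hypothetical, and falling back to the first file's schema is an assumption made only for the example.

```python
from typing import Dict, List

Schema = Dict[str, str]  # column name -> type name (simplified stand-in for a real schema)


def check_schemas_match(schemas: List[Schema]) -> None:
    """Raise ValueError if any schema differs from the first one."""
    first = schemas[0]
    for index, schema in enumerate(schemas[1:], start=1):
        if schema != first:
            raise ValueError(f"Schema of file {index} differs from file 0")


def plan_read(schemas: List[Schema], validate_schema: bool = False) -> Schema:
    """Pick the schema to read with; only walk every schema when asked to."""
    if not schemas:
        raise ValueError("No files to read")
    if validate_schema:
        # The costly pass over all files happens only on request.
        check_schemas_match(schemas)
    # Assumption for this sketch: the files are uniform, so the first schema is used.
    return schemas[0]
```

With validate_schema=False the per-file comparison is skipped entirely, which is where the time saving in the sketch comes from; mismatched files would then go unnoticed.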

The changes above have resulted in the following differences in performance when reading 1111 objects from S3:

validate_schema=True and bulk_read_parquet=False: 76 seconds (this has been the default behavior thus far)
validate_schema=False and bulk_read_parquet=False (the closest equivalent to ray.data.read_parquet(path).to_modin()): 50 seconds
bulk_read_parquet=True: 18 seconds
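
For a quick sense of scale, the three timings above can be turned into speedup factors relative to the previous default. The snippet below only restates the numbers reported in this PR; the dictionary keys and the speedup helper are names chosen for the example.

```python
# Wall-clock timings reported in this PR for reading 1111 objects from S3, in seconds.
timings = {
    "validate_schema=True, bulk_read_parquet=False": 76,
    "validate_schema=False, bulk_read_parquet=False": 50,
    "bulk_read_parquet=True": 18,
}

# The previous default behavior is used as the reference point.
baseline = timings["validate_schema=True, bulk_read_parquet=False"]


def speedup(seconds: float, reference: float = baseline) -> float:
    """Return how many times faster a run is than the reference run."""
    return reference / seconds


speedups = {name: round(speedup(seconds), 2) for name, seconds in timings.items()}
for name, factor in speedups.items():
    print(f"{name}: {factor}x relative to the previous default")
```

On these numbers, skipping schema validation alone gives roughly a 1.5x improvement, and the bulk read path roughly 4.2x.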

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

jaidisido

Looks great, just a couple of comments

awswrangler/distributed/ray/datasources/arrow_parquet_base_datasource.py

awswrangler/s3/_read_parquet.py

malachi-constant · 2023-02-27T21:14:27Z

AWS CodeBuild CI Report

CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
Commit ID: 5d72d4d
Result: SUCCEEDED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

malachi-constant · 2023-02-27T21:20:34Z

AWS CodeBuild CI Report

CodeBuild project: GitHubStandardCodeBuild8C06-llutOAimTATs
Commit ID: 5d72d4d
Result: SUCCEEDED
Build Logs (available for 30 days)

malachi-constant · 2023-02-27T22:06:47Z

AWS CodeBuild CI Report

CodeBuild project: GitHubLoadTests5656BB24-ATYtnXPE7MOa
Commit ID: 5d72d4d
Result: SUCCEEDED
Build Logs (available for 30 days)

kukushking

Nice!

kukushking and others added 6 commits February 16, 2023 15:13

Checkpoint

a600a34

fix formatting

50d6b10

Add argument for bulk_read_parquet

27d93a1

Add bulk_read_parquet case to test_s3_read_parquet_simple

f99a609

Fix read_table

21da07b

Merge branch 'release-3.0.0' into dist/optimize-parquet-dataset-add-b…

9eb59cb

fix format

ae912c4

Add amazon reviews test

a94bb19

remove extra skip

5fdb85f

add dataset=True

ff2f3f7

LeonLuttenberger added the performance label Feb 24, 2023

LeonLuttenberger added this to the 3.0.0 milestone Feb 24, 2023

LeonLuttenberger linked an issue Feb 24, 2023 that may be closed by this pull request

(@scale): Reading a large number of small S3 objects is slow and might eventually fail #1982

Closed

4 tasks

jaidisido reviewed Feb 24, 2023

awswrangler/distributed/ray/datasources/arrow_parquet_base_datasource.py (resolved)

awswrangler/s3/_read_parquet.py (outdated, resolved)

awswrangler/s3/_read_parquet.py (outdated, resolved)

LeonLuttenberger changed the title from "[DRAFT] Switch between two Parquet data sources" to "Add bulk_read option for reading large amounts of Parquet files quickly" Feb 24, 2023

LeonLuttenberger marked this pull request as ready for review February 24, 2023 19:25

malachi-constant approved these changes Feb 24, 2023

Merge branch 'release-3.0.0' into dist/optimize-parquet-dataset-add-b…

bca43f5

LeonLuttenberger mentioned this pull request Feb 27, 2023

Add ray_modin_args parameter for encapsulating distributed args #2063

Closed

Rename bulk_read

cd3faa0

Fix test_s3_read_parquet_many_files

5d72d4d

jaidisido approved these changes Feb 28, 2023

kukushking approved these changes Feb 28, 2023

LeonLuttenberger merged commit 04956d9 into release-3.0.0 Feb 28, 2023

LeonLuttenberger deleted the dist/optimize-parquet-dataset-add-base-lutleon branch February 28, 2023 14:31

jaidisido linked an issue Mar 2, 2023 that may be closed by this pull request

Consider using ray.data.read_parquet_bulk when possible #2023

Closed

jaidisido mentioned this pull request Mar 9, 2023

(@scale): Reading a large number of small S3 objects is slow and might eventually fail #1982

Closed

4 tasks
