Slow performance when using read_parquet from s3 #644

Closed
@konradsemsch

Description

Hi,

I would like to open an issue, as we have seen quite unsatisfactory performance using the read_parquet function. Our setup and data are described below:

  • the data is in S3, spread across 1164 individual date-time prefixes under the main folder, and the total size of all files is barely 25.6 MB. So there are quite a lot of small individual files, organized into individual date-time prefixes
  • the way we gather these files is by passing path: s3://.../ and using the partition_filter. The function call looks like this:
import awswrangler as wr

wr.s3.read_parquet(
    path,
    dataset=True,
    partition_filter=filter(),
)
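
For context, partition_filter expects a callable that receives a dict with each partition's values (as strings) and returns True for the partitions to keep; filter() above is a small helper of ours that builds such a callable. A simplified sketch of an equivalent (the dt partition name, the date value, and the bucket path are just placeholders):

import awswrangler as wr

def keep_day(partition):
    # "dt" is a made-up partition column name; values arrive as strings
    return partition["dt"].startswith("2021-03-01")

df = wr.s3.read_parquet(
    "s3://my-bucket/my-dataset/",  # placeholder for our real prefix
    dataset=True,
    partition_filter=keep_day,
)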

I've run a couple of tests to check whether there would be any speed improvement if I passed a list of prefixes for the function to combine instead of using the partition_filter, but the gain was marginal. Enabling use_threads=True gave no improvement either. Overall it takes around 13 minutes to collect all files... this is just too long. Downloading them with aws s3 sync takes a few seconds.
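
Concretely, the two variants I tried looked roughly like this (bucket, dataset, and partition names are placeholders):

import awswrangler as wr

# variant 1: pass an explicit list of prefixes instead of a partition_filter
paths = [
    "s3://my-bucket/my-dataset/dt=2021-03-01-00/",
    "s3://my-bucket/my-dataset/dt=2021-03-01-01/",
]
df = wr.s3.read_parquet(paths)

# variant 2: keep the partition_filter but set use_threads explicitly
df = wr.s3.read_parquet(
    "s3://my-bucket/my-dataset/",
    dataset=True,
    partition_filter=lambda p: p["dt"].startswith("2021-03-01"),
    use_threads=True,
)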

Our main use case for operating on streams is AWS Batch: we have some data loaders that use the data wrangler when we train our ML models there. We realized after some time that the main contributor to the extended training time is the part where the data is collected from AWS using the data wrangler (primarily wr.s3.read_parquet). Please also note that we're not talking about big data here; most of our use cases look like the one described above.

At the moment we're wondering whether this can be optimized, or if we should move away from the streaming approach and simply download the data onto the container for model training. Could you give some advice? What's your take on that?
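
For reference, the download-first alternative we have in mind would look roughly like this (paths are placeholders): sync the prefix to local disk with the AWS CLI, then read the files locally.

import glob
import subprocess

import pandas as pd

# placeholder paths; assumes the AWS CLI is available on the container
subprocess.run(
    ["aws", "s3", "sync", "s3://my-bucket/my-dataset/", "/tmp/my-dataset/"],
    check=True,
)

# read and concatenate the downloaded parquet files; note that partition
# values encoded in the prefixes are not added as columns here, unlike
# with dataset=True
files = glob.glob("/tmp/my-dataset/**/*.parquet", recursive=True)
df = pd.concat((pd.read_parquet(f) for f in files), ignore_index=True)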
