Description
Hi,
I would like to open an issue as we have seen quite unsatisfactory performance with the `read_parquet`
function. Our setup and data are described below:
- the data is in S3; there are 1164 individual date-time prefixes under the main folder, and the total size of all files is barely 25.6 MB. So there are quite a lot of small individual files organized into individual date-time prefixes
- the way we gather these files is by passing the path `s3://.../` and using the `partition_filter` argument. The function call looks like this:
```python
wr.s3.read_parquet(
    path,
    dataset=True,
    partition_filter=filter(),
)
```
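For context, `partition_filter` expects a callable that receives the partition values for each path as a dict of strings and returns whether that partition should be read. Our `filter()` factory returns something roughly along these lines (the partition column name and the date range here are purely illustrative, assuming Hive-style `key=value` prefixes):

```python
# Illustrative sketch only: assumes Hive-style partitioning such as
# s3://.../date=2021-01-01/... with a hypothetical "date" partition column.
def filter():
    start, end = "2021-01-01", "2021-01-31"
    # The returned callable receives the partition values as a dict of strings
    # and returns True for partitions that should be read.
    return lambda partition: start <= partition["date"] <= end
```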
I've run a couple of tests to verify whether there would be any speed improvement if I passed a list of prefixes for the function to combine instead of using the `partition_filter` (a rough sketch of that variant is below), but the gain was marginal. Enabling `use_threads=True` gave no improvement either. Overall it takes around 13 minutes to collect all the files... this is just too long. Downloading them with `aws s3 sync` takes a few seconds.
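For completeness, the prefix-list variant mentioned above looked roughly like this (the prefix names are placeholders for our real date-time prefixes):

```python
import awswrangler as wr

# Placeholder stand-ins for the real date-time prefixes under the main folder.
prefixes = [
    "s3://.../2021-01-01-00/",
    "s3://.../2021-01-01-01/",
    # ... ~1164 prefixes in total
]

# Expand each prefix to its object keys and let read_parquet combine them
# into a single DataFrame; use_threads=True is the default but was set explicitly.
keys = [key for prefix in prefixes for key in wr.s3.list_objects(prefix)]
df = wr.s3.read_parquet(keys, use_threads=True)
```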
Our main use case for operating on streams is in AWS Batch. We have some data loaders that use the data wrangler when we train our ML model in AWS Batch. We realized after some time that the main contributor to the extended training time is the part where the data is collected from AWS using the data wrangler (primarily `wr.s3.read_parquet`). Please also note that we're not talking about big data here. Most of our use cases are like the one described above.
At the moment we're wondering whether this can be optimized, or whether we should move away from the streaming approach and simply download the data onto the container for model training (roughly the approach sketched below). Could you give some advice? What's your take on that?
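For comparison, the download-first alternative we have in mind is roughly the following (the local directory is just an example):

```python
import subprocess
import pandas as pd

# Sync the whole prefix to local disk first; this is the "aws s3 sync" comparison
# mentioned above, which finishes in seconds for ~25.6 MB of data.
subprocess.run(["aws", "s3", "sync", "s3://.../", "/tmp/data/"], check=True)

# Then read all the downloaded files as one dataset via pandas/pyarrow.
df = pd.read_parquet("/tmp/data/")
```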