Description
Hi,
I would like to open an issue as we have seen quite unsatisfactory performance with the `read_parquet`
function. Our setup and data are described below:
- the data is in S3; there are 1164 individual date-time prefixes under the main folder, and the total size of all files is barely 25.6 MB. So there are quite a lot of small individual files organized into individual date-time prefixes
- the way we gather these files is by passing the path `s3://.../` and using the `partition_filter` argument. The function call looks like this:
```python
wr.s3.read_parquet(
    path,
    dataset=True,
    partition_filter=filter(),
)
```
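For context, `partition_filter` expects a callable that receives the partition values for each path as a dict of strings and returns whether that partition should be read. Our `filter()` factory returns something roughly along these lines (the partition column name and the date range here are purely illustrative, assuming Hive-style `key=value` prefixes):

```python
# Illustrative sketch only: assumes Hive-style partitioning such as
# s3://.../date=2021-01-01/... with a hypothetical "date" partition column.
def filter():
    start, end = "2021-01-01", "2021-01-31"
    # The returned callable receives the partition values as a dict of strings
    # and returns True for partitions that should be read.
    return lambda partition: start <= partition["date"] <= end
```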
I've run a couple of tests to verify whether there would be any speed improvement if I passed a list of prefixes for the function to combine instead of using the `partition_filter` (a rough sketch of that variant is below), but the gain was marginal. Enabling `use_threads=True` gave no improvement either. Overall it takes around 13 minutes to collect all the files... this is just too long. Downloading them with `aws s3 sync` takes a few seconds.
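For completeness, the prefix-list variant mentioned above looked roughly like this (the prefix names are placeholders for our real date-time prefixes):

```python
import awswrangler as wr

# Placeholder stand-ins for the real date-time prefixes under the main folder.
prefixes = [
    "s3://.../2021-01-01-00/",
    "s3://.../2021-01-01-01/",
    # ... ~1164 prefixes in total
]

# Expand each prefix to its object keys and let read_parquet combine them
# into a single DataFrame; use_threads=True is the default but was set explicitly.
keys = [key for prefix in prefixes for key in wr.s3.list_objects(prefix)]
df = wr.s3.read_parquet(keys, use_threads=True)
```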
Our main use case for operating on streams is in AWS Batch. We have some data loaders that use the data wrangler when we train our ML model in AWS Batch. We realized after some time that the main contributor to the extended training time is the part where the data is collected from AWS using the data wrangler (primarily `wr.s3.read_parquet`). Please also note that we're not talking about big data here. Most of our use cases are like the one described above.
At the moment we're wondering whether this can be optimized, or whether we should move away from the streaming approach and simply download the data onto the container for model training (roughly the approach sketched below). Could you give some advice? What's your take on that?
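For comparison, the download-first alternative we have in mind is roughly the following (the local directory is just an example):

```python
import subprocess
import pandas as pd

# Sync the whole prefix to local disk first; this is the "aws s3 sync" comparison
# mentioned above, which finishes in seconds for ~25.6 MB of data.
subprocess.run(["aws", "s3", "sync", "s3://.../", "/tmp/data/"], check=True)

# Then read all the downloaded files as one dataset via pandas/pyarrow.
df = pd.read_parquet("/tmp/data/")
```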