Description
Using wr.read_parquet(path, chunked=10) vs. wr.read_parquet(path, chunked=10000000) on an EC2 instance consumes the same amount of memory, which is essentially the memory needed to hold the entire dataset in memory.
wr.read_csv(chunksize=10) does not do this. The memory usage as you iterate over chunks is very small, as expected, which makes it possible to iterate over very large datasets.
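To make the comparison concrete, here is roughly the pattern I'm comparing. The bucket/key paths are placeholders, and I'm calling the readers through wr.s3 here, so adjust for however you import/alias the library:

import awswrangler as wr

# Both calls return generators of DataFrames when chunked / chunksize is set.
# Expectation: each yielded chunk should only need memory for that chunk's rows.
for df in wr.s3.read_parquet("s3://bucket/prefix/", chunked=10):  # placeholder path
    print(len(df))

for df in wr.s3.read_csv("s3://bucket/file.csv", chunksize=10):  # placeholder path
    print(len(df))

The CSV loop behaves as expected; the parquet loop is the one that ends up using memory proportional to the whole dataset.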
pyarrow 3.0 supports the iter_batches() method on ParquetFile, which works much like wr.read_csv() in that the memory impact is very low and yet iterating in chunks is very fast.
I feel the current implementation of read_parquet with an int chunked value is misleading: it looks like it should act like read_csv with chunksize, but it doesn't. Not even close. Perhaps consider using pyarrow for this so that we can iterate over very large datasets efficiently.
Here's an example of working code I used for iterating over one parquet file. This could easily be wrapped to iterate over multiple files in an S3 location; see the sketch after the snippet.
import pyarrow.parquet as pq
from smart_open import open

# You can use smart_open, s3fs, boto3, or whatever library you want to open the
# path, since the ParquetFile class below accepts any file-like object.
path_like_object = open("s3://bucket/key_to_single_file.parquet", "rb")
pfile = pq.ParquetFile(path_like_object)

# iter_batches() yields pyarrow RecordBatch objects of at most batch_size rows,
# so only one batch needs to be materialized in memory at a time.
for batch in pfile.iter_batches(batch_size=100):
    df = batch.to_pandas()
    print(df)
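As a rough sketch of what wrapping this over multiple files could look like, using s3fs to list a prefix (the prefix is a placeholder; the listing could just as well be done with boto3):

import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()

# List every parquet object under a prefix (placeholder path) and stream each
# one in small record batches, so only batch_size rows are in memory at a time.
for key in fs.glob("bucket/prefix/*.parquet"):
    with fs.open(key, "rb") as f:
        pfile = pq.ParquetFile(f)
        for batch in pfile.iter_batches(batch_size=100):
            df = batch.to_pandas()
            print(df)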