
read_parquet method in chunks is reading entire dataset into memory #660

Closed
@chariottrider

Description


Using wr.read_parquet(path, chunked=10) vs. wr.read_parquet(path, chunked=10000000) on an EC2 instance consumes the same amount of memory, roughly the memory needed to hold the entire dataset in memory.

wr.read_csv(chunksize=10) does not behave this way. Memory usage while iterating over chunks stays very small, as expected, which makes it possible to iterate over very large datasets.
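
For reference, here is a minimal sketch of how the two can be compared while watching process memory with psutil (the paths and chunk sizes below are placeholders, not my actual dataset):

import os

import awswrangler as wr
import psutil

def rss_mb():
    # resident set size of the current process, in MB
    return psutil.Process(os.getpid()).memory_info().rss / 1e6

parquet_path = "s3://bucket/parquet_prefix/"  # placeholder
csv_path = "s3://bucket/csv_prefix/"          # placeholder

print("baseline RSS:", rss_mb())

# chunked parquet read: RSS climbs to roughly the size of the whole dataset
for df in wr.read_parquet(parquet_path, chunked=10):
    break
print("after first parquet chunk:", rss_mb())

# chunked csv read: RSS stays close to the baseline
for df in wr.read_csv(csv_path, chunksize=10):
    break
print("after first csv chunk:", rss_mb())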

pyarrow 3.0 supports the ParquetFile.iter_batches() method, which works much like wr.read_csv(chunksize=...): the memory impact is very low, yet iterating in chunks is still very fast.

I feel the current implementation of read_parquet with an integer chunked value is misleading: it looks like it should behave like read_csv with chunksize, but it doesn't, not even close. Perhaps consider using pyarrow for this so that we can iterate over very large datasets efficiently.

Here's an example of working code I used for iterating over one parquet file. It could easily be wrapped to iterate over multiple files in an S3 location (a rough sketch of that follows the snippet).

import pyarrow.parquet as pq
from smart_open import open

# You can use smart_open, s3fs, boto3, or whatever library you like to open
# the path, since the ParquetFile class accepts any file-like object.
path_like_object = open("s3://bucket/key_to_single_file.parquet", "rb")

pfile = pq.ParquetFile(path_like_object)

# iter_batches yields pyarrow RecordBatch objects of at most batch_size rows
for batch in pfile.iter_batches(batch_size=100):
    df = batch.to_pandas()
    print(df)
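
A rough sketch of such a multi-file wrapper, assuming s3fs is used to list the keys (the bucket/prefix below is a placeholder):

import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()

def iter_parquet_batches(prefix, batch_size=100):
    # Yield pandas DataFrames, at most batch_size rows at a time, across
    # every parquet file found under the given S3 prefix.
    for key in fs.glob(f"{prefix}/*.parquet"):
        with fs.open(key, "rb") as f:
            pfile = pq.ParquetFile(f)
            for batch in pfile.iter_batches(batch_size=batch_size):
                yield batch.to_pandas()

for df in iter_parquet_batches("s3://bucket/prefix", batch_size=100):
    print(df)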

Metadata

Labels

bug (Something isn't working), minor release (Will be addressed in the next minor release), ready to release
