
read_parquet method in chunks is reading entire dataset into memory #660

Closed
@chariottrider

Description


Using wr.read_parquet(path, chunked=10) vs. wr.read_parquet(path, chunked=10000000) on an EC2 instance consumes the same amount of memory, roughly the memory needed to hold the entire dataset in memory.

wr.read_csv(chunksize=10) does not behave this way. Memory usage while iterating over chunks stays very small, as expected, which makes it possible to iterate over very large datasets.
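
For reference, here is a minimal sketch of how the two can be compared while watching process memory with psutil (the paths and chunk sizes below are placeholders, not my actual dataset):

import os

import awswrangler as wr
import psutil

def rss_mb():
    # resident set size of the current process, in MB
    return psutil.Process(os.getpid()).memory_info().rss / 1e6

parquet_path = "s3://bucket/parquet_prefix/"  # placeholder
csv_path = "s3://bucket/csv_prefix/"          # placeholder

print("baseline RSS:", rss_mb())

# chunked parquet read: RSS climbs to roughly the size of the whole dataset
for df in wr.read_parquet(parquet_path, chunked=10):
    break
print("after first parquet chunk:", rss_mb())

# chunked csv read: RSS stays close to the baseline
for df in wr.read_csv(csv_path, chunksize=10):
    break
print("after first csv chunk:", rss_mb())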

pyarrow 3.0 supports the ParquetFile.iter_batches() method, which works much like wr.read_csv(chunksize=...): the memory impact is very low, yet iterating in chunks is still very fast.

I feel the current implementation of read_parquet with an integer chunked value is misleading: it looks like it should behave like read_csv with chunksize, but it doesn't, not even close. Perhaps consider using pyarrow for this so that we can iterate over very large datasets efficiently.

Here's an example of working code I used for iterating over one parquet file. It could easily be wrapped to iterate over multiple files in an S3 location (a rough sketch of that follows the snippet).

import pyarrow.parquet as pq
from smart_open import open

# You can use smart_open, s3fs, boto3, or whatever library you like to open
# the path, since the ParquetFile class accepts any file-like object.
path_like_object = open("s3://bucket/key_to_single_file.parquet", "rb")

pfile = pq.ParquetFile(path_like_object)

# iter_batches yields pyarrow RecordBatch objects of at most batch_size rows
for batch in pfile.iter_batches(batch_size=100):
    df = batch.to_pandas()
    print(df)
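
A rough sketch of such a multi-file wrapper, assuming s3fs is used to list the keys (the bucket/prefix below is a placeholder):

import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()

def iter_parquet_batches(prefix, batch_size=100):
    # Yield pandas DataFrames, at most batch_size rows at a time, across
    # every parquet file found under the given S3 prefix.
    for key in fs.glob(f"{prefix}/*.parquet"):
        with fs.open(key, "rb") as f:
            pfile = pq.ParquetFile(f)
            for batch in pfile.iter_batches(batch_size=batch_size):
                yield batch.to_pandas()

for df in iter_parquet_batches("s3://bucket/prefix", batch_size=100):
    print(df)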

Metadata

Labels

bug (Something isn't working), minor release (Will be addressed in the next minor release), ready to release
