Describe the usage question you have. Please include as many useful details as possible.
I'm working with 50 Parquet files (each ~800 MB, with ~380,000 rows and ~8 columns). I need to perform a grouped summarisation in R. Something like:
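(A minimal sketch of the kind of pipeline meant here; the column names `gene`, `sample`, and `value` and the dataset path are placeholders, not the real ones.)

```r
library(arrow)
library(dplyr)
library(tidyr)

# Open all 50 Parquet files as one Arrow dataset (no data is read yet)
ds <- open_dataset("path/to/parquet_dir")

wide <- ds %>%
  group_by(gene, sample) %>%
  summarise(mean_value = mean(value)) %>%
  # pivot_wider() is not supported by the arrow backend, so the result
  # has to be pulled into R first ...
  collect() %>%
  # ... and this collect() is where the bad_alloc / core dump happens
  pivot_wider(names_from = sample, values_from = mean_value)
```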
Here, pivot_wider() is not available via arrow, so just before it I need to collect() the data into a data frame and then apply pivot_wider(). As soon as I call collect(), I run into memory errors (core dumped, bad_alloc). What is the best way to handle data this large without running out of memory? The experimental batch processing seemed like an option, but I cannot make the batches by random subsetting; ideally I would subset on the group_by columns.
I was trying out listing all the possible groups and then using mclapply to process the data, but I suppose this means repeatedly reading the files from disk. Is there a more efficient way to do this?
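(For reference, a rough sketch of that per-group approach, again with placeholder names `gene`, `sample`, and `value`. The filter() runs in arrow before collect(), so each task should only materialise the rows for its own group, but the dataset is re-scanned for every group.)

```r
library(arrow)
library(dplyr)
library(tidyr)
library(parallel)

path <- "path/to/parquet_dir"
ds <- open_dataset(path)

# Enumerate the distinct values of the grouping column.
# distinct() is evaluated by arrow (recent releases), so only the
# unique values are brought back into R.
genes <- ds %>%
  distinct(gene) %>%
  collect() %>%
  pull(gene)

# Process one group per task. The dataset is re-opened inside each
# worker to avoid sharing Arrow's C++ state across forked processes.
# Memory use scales with mc.cores, since each worker holds one
# collected group at a time.
results <- mclapply(genes, function(g) {
  open_dataset(path) %>%
    filter(gene == g) %>%
    group_by(gene, sample) %>%
    summarise(mean_value = mean(value)) %>%
    collect() %>%
    pivot_wider(names_from = sample, values_from = mean_value)
}, mc.cores = 4)

final <- bind_rows(results)
```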
Component(s)
R