Handling large data #45645

Open
ag1805x opened this issue Mar 1, 2025 · 0 comments

Labels
Component: R Type: usage Issue is a user question

Comments

ag1805x commented Mar 1, 2025

Describe the usage question you have. Please include as many useful details as possible.

I'm working with 50 Parquet files (each ~800 MB, with ~380,000 rows and ~8 columns). I need to perform a grouped summarisation in R, something like:

group_by(sample_id, gene1, gene2) %>% 
  summarise(mean_importance = mean(importance), 
            mean_count = mean(n_count)) %>%
  pivot_wider(names_from = "sample_id", 
              values_from = c("mean_importance", "mean_count"), 
              names_sep = "__")

Here, pivot_wider() is not available via arrow, so just before it I need to collect() the data into a data frame and then apply pivot_wider(). As soon as I call collect(), I run into memory errors (core dumped, bad_alloc). What is the best way to handle data of this size without running out of memory? The experimental batch processing seemed like an option, but I cannot form the batches by random subsetting; ideally the data would be split by the group_by columns.
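
One idea I have been considering (only a rough sketch, untested at this scale; the "summarised" output path is a placeholder) is to skip the large collect() by letting Arrow write the aggregated, still long-format result to disk with write_dataset(), and pivoting only afterwards:

library(arrow)
library(dplyr)
library(tidyr)

ds <- open_dataset(list.files("path", full.names = TRUE))

# Let Arrow run the aggregation and stream the (much smaller) long-format
# result straight to Parquet instead of collect()ing it into R
ds %>%
  group_by(sample_id, gene1, gene2) %>%
  summarise(mean_importance = mean(importance),
            mean_count = mean(n_count)) %>%
  ungroup() %>%
  write_dataset("summarised", partitioning = "gene1")

# The summarised data should be far smaller, so it may be feasible to pivot
# it all at once, or one gene1 partition at a time
open_dataset("summarised") %>%
  collect() %>%
  pivot_wider(names_from = "sample_id",
              values_from = c("mean_importance", "mean_count"),
              names_sep = "__")

I am not sure, though, whether the hash aggregation itself stays within memory this way, or whether write_dataset() just moves the problem.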

I have also tried listing all the possible group combinations and then using mclapply to process them in parallel:

library(arrow)
library(dplyr)
library(tidyr)
library(parallel)

pq_files <- list.files("path", full.names = TRUE)
pq_files <- open_dataset(sources = pq_files)

# gene1 and gene2 here are character vectors of gene names (defined elsewhere)
grp_list <- expand_grid("gene1" = gene1, 
                        "gene2" = gene2) %>% 
  filter(gene1 != gene2)

res <- mclapply(X = 1:nrow(grp_list), 
                mc.cores = 60,
                FUN = function(i){
                  
                  pq_files %>% 
                    filter((gene1 == grp_list$gene1[i]) & (gene2 == grp_list$gene2[i])) %>% 
                    group_by(sample_id, gene1, gene2) %>% 
                    summarise(mean_importance = mean(importance), 
                              mean_count = mean(n_count)) %>%
                    collect() %>%
                    pivot_wider(names_from = "sample_id", 
                                values_from = c("mean_importance", "mean_count"), 
                                names_sep = "__")
                  
                })

But this means each of the nrow(grp_list) jobs re-reads the dataset from disk, I suppose. Is there a more efficient way to do this?
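
If per-group processing is the way to go, would it make sense to rewrite the files once, hive-partitioned by gene1 (assuming gene1 has a manageable number of distinct values; the "by_gene1" path is a placeholder), so that each job only reads its own partition instead of scanning all 50 files? Roughly:

library(arrow)
library(dplyr)

# One-off rewrite of the input files, partitioned by gene1
open_dataset(list.files("path", full.names = TRUE)) %>%
  write_dataset("by_gene1", partitioning = "gene1")

ds <- open_dataset("by_gene1")

# Filtering on the partition column should prune the scan to the matching
# files only
ds %>%
  filter(gene1 == grp_list$gene1[1], gene2 == grp_list$gene2[1]) %>%
  group_by(sample_id, gene1, gene2) %>%
  summarise(mean_importance = mean(importance),
            mean_count = mean(n_count)) %>%
  collect()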

Component(s)

R

ag1805x added the Type: usage label on Mar 1, 2025