Handling large data #45645

Open
ag1805x opened this issue Mar 1, 2025 · 0 comments

Labels
Component: R Type: usage Issue is a user question

Comments

ag1805x commented Mar 1, 2025

Describe the usage question you have. Please include as many useful details as possible.

I'm working with 50 Parquet files (each ~800 MB, with ~380,000 rows and ~8 columns). I need to perform a grouped summarisation in R, something like:

group_by(sample_id, gene1, gene2) %>% 
  summarise(mean_importance = mean(importance), 
            mean_count = mean(n_count)) %>%
  pivot_wider(names_from = "sample_id", 
              values_from = c("mean_importance", "mean_count"), 
              names_sep = "__")

Here, pivot_wider() is not available via arrow, so just before it I need to collect() the data into a data frame and then apply pivot_wider(). As soon as I call collect(), I run into memory errors (core dumped, bad_alloc). What is the best way to handle data of this size without running out of memory? The experimental batch processing seemed like an option, but I cannot form the batches by random subsetting; ideally the data would be split by the group_by columns.
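
One idea I have been considering (only a rough sketch, untested at this scale; the "summarised" output path is a placeholder) is to skip the large collect() by letting Arrow write the aggregated, still long-format result to disk with write_dataset(), and pivoting only afterwards:

library(arrow)
library(dplyr)
library(tidyr)

ds <- open_dataset(list.files("path", full.names = TRUE))

# Let Arrow run the aggregation and stream the (much smaller) long-format
# result straight to Parquet instead of collect()ing it into R
ds %>%
  group_by(sample_id, gene1, gene2) %>%
  summarise(mean_importance = mean(importance),
            mean_count = mean(n_count)) %>%
  ungroup() %>%
  write_dataset("summarised", partitioning = "gene1")

# The summarised data should be far smaller, so it may be feasible to pivot
# it all at once, or one gene1 partition at a time
open_dataset("summarised") %>%
  collect() %>%
  pivot_wider(names_from = "sample_id",
              values_from = c("mean_importance", "mean_count"),
              names_sep = "__")

I am not sure, though, whether the hash aggregation itself stays within memory this way, or whether write_dataset() just moves the problem.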

I have also tried listing all the possible group combinations and then using mclapply to process them in parallel:

library(arrow)
library(dplyr)
library(tidyr)
library(parallel)

pq_files <- list.files("path", full.names = TRUE)
pq_files <- open_dataset(sources = pq_files)

# gene1 and gene2 here are character vectors of gene names (defined elsewhere)
grp_list <- expand_grid("gene1" = gene1, 
                        "gene2" = gene2) %>% 
  filter(gene1 != gene2)

res <- mclapply(X = 1:nrow(grp_list), 
                mc.cores = 60,
                FUN = function(i){
                  
                  pq_files %>% 
                    filter((gene1 == grp_list$gene1[i]) & (gene2 == grp_list$gene2[i])) %>% 
                    group_by(sample_id, gene1, gene2) %>% 
                    summarise(mean_importance = mean(importance), 
                              mean_count = mean(n_count)) %>%
                    collect() %>%
                    pivot_wider(names_from = "sample_id", 
                                values_from = c("mean_importance", "mean_count"), 
                                names_sep = "__")
                  
                })

But this means each of the nrow(grp_list) jobs re-reads the dataset from disk, I suppose. Is there a more efficient way to do this?
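
If per-group processing is the way to go, would it make sense to rewrite the files once, hive-partitioned by gene1 (assuming gene1 has a manageable number of distinct values; the "by_gene1" path is a placeholder), so that each job only reads its own partition instead of scanning all 50 files? Roughly:

library(arrow)
library(dplyr)

# One-off rewrite of the input files, partitioned by gene1
open_dataset(list.files("path", full.names = TRUE)) %>%
  write_dataset("by_gene1", partitioning = "gene1")

ds <- open_dataset("by_gene1")

# Filtering on the partition column should prune the scan to the matching
# files only
ds %>%
  filter(gene1 == grp_list$gene1[1], gene2 == grp_list$gene2[1]) %>%
  group_by(sample_id, gene1, gene2) %>%
  summarise(mean_importance = mean(importance),
            mean_count = mean(n_count)) %>%
  collect()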

Component(s)

R

ag1805x added the Type: usage label on Mar 1, 2025