Possible expand argument for group_by #4392

Closed
dannyparsons opened this issue May 28, 2019 · 3 comments
Labels
feature a feature request or enhancement grouping 👨‍👩‍👧‍👦

@dannyparsons

At some point in the discussion of preserving zero-length groups, @hadley commented about separating two cases: zero-length groups for factor levels that don't appear in the data, and zero-length groups generated by combinations of values that don't appear in the data:

Maybe group_by needs both drop and expand arguments? drop = FALSE would keep all size zero groups generated by factor levels that don't appear in the data. expand = TRUE would keep all size zero groups generated by combinations of values that don't appear in the data.

Is there still an intention to implement something along these lines? I completely agree that expanding combinations is a different process to preserving empty levels.
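
To make the distinction concrete, here is a minimal sketch on a toy tibble. The data and the names `toy`, `dropped` and `kept` are made up for illustration. It only shows what the existing `.drop` argument does: with `.drop = FALSE`, dplyr fills in every combination of factor levels, so unused levels and unseen combinations are not treated separately.

```r
library(dplyr, warn.conflicts = FALSE)

# Toy data (hypothetical): level "c" of site never occurs,
# and the combination (b, 2001) never occurs either.
toy <- tibble(
  site = factor(c("a", "a", "b"), levels = c("a", "b", "c")),
  year = factor(c(2000, 2001, 2000))
)

# Default (.drop = TRUE): only the combinations present in the data
dropped <- toy %>% group_by(site, year) %>% group_keys()

# .drop = FALSE: every combination of the factor levels (3 sites x 2 years)
kept <- toy %>% group_by(site, year, .drop = FALSE) %>% group_keys()

nrow(dropped)
#> [1] 3
nrow(kept)
#> [1] 6
```

The three extra groups in `kept` mix both cases: `(b, 2001)` is an unseen combination, while the two groups for site `c` come from an unused factor level.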

The sort of examples I work with where this matters are time series recorded at multiple sites, such as daily weather data. Often the time period is very different for each site. So, when using site and year as grouping factors, I am only interested in the years I have data for at each site.

In the example below, site a has 10 years and site b has 1 year so I'd always like 11 rows in the summary. If some years get filtered out during the calculation, I'd like to keep those levels (to know that the event didn't occur that year) but I don't want the combinations of site and year that didn't appear in the data to be added as well.

So .drop = TRUE removes some rows that are needed and .drop = FALSE adds combinations which were not in the original data.

I know there are workarounds, as there were for a single factor before .drop existed, but none of them are very clean.
This is where I felt the addition of an expand argument would be very useful.

library(tidyverse)
set.seed(1)

df <- tibble(
  site = factor(c(rep("a", 120), rep("b", 12))),
  date = c(
    seq.Date(as.Date("2000/1/1"), by = "month", length.out = 120),
    seq.Date(as.Date("2000/1/1"), by = "month", length.out = 12)
  ),
  year = factor(lubridate::year(date)),
  value = rnorm(132, 50, 10)
)

df %>% 
  filter(value > 65) %>%
  group_by(site, year, .drop = TRUE) %>%
  summarise(f = first(date))
#> # A tibble: 6 x 3
#> # Groups:   site [1]
#>   site  year  f         
#>   <fct> <fct> <date>    
#> 1 a     2000  2000-04-01
#> 2 a     2004  2004-08-01
#> 3 a     2005  2005-01-01
#> 4 a     2007  2007-11-01
#> 5 a     2008  2008-10-01
#> 6 a     2009  2009-02-01

df %>% 
  filter(value > 65) %>%
  group_by(site, year, .drop = FALSE) %>%
  summarise(f = first(date))
#> # A tibble: 20 x 3
#> # Groups:   site [2]
#>    site  year  f         
#>    <fct> <fct> <date>    
#>  1 a     2000  2000-04-01
#>  2 a     2001  NA        
#>  3 a     2002  NA        
#>  4 a     2003  NA        
#>  5 a     2004  2004-08-01
#>  6 a     2005  2005-01-01
#>  7 a     2006  NA        
#>  8 a     2007  2007-11-01
#>  9 a     2008  2008-10-01
#> 10 a     2009  2009-02-01
#> 11 b     2000  NA        
#> 12 b     2001  NA        
#> 13 b     2002  NA        
#> 14 b     2003  NA        
#> 15 b     2004  NA        
#> 16 b     2005  NA        
#> 17 b     2006  NA        
#> 18 b     2007  NA        
#> 19 b     2008  NA        
#> 20 b     2009  NA
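
For comparison, a join-based workaround of the kind mentioned above might look like the following sketch. It uses a small made-up tibble rather than the weather data, and the names `obs`, `observed`, `hits` and `result` are illustrative only. The observed site/year combinations are taken from the unfiltered data, and the filtered summary is joined back onto them.

```r
library(dplyr, warn.conflicts = FALSE)

# Toy data (hypothetical): site a has two years, site b has one
obs <- tibble(
  site  = factor(c("a", "a", "a", "b")),
  year  = factor(c(2000, 2000, 2001, 2000)),
  value = c(70, 40, 30, 20)
)

# Site/year combinations that actually occur before filtering
observed <- distinct(obs, site, year)

# Summary of the filtered rows; groups with no hits disappear here
hits <- obs %>%
  filter(value > 65) %>%
  count(site, year, name = "n_hits")

# Join back so that every observed combination keeps a row
result <- observed %>%
  left_join(hits, by = c("site", "year")) %>%
  mutate(n_hits = coalesce(n_hits, 0L)) %>%
  arrange(site, year)
```

Here `result` has one row per observed combination (three rows), with `n_hits` equal to 0 where the filter removed every row, and no row for the unobserved combination `(b, 2001)`.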
romainfrancois added this to the 0.9.0 milestone on May 30, 2019
@romainfrancois
Member

Let's delay that until at least 0.9. Chances are group_by() will be much simpler and based on functions from vctrs rather than being internally implemented in C++.

This will make it simpler to implement alternative versions of group_by() either here or in other packages.

@romainfrancois
Member

library(dplyr, warn.conflicts = FALSE)
set.seed(1)

df <- tibble(
  site = factor(c(rep("a", 120), rep("b", 12))),
  date = c(
    seq.Date(as.Date("2000/1/1"), by = "month", length.out = 120),
    seq.Date(as.Date("2000/1/1"), by = "month", length.out = 12)
  ),
  year = factor(lubridate::year(date)),
  value = rnorm(132, 50, 10)
)

It seems that grouping the entire df gives you the grouping structure you want, i.e. 11 groups:

df %>% group_by(site, year) %>% group_keys()
#> # A tibble: 11 x 2
#>    site  year 
#>    <fct> <fct>
#>  1 a     2000 
#>  2 a     2001 
#>  3 a     2002 
#>  4 a     2003 
#>  5 a     2004 
#>  6 a     2005 
#>  7 a     2006 
#>  8 a     2007 
#>  9 a     2008 
#> 10 a     2009 
#> 11 b     2000

You can then use filter(.preserve = TRUE) to filter while preserving that structure, so that summarise() operates on these 11 groups:

df %>% 
  group_by(site, year, .drop = TRUE) %>%
  filter(value > 65, .preserve = TRUE) %>%
  summarise(f = first(date))
#> `summarise()` has grouped output by 'site'. You can override using the `.groups` argument.
#> # A tibble: 11 x 3
#> # Groups:   site [2]
#>    site  year  f         
#>    <fct> <fct> <date>    
#>  1 a     2000  2000-04-01
#>  2 a     2001  NA        
#>  3 a     2002  NA        
#>  4 a     2003  NA        
#>  5 a     2004  2004-08-01
#>  6 a     2005  2005-01-01
#>  7 a     2006  NA        
#>  8 a     2007  2007-11-01
#>  9 a     2008  2008-10-01
#> 10 a     2009  2009-02-01
#> 11 b     2000  NA        

Created on 2021-04-21 by the reprex package (v0.3.0)
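
As a smaller illustration of the same idea, the sketch below counts the matching rows per group on a made-up tibble (the names `obs` and `counts` are hypothetical). Because the data is grouped before filtering and `.preserve = TRUE` keeps the grouping structure, groups that lose all their rows still appear, with a count of zero.

```r
library(dplyr, warn.conflicts = FALSE)

# Toy data (hypothetical): three site/year combinations
obs <- tibble(
  site  = c("a", "a", "a", "b"),
  year  = c(2000L, 2000L, 2001L, 2000L),
  value = c(70, 40, 30, 20)
)

counts <- obs %>%
  group_by(site, year) %>%                  # 3 groups from the full data
  filter(value > 65, .preserve = TRUE) %>%  # keep the groups, drop the rows
  summarise(n_hits = n(), .groups = "drop")

counts$n_hits
#> [1] 1 0 0
```

Without `.preserve = TRUE`, the grouping is recomputed after filtering and only the `(a, 2000)` group would remain.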

In addition to that, we have new_grouped_df() that can be used to create your own grouped data frame with a bit of extra work, and perhaps a join:

library(dplyr, warn.conflicts = FALSE)

df <- data.frame(x = 1:2, g = 1:2)

If you group df by g, you get 2 groups:

(group_data <- group_by(df, g) %>% group_data())
#> # A tibble: 2 x 2
#>       g       .rows
#>   <int> <list<int>>
#> 1     1         [1]
#> 2     2         [1]

But you can tweak that structure if you like:

my_group_data <- left_join(
  tibble(g = 1:3), group_data
)
#> Joining, by = "g"
g <- new_grouped_df(df, groups = my_group_data)
summarise(g, size = n())
#> # A tibble: 3 x 2
#>       g  size
#>   <int> <int>
#> 1     1     1
#> 2     2     1
#> 3     3     0

Created on 2021-04-21 by the reprex package (v2.0.0)

@dannyparsons
Author

Many thanks for the detailed reply. This is great and answers exactly what I needed.
