Possible expand argument for group_by #4392

dannyparsons · 2019-05-28T16:30:22Z

At some point in the discussion of preserving zero-length groups there was a comment by @hadley about separating out preserving zero-length groups of a factor and zero-length groups generated by combinations that don't appear in the data:

Maybe group_by needs both drop and expand arguments? drop = FALSE would keep all size zero groups generated by factor levels that don't appear in the data. expand = TRUE would keep all size zero groups generated by combinations of values that don't appear in the data.

Is there still an intention to implement something along these lines? I completely agree that expanding combinations is a different process to preserving empty levels.
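To make the distinction concrete, here is a minimal sketch on a made-up tibble (`toy`, `f` and `g` are illustrative names, not part of the data below). It shows that the existing `.drop = FALSE` does both things at once: it keeps the unused level and it also expands to unobserved combinations.

```r
library(dplyr, warn.conflicts = FALSE)

# Illustrative data: factor f has an unused level "c";
# the combination (b, x) never occurs in the rows.
toy <- tibble(
  f = factor(c("a", "a", "b"), levels = c("a", "b", "c")),
  g = factor(c("x", "y", "y"))
)

# Default .drop = TRUE: only the combinations observed in the data.
observed <- toy %>% group_by(f, g) %>% group_keys()

# .drop = FALSE: every combination of factor levels, so the unused
# level "c" and the unobserved pair (b, x) both show up.
expanded <- toy %>% group_by(f, g, .drop = FALSE) %>% group_keys()

nrow(observed)
nrow(expanded)
```

A separate `expand` argument, as suggested in the quote, would let these two effects be requested independently; no such argument exists in `group_by()` as far as this thread shows.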

The examples I work with where this matters are time series data recorded at multiple sites, such as daily weather data. Often the time period is very different for each site. So, when using site and year as grouping factors, I am only interested in the years I have data for at each site.

In the example below, site a has 10 years and site b has 1 year so I'd always like 11 rows in the summary. If some years get filtered out during the calculation, I'd like to keep those levels (to know that the event didn't occur that year) but I don't want the combinations of site and year that didn't appear in the data to be added as well.

So .drop = TRUE removes some rows that are needed and .drop = FALSE adds combinations which were not in the original data.

I know there are workarounds, just as there were for a single factor before .drop existed, but none of them is very clean.
This is where I felt the further addition of an expand argument would be very useful.

library(tidyverse)
set.seed(1)

df <- tibble(site = factor(c(rep("a", 120), rep("b", 12))),
             date = c(seq.Date(as.Date("2000/1/1"), by = "month", length.out = 120), seq.Date(as.Date("2000/1/1"), by = "month", length.out = 12)),
             year = factor(lubridate::year(date)),
             value = rnorm(132, 50, 10))

df %>% 
  filter(value > 65) %>%
  group_by(site, year, .drop = TRUE) %>%
  summarise(f = first(date))
#> # A tibble: 6 x 3
#> # Groups:   site [1]
#>   site  year  f         
#>   <fct> <fct> <date>    
#> 1 a     2000  2000-04-01
#> 2 a     2004  2004-08-01
#> 3 a     2005  2005-01-01
#> 4 a     2007  2007-11-01
#> 5 a     2008  2008-10-01
#> 6 a     2009  2009-02-01

df %>% 
  filter(value > 65) %>%
  group_by(site, year, .drop = FALSE) %>%
  summarise(f = first(date))
#> # A tibble: 20 x 3
#> # Groups:   site [2]
#>    site  year  f         
#>    <fct> <fct> <date>    
#>  1 a     2000  2000-04-01
#>  2 a     2001  NA        
#>  3 a     2002  NA        
#>  4 a     2003  NA        
#>  5 a     2004  2004-08-01
#>  6 a     2005  2005-01-01
#>  7 a     2006  NA        
#>  8 a     2007  2007-11-01
#>  9 a     2008  2008-10-01
#> 10 a     2009  2009-02-01
#> 11 b     2000  NA        
#> 12 b     2001  NA        
#> 13 b     2002  NA        
#> 14 b     2003  NA        
#> 15 b     2004  NA        
#> 16 b     2005  NA        
#> 17 b     2006  NA        
#> 18 b     2007  NA        
#> 19 b     2008  NA        
#> 20 b     2009  NA

romainfrancois · 2019-05-30T12:15:05Z

Let's delay that until at least 0.9. Chances are group_by() will be much simpler and based on functions from vctrs rather than being internally implemented in C++.

This will make it simpler to implement alternative versions of group_by() either here or in other packages.

romainfrancois · 2021-04-21T15:10:32Z

library(dplyr, warn.conflicts = FALSE)
set.seed(1)

df <- tibble(site = factor(c(rep("a", 120), rep("b", 12))),
             date = c(seq.Date(as.Date("2000/1/1"), by = "month", length.out = 120), seq.Date(as.Date("2000/1/1"), by = "month", length.out = 12)),
             year = factor(lubridate::year(date)),
             value = rnorm(132, 50, 10))

It seems that doing the grouping on the entire df gives you the grouping structure you want, i.e. 11 groups:

df %>% group_by(site, year) %>% group_keys()
#> # A tibble: 11 x 2
#>    site  year 
#>    <fct> <fct>
#>  1 a     2000 
#>  2 a     2001 
#>  3 a     2002 
#>  4 a     2003 
#>  5 a     2004 
#>  6 a     2005 
#>  7 a     2006 
#>  8 a     2007 
#>  9 a     2008 
#> 10 a     2009 
#> 11 b     2000

so then you can use filter(.preserve = TRUE) to filter but preserve that structure, so that the summarise() operates on these 11 groups:

df %>% 
  group_by(site, year, .drop = TRUE) %>%
  filter(value > 65, .preserve = TRUE) %>%
  summarise(f = first(date))
#> `summarise()` has grouped output by 'site'. You can override using the `.groups` argument.
#> # A tibble: 11 x 3
#> # Groups:   site [2]
#>    site  year  f         
#>    <fct> <fct> <date>    
#>  1 a     2000  2000-04-01
#>  2 a     2001  NA        
#>  3 a     2002  NA        
#>  4 a     2003  NA        
#>  5 a     2004  2004-08-01
#>  6 a     2005  2005-01-01
#>  7 a     2006  NA        
#>  8 a     2007  2007-11-01
#>  9 a     2008  2008-10-01
#> 10 a     2009  2009-02-01
#> 11 b     2000  NA

Created on 2021-04-21 by the reprex package (v0.3.0)
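To see the effect of `.preserve` in isolation, here is a minimal sketch on a made-up three-row tibble (`toy` and its values are illustrative, not taken from the data above). It compares the number of groups left after filtering with and without `.preserve = TRUE`.

```r
library(dplyr, warn.conflicts = FALSE)

# Illustrative data: three site/year groups, only one row passes the filter.
toy <- tibble(
  site  = c("a", "a", "b"),
  year  = c(2000, 2001, 2000),
  value = c(70, 40, 30)
)

grouped <- group_by(toy, site, year)

# Default: the grouping structure is recomputed from the surviving rows.
dropped <- filter(grouped, value > 65)

# .preserve = TRUE: the original groups are kept, even when they become empty.
kept <- filter(grouped, value > 65, .preserve = TRUE)

n_groups(dropped)
n_groups(kept)

# Empty groups still produce a row in the summary, with a count of 0.
counts <- summarise(kept, n = n(), .groups = "drop")
counts
```

The important detail is the order of operations: the grouping has to be done on the unfiltered data, so that the groups to preserve already exist when `filter()` runs.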

In addition to that, we have new_grouped_df() that can be used to create your own grouped data frame with a bit of extra work, and perhaps a join:

library(dplyr, warn.conflicts = FALSE)

df <- data.frame(x = 1:2, g = 1:2)

if you group df by g you get 2 groups

(group_data <- group_by(df, g) %>% group_data())
#> # A tibble: 2 x 2
#>       g       .rows
#>   <int> <list<int>>
#> 1     1         [1]
#> 2     2         [1]

but you can tweak that structure if you like

my_group_data <- left_join(
  tibble(g = 1:3), group_data
)
#> Joining, by = "g"
g <- new_grouped_df(df, groups = my_group_data)
summarise(g, size = n())
#> # A tibble: 3 x 2
#>       g  size
#>   <int> <int>
#> 1     1     1
#> 2     2     1
#> 3     3     0

Created on 2021-04-21 by the reprex package (v2.0.0)
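The "perhaps a join" idea can also be used without touching the grouping structure at all. The following is only a sketch under assumptions: the tibble `obs`, the column `n_events` and the threshold are made up for illustration. It summarises the filtered rows and then joins the result back onto the site/year pairs that occur in the unfiltered data.

```r
library(dplyr, warn.conflicts = FALSE)

# Illustrative data: site "a" has two years, site "b" has one.
obs <- tibble(
  site  = c("a", "a", "a", "b"),
  year  = c(2000, 2001, 2001, 2000),
  value = c(70, 40, 80, 30)
)

# Every site/year pair present in the unfiltered data.
keys <- distinct(obs, site, year)

# Summarise only the rows that pass the filter.
events <- obs %>%
  filter(value > 65) %>%
  group_by(site, year) %>%
  summarise(n_events = n(), .groups = "drop")

# Join back onto the keys; pairs with no matching rows get a count of 0.
result <- keys %>%
  left_join(events, by = c("site", "year")) %>%
  mutate(n_events = coalesce(n_events, 0L))

result
```

Compared with `filter(.preserve = TRUE)`, this keeps the result an ordinary ungrouped tibble, at the cost of stating the key columns twice.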

dannyparsons · 2021-04-21T15:25:23Z

Many thanks for the detailed reply. This is great and answers exactly what I needed.

romainfrancois added this to the 0.9.0 milestone May 30, 2019

juangomezduaso mentioned this issue Jun 21, 2019

rray_summarise() r-lib/rray#231

Open

hadley added the feature (a feature request or enhancement) and grouping 👨‍👩‍👧‍👦 labels Dec 11, 2019

hadley mentioned this issue Jan 6, 2020

Support cartesian cross-product grouping (like group_by(.drop = FALSE), only better …) #4292

Closed

hadley removed this from the 1.0.0 milestone Mar 1, 2020

nathaneastwood mentioned this issue Mar 29, 2021

Empty grouping levels in the "groups" attribute #5830

Closed

romainfrancois closed this as completed Apr 21, 2021
