Possible expand argument for group_by #4392
Let's delay that until at least 0.9. Chances are This will make it simpler to implement alternative versions of |
library(dplyr, warn.conflicts = FALSE)
set.seed(1)
df <- tibble(site = factor(c(rep("a", 120), rep("b", 12))),
date = c(seq.Date(as.Date("2000/1/1"), by = "month", length.out = 120), seq.Date(as.Date("2000/1/1"), by = "month", length.out = 12)),
year = factor(lubridate::year(date)),
value = rnorm(132, 50, 10))
it seems that doing the grouping on the entire
df %>% group_by(site, year) %>% group_keys()
#> # A tibble: 11 x 2
#> site year
#> <fct> <fct>
#> 1 a 2000
#> 2 a 2001
#> 3 a 2002
#> 4 a 2003
#> 5 a 2004
#> 6 a 2005
#> 7 a 2006
#> 8 a 2007
#> 9 a 2008
#> 10 a 2009
#> 11 b 2000
so then you can use
df %>%
group_by(site, year, .drop = TRUE) %>%
filter(value > 65) %>%
summarise(f = first(date))
#> `summarise()` has grouped output by 'site'. You can override using the `.groups` argument.
#> # A tibble: 6 x 3
#> # Groups: site [1]
#> site year f
#> <fct> <fct> <date>
#> 1 a 2000 2000-04-01
#> 2 a 2004 2004-08-01
#> 3 a 2005 2005-01-01
#> 4 a 2007 2007-11-01
#> 5 a 2008 2008-10-01
#> 6 a 2009 2009-02-01
Created on 2021-04-21 by the reprex package (v0.3.0)
In addition to that, we have
library(dplyr, warn.conflicts = FALSE)
df <- data.frame(x = 1:2, g = 1:2)
if you group
(group_data <- group_by(df, g) %>% group_data())
#> # A tibble: 2 x 2
#> g .rows
#> <int> <list<int>>
#> 1 1 [1]
#> 2 2 [1]
but you can tweak that structure if you like
my_group_data <- left_join(
tibble(g = 1:3), group_data
)
#> Joining, by = "g"
g <- new_grouped_df(df, groups = my_group_data)
summarise(g, size = n())
#> # A tibble: 3 x 2
#> g size
#> <int> <int>
#> 1 1 1
#> 2 2 1
#> 3 3 0
Created on 2021-04-21 by the reprex package (v2.0.0) |
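The first suggestion above (take the keys from the unfiltered data, then summarise) can also be finished with a plain join instead of a hand-built grouped data frame. The following is only a sketch of that idea, not the approach proposed in the comment: the small `df`, the `n_events` column and the `value > 65` threshold applied to it are made up for illustration.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical data: site "a" has three years, site "b" has one.
df <- tibble(
  site  = factor(c("a", "a", "a", "b")),
  year  = factor(c(2000, 2001, 2002, 2000)),
  value = c(70, 40, 80, 30)
)

# Keys observed in the unfiltered data: one row per (site, year) pair.
keys <- df %>%
  group_by(site, year) %>%
  group_keys()

# Count events per group among the rows that pass the filter.
events <- df %>%
  filter(value > 65) %>%
  group_by(site, year) %>%
  summarise(n_events = n(), .groups = "drop")

# Join back onto the observed keys so that groups emptied by the filter
# reappear with a count of zero, without adding unobserved combinations.
result <- keys %>%
  left_join(events, by = c("site", "year")) %>%
  mutate(n_events = coalesce(n_events, 0L))

result
```

The join keeps exactly the site and year pairs that occur in `df`, so the row count of `result` matches the number of observed pairs regardless of what the filter removes.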
Many thanks for the detailed reply, this is great, answers exactly what I was needing. |
At some point in the discussion of preserving zero-length groups there was a comment by @hadley about separating out preserving zero-length groups of a factor and zero-length groups generated by combinations that don't appear in the data:
Is there still an intention to implement something along these lines? I completely agree that expanding combinations is a different process to preserving empty levels.
The sort of examples I work with where this matters are time series data recorded at multiple sites, such as daily weather data. Often the time period is very different for each site. So, when using site and year as grouping factors, I am only interested in the years I have data for at each site.
In the example below, site `a` has 10 years and site `b` has 1 year, so I'd always like 11 rows in the summary. If some years get filtered out during the calculation, I'd like to keep those levels (to know that the event didn't occur that year), but I don't want the combinations of site and year that didn't appear in the data to be added as well.
So `.drop = TRUE` removes some rows that are needed and `.drop = FALSE` adds combinations which were not in the original data.
I know there are workarounds, like when `.drop` did not exist for a single factor, but not in a very clean way.
This is where I felt the further addition of an `expand` argument would be very useful.
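To make the mismatch concrete, here is a minimal sketch on a made-up four-row data set (not the 132-row example from this thread). The helper `count_events()` and the threshold `value > 65` are illustrative assumptions. It compares the number of observed site and year pairs with the number of summary rows under each `.drop` setting.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical data: site "a" has three years, site "b" has one,
# so there are 4 observed (site, year) pairs.
df <- tibble(
  site  = factor(c("a", "a", "a", "b")),
  year  = factor(c(2000, 2001, 2002, 2000)),
  value = c(70, 40, 80, 30)
)

# Pairs that actually occur in the data.
observed <- df %>%
  group_by(site, year) %>%
  group_keys()

# Hypothetical helper: filter, then summarise with a given `.drop` setting.
count_events <- function(drop) {
  df %>%
    filter(value > 65) %>%
    group_by(site, year, .drop = drop) %>%
    summarise(n = n(), .groups = "drop")
}

dropped  <- count_events(drop = TRUE)   # only groups that survive the filter
expanded <- count_events(drop = FALSE)  # every combination of factor levels

# Neither row count matches the number of observed pairs.
c(observed = nrow(observed), dropped = nrow(dropped), expanded = nrow(expanded))
```

With `.drop = TRUE` the groups emptied by the filter disappear, and with `.drop = FALSE` the result also contains pairs such as site `b` in 2001 that never occurred, which is the gap an `expand`-style argument would be meant to close.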