
Make the current group size a variable updated by reference #6727

Merged

Conversation

@DavisVaughan DavisVaughan commented Feb 14, 2023

This PR:

  • Provides access to the current group size through $get_current_group_size()
  • Uses this helper in n(), which makes it faster (median 1.14s on main vs. 742ms with this PR in the benchmark below)

This also gives us an easy private way to access the current group size for #6685 (comment)

It also gets rid of the need to do parent.env(chops) internally, which always felt a little clunky to me.

library(dplyr)

set.seed(123)

# ~100k groups, so that many evaluations of `n()`
x <- sample(1e5, 1e6, replace = TRUE)
df <- tibble(x = x)
df <- group_by(df, x)
df
#> # A tibble: 1,000,000 × 1
#> # Groups:   x [99,995]
#>        x
#>    <int>
#>  1 51663
#>  2 57870
#>  3  2986
#>  4 29925
#>  5 95246
#>  6 68293
#>  7 62555
#>  8 45404
#>  9 65161
#> 10 46435
#> # … with 999,990 more rows

bench::mark(summarise(df, n = n()), iterations = 100)

# Main
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 × 6
#>   expression                  min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>             <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 summarise(df, n = n())    919ms    1.14s     0.873    7.15MB     13.8

# This PR
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 × 6
#>   expression                  min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>             <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 summarise(df, n = n())    596ms    742ms      1.35    7.16MB     15.1

Created on 2023-02-15 with reprex v2.0.2.9000
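
To put the two benchmark tables side by side, here is a small sketch of the arithmetic using only the median timings and the group count printed above. The variable names are made up for illustration, and attributing the whole difference to the per-group `n()` call is an assumption, since the benchmark measures all of `summarise()`.

```r
# Median timings copied from the reprex output above, in seconds.
median_main <- 1.14   # "# Main"
median_pr   <- 0.742  # "# This PR"

# Number of groups reported by the tibble header ("Groups: x [99,995]").
n_groups <- 99995

# Relative speedup of the whole summarise() call (roughly 1.5x).
speedup <- median_main / median_pr

# Time saved per group, in microseconds (about 4), assuming the whole
# difference comes from the one n() evaluation per group.
saved_per_group_us <- (median_main - median_pr) / n_groups * 1e6

speedup
saved_per_group_us
```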

@DavisVaughan DavisVaughan marked this pull request as ready for review February 15, 2023 13:47
R/data-mask.R Outdated
Comment on lines 26 to 31
private$env_current <- new_environment(data = list(
  `dplyr:::current_group_id` = 0L,
  `dplyr:::current_group_size` = 0L
))

private$chops <- .Call(dplyr_lazy_vec_chop_impl, data, rows, private$env_current, private$grouped, private$rowwise)
We previously created an environment much like env_current from inside dplyr_lazy_vec_chop_impl and set it as the parent env of chops, but that meant we always had to access that env on the R side through parent.env(private$chops), which was a little clunky.

Creating the env on the R side and passing it in seems a little cleaner, and means we have easy access to env_current and its "current" group information variables for use in n() and cur_group_id().
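
The pattern described above can be sketched in base R. This is a simplified illustration, not dplyr's code: `make_mask()`, `set_current_group()`, and the unprefixed variable names are hypothetical, and plain R assignment stands in for the update that dplyr performs at the C level.

```r
# Hypothetical stand-in for the data mask's private state (not dplyr's code).
make_mask <- function(group_sizes) {
  # The environment holding the "current group" variables is created on the
  # R side, so the accessors below can reach it directly.
  env_current <- new.env(parent = emptyenv())
  env_current[["current_group_id"]] <- 0L
  env_current[["current_group_size"]] <- 0L

  list(
    # Assumption: in dplyr this update happens in C; here R assignment does it.
    set_current_group = function(i) {
      env_current[["current_group_id"]] <- i
      env_current[["current_group_size"]] <- group_sizes[[i]]
      invisible(NULL)
    },
    get_current_group_id = function() env_current[["current_group_id"]],
    get_current_group_size = function() env_current[["current_group_size"]]
  )
}

mask <- make_mask(group_sizes = c(3L, 5L, 2L))

# Walk over the groups and read the current size through the accessor.
sizes <- integer()
for (i in 1:3) {
  mask$set_current_group(i)
  sizes <- c(sizes, mask$get_current_group_size())
}
sizes
```

Because the accessors close over `env_current` directly, nothing needs to go through `parent.env()` to find the current group information.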

Nice improvement. I agree this makes the code clearer.

Can we find a more informative name? current_group_info?

R/data-mask.R Outdated
Comment on lines 129 to 166
get_current_group_id = function() {
  # `dplyr:::current_group_id` is modified by reference at the C level.
  # If the result of `get_current_group_id()` is used in a persistent way
  # (like in `cur_group_id()`), then it must be duplicated on the way out.
  private[["env_current"]][["dplyr:::current_group_id"]]
},

get_current_group_size = function() {
  # `dplyr:::current_group_size` is modified by reference at the C level.
  # If the result of `get_current_group_size()` is used in a persistent way
  # (like in `n()`), then it must be duplicated on the way out.
  private[["env_current"]][["dplyr:::current_group_size"]]
},
Important to note that these return the exact reference to the object we modify by reference at the C level.

This is why we duplicate() the result before returning from n() and cur_group_id(). If we returned the variable without duplicating it, a value the user had already stored could later be modified by reference when the C code moves on to the next group.
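
The hazard can be illustrated in base R with an analogy. Ordinary R vectors cannot be mutated in place from R code, so this sketch uses an environment (which does have reference semantics) as a stand-in for the integer vector that dplyr updates at the C level. All names here are hypothetical.

```r
# Analogy only: an environment plays the role of the C-mutated vector.
current <- new.env(parent = emptyenv())
current$size <- 0L

get_reference <- function() current      # hands out the shared object itself
get_snapshot  <- function() current$size # returns the value bound at call time

by_reference <- list()
by_snapshot  <- integer()

for (size in c(3L, 5L, 2L)) {
  current$size <- size                   # stands in for the C-level update
  by_reference[[length(by_reference) + 1L]] <- get_reference()
  by_snapshot <- c(by_snapshot, get_snapshot())
}

# Every stored reference now reports the last group's size.
sizes_by_reference <- vapply(by_reference, function(e) e$size, integer(1))
sizes_by_reference
#> [1] 2 2 2

# The snapshots kept the value that was current when each was taken.
by_snapshot
#> [1] 3 5 2
```

Duplicating the value on the way out of `n()` and `cur_group_id()` corresponds to the snapshot case: each result is decoupled from the shared object before the next update.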

@DavisVaughan DavisVaughan requested a review from lionel- February 15, 2023 14:05
@DavisVaughan DavisVaughan force-pushed the feature/faster-group-size-detection branch from 40bdc3b to a7f05dc Compare February 16, 2023 20:26
@DavisVaughan DavisVaughan merged commit f4d36fb into tidyverse:main Feb 16, 2023
@DavisVaughan DavisVaughan deleted the feature/faster-group-size-detection branch February 16, 2023 20:52