Allow components of names_to in pivot_longer to be NA #793
Another example where this would have been useful is shown here: In that example we could have omitted the
Another example from SO can be found here: https://stackoverflow.com/questions/58785329/is-there-a-function-in-r-that-will-let-me-convert-a-dataset-into-long-format-b/58785660#58785660 In the first
Yeah, I like this idea, and it matches the syntax in
You can already use However, there's another problem that prevents this from working: there's currently no way to have a single
library(tidyr)
dat <- data.frame(
ID = c(21785L, 21785L, 21785L),
HR_1 = c(0.828273303, 6.404590021, 0.775568448),
Weekday_1 = c(2L, 3L, 2L),
HR_2 = c(NA, 1.122899914, 0.850113234),
Weekday_2 = c(NA, 4L, 3L),
HR_3 = c(NA, 0.866757168, 0.868943246),
Weekday_3 = c(NA, 5L, 4L),
HR_4 = c(NA, 0.563804788, 0.728656328),
Weekday_4 = c(NA, 6L, 5L)
)
dat %>% pivot_longer(-ID,
names_to = ".value",
names_pattern = "\\D+_(\\d+)"
)
#> New names:
#> * .value -> .value...2
#> * .value -> .value...3
#> Error: `spec` must have `.name` and `.value` columns
dat %>% pivot_longer(-ID,
names_to = c(".value", NA),
names_sep = "_"
)
#> Error: `spec` must have at least 3 columns
Created on 2019-11-24 by the reprex package (v0.3.0)
I'll need to think about that assumption more.
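To make the "at least 3 columns" error a bit more concrete, here is a minimal sketch of what a spec looks like when the numeric suffix is kept under a placeholder name. It assumes a tidyr version that exports `build_longer_spec()` and `pivot_longer_spec()`; the column name `idx` is made up for illustration and the data is a shortened copy of `dat` above.

```r
library(tidyr)

# Shortened copy of the data frame used in this thread.
dat <- data.frame(
  ID = c(21785L, 21785L, 21785L),
  HR_1 = c(0.828273303, 6.404590021, 0.775568448),
  Weekday_1 = c(2L, 3L, 2L),
  HR_2 = c(NA, 1.122899914, 0.850113234),
  Weekday_2 = c(NA, 4L, 3L)
)

# "idx" is a hypothetical placeholder name for the numeric suffix.
# The resulting spec has the two required columns (.name, .value)
# plus one extra key column, idx.
spec <- build_longer_spec(
  dat, -ID,
  names_to = c(".value", "idx"),
  names_sep = "_"
)
print(spec)

# Pivoting with that spec keeps idx as an ordinary column in the output.
long <- pivot_longer_spec(dat, spec)
print(long)
```

With the placeholder in place the spec has a third column, which appears to be what the check in the error message expects; the request in this issue is to avoid having to carry that column at all.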
Reprex:
library(tidyr)
url <- "https://raw.githubusercontent.com/bac3917/Cauldron/master/jazz.csv"
df2 <- readr::read_csv(url, col_types = list())
#> Warning: Missing column names filled in: 'X1' [1]
#> Warning: Duplicated column names deduplicated: 'X1' => 'X1_1' [2]
df2 <- df2[-1]
df2$X68 <- as.numeric(df2$X68) # fix non-numeric year data
#> Warning: NAs introduced by coercion
spec <- tibble(.name = setdiff(names(df2), "id"))
spec$.value <- rep(c("title", "year", "artist"), length = nrow(spec))
spec$seq <- rep(1:(nrow(spec) / 3), each = 3)
df2 %>% pivot_longer_spec(spec)
#> # A tibble: 1,150 x 5
#> id seq title year artist
#> <dbl> <int> <chr> <dbl> <chr>
#> 1 1 1 Sophisticated Lady / Tea… 1933 Art Tatum
#> 2 1 2 The Genius Of Art Tatum,… 1955 Art Tatum
#> 3 1 3 The Tatum Group Masterpi… 1964 Art Tatum / Lionel Hampton / Har…
#> 4 1 4 Live Sessions 1940 / 1941 1975 Art Tatum
#> 5 1 5 20th Century Piano Genius 1986 Art Tatum
#> 6 1 6 Jazz Masters (100 Ans De… 1998 Art Tatum
#> 7 1 7 The Art Tatum - Ben Webs… 2015 Art Tatum / Ben Webster
#> 8 1 8 El Gran Tatum NA Art Tatum
#> 9 1 9 Sweet Georgia Brown / Sh… 1945 Benny Goodman Quintet* / Esquire…
#> 10 1 10 The Immortal Live Sessio… 1975 Louis Armstrong
#> # … with 1,140 more rows
Created on 2019-12-06 by the reprex package (v0.3.0)
The problem is that we have to generate a sequence placeholder that is not useful afterwards. This seems somehow related to #792.
Yes, this is exactly the same problem as #792. With local experimental fix:
library(tidyr)
df <- data.frame(
ID = c(21785L, 21785L, 21785L),
HR_1 = c(0.828273303, 6.404590021, 0.775568448),
Weekday_1 = c(2L, 3L, 2L),
HR_2 = c(NA, 1.122899914, 0.850113234),
Weekday_2 = c(NA, 4L, 3L),
HR_3 = c(NA, 0.866757168, 0.868943246),
Weekday_3 = c(NA, 5L, 4L),
HR_4 = c(NA, 0.563804788, 0.728656328),
Weekday_4 = c(NA, 6L, 5L)
)
df %>% pivot_longer(-ID, names_to = ".value", names_pattern = "(\\D+)_\\d+")
#> # A tibble: 12 x 3
#> ID HR Weekday
#> <int> <dbl> <int>
#> 1 21785 0.828 2
#> 2 21785 NA NA
#> 3 21785 NA NA
#> 4 21785 NA NA
#> 5 21785 6.40 3
#> 6 21785 1.12 4
#> 7 21785 0.867 5
#> 8 21785 0.564 6
#> 9 21785 0.776 2
#> 10 21785 0.850 3
#> 11 21785 0.869 4
#> 12 21785 0.729 5
df %>% pivot_longer(-ID, names_to = c(".value", NA), names_sep = "_")
#> # A tibble: 12 x 3
#> ID HR Weekday
#> <int> <dbl> <int>
#> 1 21785 0.828 2
#> 2 21785 NA NA
#> 3 21785 NA NA
#> 4 21785 NA NA
#> 5 21785 6.40 3
#> 6 21785 1.12 4
#> 7 21785 0.867 5
#> 8 21785 0.564 6
#> 9 21785 0.776 2
#> 10 21785 0.850 3
#> 11 21785 0.869 4
#> 12 21785 0.729 5
df2 <- setNames(df, c("ID", "HR", "Weekday", "HR", "Weekday", "HR", "Weekday", "HR", "Weekday"))
df2 %>% pivot_longer(-ID, names_to = ".value")
#> Warning: Duplicate column names detected, adding .copy variable
#> # A tibble: 12 x 4
#> ID .copy HR Weekday
#> <int> <int> <dbl> <int>
#> 1 21785 1 0.828 2
#> 2 21785 2 NA NA
#> 3 21785 3 NA NA
#> 4 21785 4 NA NA
#> 5 21785 1 6.40 3
#> 6 21785 2 1.12 4
#> 7 21785 3 0.867 5
#> 8 21785 4 0.564 6
#> 9 21785 1 0.776 2
#> 10 21785 2 0.850 3
#> 11 21785 3 0.869 4
#> 12 21785 4 0.729 5
url <- "https://raw.githubusercontent.com/bac3917/Cauldron/master/jazz.csv"
df2 <- readr::read_csv(url, col_types = list())
#> Warning: Missing column names filled in: 'X1' [1]
#> Warning: Duplicated column names deduplicated: 'X1' => 'X1_1' [2]
df2 <- df2[-1]
df2$X68 <- as.numeric(df2$X68) # fix non-numeric year data
#> Warning: NAs introduced by coercion
names(df2) <- c(outer(c("title", "year", "artist"), 1:(ncol(df2) / 3), paste0), "id")
df2 %>% pivot_longer(-id, names_to = c(".value", NA), names_pattern = "(.*?)(\\d+)")
#> # A tibble: 1,150 x 4
#> id title year artist
#> <dbl> <chr> <dbl> <chr>
#> 1 1 Sophisticated Lady / Tea Fo… 1933 Art Tatum
#> 2 1 The Genius Of Art Tatum, No… 1955 Art Tatum
#> 3 1 The Tatum Group Masterpiece… 1964 Art Tatum / Lionel Hampton / Harry …
#> 4 1 Live Sessions 1940 / 1941 1975 Art Tatum
#> 5 1 20th Century Piano Genius 1986 Art Tatum
#> 6 1 Jazz Masters (100 Ans De Ja… 1998 Art Tatum
#> 7 1 The Art Tatum - Ben Webster… 2015 Art Tatum / Ben Webster
#> 8 1 El Gran Tatum NA Art Tatum
#> 9 1 Sweet Georgia Brown / Shiek… 1945 Benny Goodman Quintet* / Esquire Al…
#> 10 1 The Immortal Live Sessions … 1975 Louis Armstrong
#> # … with 1,140 more rows
names(df2) <- c(rep(c("title", "year", "artist"), (ncol(df2) - 1) / 3), "id")
df2 %>% pivot_longer(-id, names_to = ".value")
#> Warning: Duplicate column names detected, adding .copy variable
#> # A tibble: 1,150 x 5
#> id .copy title year artist
#> <dbl> <int> <chr> <dbl> <chr>
#> 1 1 1 Sophisticated Lady / Tea… 1933 Art Tatum
#> 2 1 2 The Genius Of Art Tatum,… 1955 Art Tatum
#> 3 1 3 The Tatum Group Masterpi… 1964 Art Tatum / Lionel Hampton / Har…
#> 4 1 4 Live Sessions 1940 / 1941 1975 Art Tatum
#> 5 1 5 20th Century Piano Genius 1986 Art Tatum
#> 6 1 6 Jazz Masters (100 Ans De… 1998 Art Tatum
#> 7 1 7 The Art Tatum - Ben Webs… 2015 Art Tatum / Ben Webster
#> 8 1 8 El Gran Tatum NA Art Tatum
#> 9 1 9 Sweet Georgia Brown / Sh… 1945 Benny Goodman Quintet* / Esquire…
#> 10 1 10 The Immortal Live Sessio… 1975 Louis Armstrong
#> # … with 1,140 more rows
Created on 2019-12-06 by the reprex package (v0.3.0)
This makes it more obvious that the behaviour with duplicate column names is inconsistent; it should probably work silently via the new special
Steps to finish this up:
As an aside, regarding your observation about base
giving the following which now seems to work:
In the example below, taken from https://stackoverflow.com/questions/58566740/pivot-by-group-for-unequal-data-size/58567045#58567045, we don't need the `Num` column, so it would be nice to be able to specify `NA` instead of giving a dummy name. Also, because that column is generated, `values_drop_na=TRUE` won't eliminate `NA` rows, and we need to add additional statements (`drop_na`, `select`) to drop them and remove that column. These could have been eliminated had NA components been available.