Allow components of names_to in pivot_longer to be NA #793

ggrothendieck · 2019-10-26T14:08:37Z

In the example below taken from https://stackoverflow.com/questions/58566740/pivot-by-group-for-unequal-data-size/58567045#58567045 we don't need the Num column so it would be nice to be able to specify NA instead of giving a dummy name. Also because that column is generated values_drop_na=TRUE won't eliminate NA rows and we need to add additional statements (drop_na, select) to drop them and remove that column. These could have been eliminated had NA components been available.

dat <- structure(list(ID = c(21785L, 21785L, 21785L), HR1 = c(0.828273303, 
6.404590021, 0.775568448), Weekday1 = c(2L, 3L, 2L), HR2 = c(NA, 
1.122899914, 0.850113234), Weekday2 = c(NA, 4L, 3L), HR3 = c(NA, 
0.866757168, 0.868943246), Weekday3 = c(NA, 5L, 4L), HR4 = c(NA, 
0.563804788, 0.728656328), Weekday4 = c(NA, 6L, 5L), HR5 = c(NA, 
0.888109208, 0.823803733), Weekday5 = c(NA, 7L, 6L), HR6 = c(NA, 
0.578834113, 0.863467391), Weekday6 = c(NA, 1L, 7L), HR7 = c(NA, 
NA, 0.939920869), Weekday7 = c(NA, 2, 8)), row.names = c(5L, 
163L, 167L), class = "data.frame")

library(dplyr)
library(tidyr)

dat %>% pivot_longer(-ID,
   names_to = c(".value", "Num"),  # names_to = c(".value", NA),
   names_pattern = "(\\D+)(\\d+)") %>%   
   drop_na %>% # would not need this line if NA in names_to were allowed
   select(-Num) # would not need this line if NA in names_to were allowed

The text was updated successfully, but these errors were encountered:

ggrothendieck · 2019-11-11T01:05:09Z

Another example where this would have been useful is shown here:
https://stackoverflow.com/questions/58792882/appending-values-in-a-column-according-to-row-index/58794307#58794307

In that example we could have omitted the select(-name) had we been able to write names_to = NA .

ggrothendieck · 2019-11-11T12:20:01Z

Another example from SO can be found here: https://stackoverflow.com/questions/58785329/is-there-a-function-in-r-that-will-let-me-convert-a-dataset-into-long-format-b/58785660#58785660

In the first pivot_longer in that example if we could have written to_names = NA we could have omitted the subsequent select(-name) making the pipeline more compact and eliminating the need for generating a column which is never used.

hadley · 2019-11-13T21:59:16Z

Yeah, I like this idea, and it matches the syntax in separate()

hadley · 2019-11-24T21:26:34Z

You can already use NA in names, provided that you're using names_sep because it uses str_separate() which already supports this behaviour. I'll also add it to str_extract() for consistency, but note that you don't really use it since you can just drop the () grouping.

However, there's another problem that prevents this from working: there's currently no way to have a single names_to = ".value":

library(tidyr)

dat <- data.frame(
  ID = c(21785L, 21785L, 21785L),
  HR_1 = c(0.828273303, 6.404590021, 0.775568448), 
  Weekday_1 = c(2L, 3L, 2L), 
  HR_2 = c(NA, 1.122899914, 0.850113234),
  Weekday_2 = c(NA, 4L, 3L), 
  HR_3 = c(NA, 0.866757168, 0.868943246), 
  Weekday_3 = c(NA, 5L, 4L), 
  HR_4 = c(NA, 0.563804788, 0.728656328), 
  Weekday_4 = c(NA, 6L, 5L)
)

dat %>% pivot_longer(-ID, 
  names_to = ".value", 
  names_pattern = "\\D+_(\\d+)"
)
#> New names:
#> * .value -> .value...2
#> * .value -> .value...3
#> Error: `spec` must have `.name` and `.value` columns

dat %>% pivot_longer(-ID,
  names_to = c(".value", NA),
  names_sep = "_"
)
#> Error: `spec` must have at least 3 columns

^{Created on 2019-11-24 by the reprex package (v0.3.0)}

I'll need to think about that assumption more.

Part of #793

ggrothendieck · 2019-12-02T20:57:59Z

Here is another example: https://stackoverflow.com/questions/59145663/how-to-restructure-very-wide-dataframes-with-dplyr-using-an-index/59145939#59145939

hadley · 2019-12-06T22:52:41Z

Reprex:

library(tidyr)

url <- "https://raw.githubusercontent.com/bac3917/Cauldron/master/jazz.csv"
df2 <- readr::read_csv(url, col_types = list())
#> Warning: Missing column names filled in: 'X1' [1]
#> Warning: Duplicated column names deduplicated: 'X1' => 'X1_1' [2]
df2 <- df2[-1]
df2$X68 <- as.numeric(df2$X68) # fix non-numeric year data
#> Warning: NAs introduced by coercion

spec <- tibble(.name = setdiff(names(df2), "id"))
spec$.value <- rep(c("title", "year", "artist"), length = nrow(spec))
spec$seq <- rep(1:(nrow(spec) / 3), each = 3)

df2 %>% pivot_longer_spec(spec)
#> # A tibble: 1,150 x 5
#>       id   seq title                      year artist                           
#>    <dbl> <int> <chr>                     <dbl> <chr>                            
#>  1     1     1 Sophisticated Lady / Tea…  1933 Art Tatum                        
#>  2     1     2 The Genius Of Art Tatum,…  1955 Art Tatum                        
#>  3     1     3 The Tatum Group Masterpi…  1964 Art Tatum / Lionel Hampton / Har…
#>  4     1     4 Live Sessions 1940 / 1941  1975 Art Tatum                        
#>  5     1     5 20th Century Piano Genius  1986 Art Tatum                        
#>  6     1     6 Jazz Masters (100 Ans De…  1998 Art Tatum                        
#>  7     1     7 The Art Tatum - Ben Webs…  2015 Art Tatum / Ben Webster          
#>  8     1     8 El Gran Tatum                NA Art Tatum                        
#>  9     1     9 Sweet Georgia Brown / Sh…  1945 Benny Goodman Quintet* / Esquire…
#> 10     1    10 The Immortal Live Sessio…  1975 Louis Armstrong                  
#> # … with 1,140 more rows

^{Created on 2019-12-06 by the reprex package (v0.3.0)}

The problem is that we have to generate a sequence placeholder that is not useful afterwards. This seems somehow related to #792.

hadley · 2019-12-06T23:33:59Z

Yes, this is exactly the same problem as #792. With local experimental fix:

library(tidyr)

df <- data.frame(
  ID = c(21785L, 21785L, 21785L),
  HR_1 = c(0.828273303, 6.404590021, 0.775568448), 
  Weekday_1 = c(2L, 3L, 2L), 
  HR_2 = c(NA, 1.122899914, 0.850113234),
  Weekday_2 = c(NA, 4L, 3L), 
  HR_3 = c(NA, 0.866757168, 0.868943246), 
  Weekday_3 = c(NA, 5L, 4L), 
  HR_4 = c(NA, 0.563804788, 0.728656328), 
  Weekday_4 = c(NA, 6L, 5L)
)

df %>% pivot_longer(-ID, names_to = ".value", names_pattern = "(\\D+)_\\d+")
#> # A tibble: 12 x 3
#>       ID     HR Weekday
#>    <int>  <dbl>   <int>
#>  1 21785  0.828       2
#>  2 21785 NA          NA
#>  3 21785 NA          NA
#>  4 21785 NA          NA
#>  5 21785  6.40        3
#>  6 21785  1.12        4
#>  7 21785  0.867       5
#>  8 21785  0.564       6
#>  9 21785  0.776       2
#> 10 21785  0.850       3
#> 11 21785  0.869       4
#> 12 21785  0.729       5
df %>% pivot_longer(-ID, names_to = c(".value", NA), names_sep = "_")
#> # A tibble: 12 x 3
#>       ID     HR Weekday
#>    <int>  <dbl>   <int>
#>  1 21785  0.828       2
#>  2 21785 NA          NA
#>  3 21785 NA          NA
#>  4 21785 NA          NA
#>  5 21785  6.40        3
#>  6 21785  1.12        4
#>  7 21785  0.867       5
#>  8 21785  0.564       6
#>  9 21785  0.776       2
#> 10 21785  0.850       3
#> 11 21785  0.869       4
#> 12 21785  0.729       5

df2 <- setNames(df, c("ID", "HR", "Weekday", "HR", "Weekday", "HR", "Weekday", "HR", "Weekday"))
df2 %>% pivot_longer(-ID, names_to = ".value")
#> Warning: Duplicate column names detected, adding .copy variable
#> # A tibble: 12 x 4
#>       ID .copy     HR Weekday
#>    <int> <int>  <dbl>   <int>
#>  1 21785     1  0.828       2
#>  2 21785     2 NA          NA
#>  3 21785     3 NA          NA
#>  4 21785     4 NA          NA
#>  5 21785     1  6.40        3
#>  6 21785     2  1.12        4
#>  7 21785     3  0.867       5
#>  8 21785     4  0.564       6
#>  9 21785     1  0.776       2
#> 10 21785     2  0.850       3
#> 11 21785     3  0.869       4
#> 12 21785     4  0.729       5

url <- "https://raw.githubusercontent.com/bac3917/Cauldron/master/jazz.csv"
df2 <- readr::read_csv(url, col_types = list())
#> Warning: Missing column names filled in: 'X1' [1]
#> Warning: Duplicated column names deduplicated: 'X1' => 'X1_1' [2]
df2 <- df2[-1]
df2$X68 <- as.numeric(df2$X68) # fix non-numeric year data
#> Warning: NAs introduced by coercion

names(df2) <- c(outer(c("title", "year", "artist"), 1:(ncol(df2) / 3), paste0), "id")
df2 %>% pivot_longer(-id, names_to = c(".value", NA), names_pattern = "(.*?)(\\d+)")
#> # A tibble: 1,150 x 4
#>       id title                         year artist                              
#>    <dbl> <chr>                        <dbl> <chr>                               
#>  1     1 Sophisticated Lady / Tea Fo…  1933 Art Tatum                           
#>  2     1 The Genius Of Art Tatum, No…  1955 Art Tatum                           
#>  3     1 The Tatum Group Masterpiece…  1964 Art Tatum / Lionel Hampton / Harry …
#>  4     1 Live Sessions 1940 / 1941     1975 Art Tatum                           
#>  5     1 20th Century Piano Genius     1986 Art Tatum                           
#>  6     1 Jazz Masters (100 Ans De Ja…  1998 Art Tatum                           
#>  7     1 The Art Tatum - Ben Webster…  2015 Art Tatum / Ben Webster             
#>  8     1 El Gran Tatum                   NA Art Tatum                           
#>  9     1 Sweet Georgia Brown / Shiek…  1945 Benny Goodman Quintet* / Esquire Al…
#> 10     1 The Immortal Live Sessions …  1975 Louis Armstrong                     
#> # … with 1,140 more rows

names(df2) <- c(rep(c("title", "year", "artist"), (ncol(df2) - 1) / 3), "id")
df2 %>% pivot_longer(-id, names_to = ".value")
#> Warning: Duplicate column names detected, adding .copy variable
#> # A tibble: 1,150 x 5
#>       id .copy title                      year artist                           
#>    <dbl> <int> <chr>                     <dbl> <chr>                            
#>  1     1     1 Sophisticated Lady / Tea…  1933 Art Tatum                        
#>  2     1     2 The Genius Of Art Tatum,…  1955 Art Tatum                        
#>  3     1     3 The Tatum Group Masterpi…  1964 Art Tatum / Lionel Hampton / Har…
#>  4     1     4 Live Sessions 1940 / 1941  1975 Art Tatum                        
#>  5     1     5 20th Century Piano Genius  1986 Art Tatum                        
#>  6     1     6 Jazz Masters (100 Ans De…  1998 Art Tatum                        
#>  7     1     7 The Art Tatum - Ben Webs…  2015 Art Tatum / Ben Webster          
#>  8     1     8 El Gran Tatum                NA Art Tatum                        
#>  9     1     9 Sweet Georgia Brown / Sh…  1945 Benny Goodman Quintet* / Esquire…
#> 10     1    10 The Immortal Live Sessio…  1975 Louis Armstrong                  
#> # … with 1,140 more rows

^{Created on 2019-12-06 by the reprex package (v0.3.0)}

This makes it more obvious that the behaviour with duplicate column names is inconsistent; it should probably work silently via the new special .seq variable.

hadley · 2019-12-06T23:37:46Z

Steps to finish this up:

Write unit tests
Update duplicated variable name behaviour
Can we move .seq generation in to check_spec()?
Document new special .seq variable
Document use of NA in names_to

ggrothendieck · 2019-12-07T17:30:42Z

As an aside, regarding your observation about base reshape not working correctly the problem seems to be that base reshape gets confused if there are multiple columns witih the same name. It does seem work to if we make them unique.

# d is same as above
d <- structure(list(country = c("Austria", "Belgium", "Bulgaria", 
"Estonia", "France", "Germany", "Switzerland"), A = c(NA, -2.58060150400384, 
-0.0623757258909573, 0.267776001395166, NA, NA, 0.048219924249952
), A = c(NA, -0.249416955035044, 1.3525450891501, 1.04446768824697, 
NA, -0.0582347596434839, -0.891400228849837), A = c(1.82469277697851, 
NA, NA, 1.04231605324821, NA, -0.900145118946308, -1.19320727433597
), A = c(NA, NA, NA, NA, NA, NA, NA), B = c(NA, NA, NA, NA, NA, 
NA, NA), B = c(0.633712375393134, NA, 1.24760861316098, -0.092964787061478, 
-0.59403260962332, NA, -0.650348234181285), B = c(0.587318286831079, 
NA, NA, 0.348890470222513, NA, NA, NA), B = c(0.0645316087966406, 
-0.279456557428068, NA, NA, -0.0627400036074545, 1.30419117694731, 
-0.484654596062051)), row.names = c(NA, -7L), class = "data.frame")

# dd is like d except it has unique column names
dd <- d 
names(dd) <- make.unique(names(dd))

names1 <- names(dd)[-1]
v <- split(names1, substr(names1, 1, 1))
long <- reshape(dd, dir = "long", varying = v, v.names = names(v))
long

giving the following which now seems to work:

        country time           A           B id
1.1     Austria    1          NA          NA  1
2.1     Belgium    1 -2.58060150          NA  2
3.1    Bulgaria    1 -0.06237573          NA  3
4.1     Estonia    1  0.26777600          NA  4
5.1      France    1          NA          NA  5
6.1     Germany    1          NA          NA  6
7.1 Switzerland    1  0.04821992          NA  7
1.2     Austria    2          NA  0.63371238  1
2.2     Belgium    2 -0.24941696          NA  2
3.2    Bulgaria    2  1.35254509  1.24760861  3
4.2     Estonia    2  1.04446769 -0.09296479  4
5.2      France    2          NA -0.59403261  5
6.2     Germany    2 -0.05823476          NA  6
7.2 Switzerland    2 -0.89140023 -0.65034823  7
1.3     Austria    3  1.82469278  0.58731829  1
2.3     Belgium    3          NA          NA  2
3.3    Bulgaria    3          NA          NA  3
4.3     Estonia    3  1.04231605  0.34889047  4
5.3      France    3          NA          NA  5
6.3     Germany    3 -0.90014512          NA  6
7.3 Switzerland    3 -1.19320727          NA  7
1.4     Austria    4          NA  0.06453161  1
2.4     Belgium    4          NA -0.27945656  2
3.4    Bulgaria    4          NA          NA  3
4.4     Estonia    4          NA          NA  4
5.4      France    4          NA -0.06274000  5
6.4     Germany    4          NA  1.30419118  6
7.4 Switzerland    4          NA -0.48465460  7

hadley added feature a feature request or enhancement pivoting ♻️ pivot rectangular data to different "shapes" labels Nov 13, 2019

hadley added a commit that referenced this issue Nov 24, 2019

Support missing into names in extract()

dee0017

Part of #793

hadley added this to the v1.1.0 milestone Nov 28, 2019

hadley mentioned this issue Dec 6, 2019

Allow pivot_longer() names_to to be list #821

Closed

hadley closed this as completed in e1b548b Dec 7, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow components of names_to in pivot_longer to be NA #793

Allow components of names_to in pivot_longer to be NA #793

ggrothendieck commented Oct 26, 2019

ggrothendieck commented Nov 11, 2019

ggrothendieck commented Nov 11, 2019

hadley commented Nov 13, 2019

hadley commented Nov 24, 2019

ggrothendieck commented Dec 2, 2019

hadley commented Dec 6, 2019 •

edited

Loading

hadley commented Dec 6, 2019

hadley commented Dec 6, 2019 •

edited

Loading

ggrothendieck commented Dec 7, 2019 •

edited

Loading

Allow components of names_to in pivot_longer to be NA #793

Allow components of names_to in pivot_longer to be NA #793

Comments

ggrothendieck commented Oct 26, 2019

ggrothendieck commented Nov 11, 2019

ggrothendieck commented Nov 11, 2019

hadley commented Nov 13, 2019

hadley commented Nov 24, 2019

ggrothendieck commented Dec 2, 2019

hadley commented Dec 6, 2019 • edited Loading

hadley commented Dec 6, 2019

hadley commented Dec 6, 2019 • edited Loading

ggrothendieck commented Dec 7, 2019 • edited Loading

hadley commented Dec 6, 2019 •

edited

Loading

hadley commented Dec 6, 2019 •

edited

Loading

ggrothendieck commented Dec 7, 2019 •

edited

Loading