Add support for `CROSS JOIN` through `full_join` over non-overlapping columns #24

javierluraschi · 2017-06-28T06:26:37Z

Fix tidyverse/dplyr#2924 by adding support for cross joins.

Currently,

d1_mem <- memdb_frame(a = 1:5)
d2_mem <- memdb_frame(b = 1:5)
d1_mem %>% full_join(d2_mem)

Triggers: `Error: by required, because the data sources have no common variables.`

This PR implements `CROSS JOIN` support for `full_join` over non-overlapping columns.
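For reference, the result a cross join is expected to produce for the two tables above can be sketched with plain data frames. This is only an illustration in base R, not the dbplyr code path this PR changes: `merge()` returns the Cartesian product when `by` has length zero, so the two 5-row inputs should yield 25 rows. The names `d1`, `d2`, and `crossed` are made up for the example.

```r
# Illustration only: base R data frames stand in for the memdb_frame() tables.
d1 <- data.frame(a = 1:5)
d2 <- data.frame(b = 1:5)

# merge() with a zero-length `by` returns the Cartesian product,
# i.e. every row of d1 paired with every row of d2.
crossed <- merge(d1, d2, by = NULL)

nrow(crossed)   # number of row pairs
names(crossed)  # columns from both inputs
```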

ianmcook · 2017-06-28T11:28:22Z

R/tbl-lazy.R

@@ -201,7 +201,14 @@ distinct_.tbl_lazy <- function(.data, ..., .dots = list(), .keep_all = FALSE) {
 add_op_join <- function(x, y, type, by = NULL, copy = FALSE,
                        suffix = c(".x", ".y"),
                        auto_index = FALSE, ...) {
-  by <- common_by(by, x, y)
+  by_intersect <- intersect(tbl_vars(x), tbl_vars(y))


This doesn't handle the case of a full join where `by` is a named vector that maps the names of the join columns in the left table to the join columns in the right table, as in:

full_join(tbl_left, tbl_right, by = c("xl" = "xr"))

Thanks; 6e5daa9 fixes it

Thanks for the reference, no change here.

ianmcook · 2017-06-28T19:56:26Z

Let's please discuss at tidyverse/dplyr#2924 whether this is the best approach for implementing cross joins.

hadley · 2017-10-23T18:35:19Z

I think it's OK to merge this as a short-term workaround for the absence of full_join().

hadley · 2017-10-23T20:29:40Z

R/sql-generic.R

    stop("Unknown join type:", type, call. = FALSE)
  )

  select <- sql_join_vars(con, vars)

-  on <- sql_vector(
+  on <- if (length(by$x) + length(by$y) <= 0) NULL else sql_vector(


Can you please pull this out into:

if (...) { on <- ... } else { on <- ... }

Or maybe even consider pulling out into own function?
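A rough sketch of the requested shape, pulled out into its own function with an explicit `if`/`else` that assigns `on` in each branch. The helper name `build_on_clause`, the `LHS`/`RHS` aliases, and the plain `paste0()` string building are assumptions made for illustration; the real code builds the clause with `sql_vector()` and proper identifier escaping.

```r
# Hypothetical helper, not dbplyr's implementation: it only shows the
# explicit if/else style, with simplified (unescaped) SQL text.
build_on_clause <- function(by) {
  if (length(by$x) + length(by$y) == 0) {
    # No join columns: a cross join has no ON clause.
    on <- NULL
  } else {
    # One equality per column pair, combined with AND.
    on <- paste0("LHS.", by$x, " = RHS.", by$y, collapse = " AND ")
  }
  on
}

build_on_clause(list(x = character(), y = character()))
build_on_clause(list(x = "xl", y = "xr"))
```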

hadley · 2017-10-23T20:30:26Z

R/tbl-lazy.R

@@ -201,7 +201,14 @@ distinct_.tbl_lazy <- function(.data, ..., .dots = list(), .keep_all = FALSE) {
 add_op_join <- function(x, y, type, by = NULL, copy = FALSE,
                        suffix = c(".x", ".y"),
                        auto_index = FALSE, ...) {
-  by <- common_by(by, x, y)
+  by_intersect <- intersect(tbl_vars(x), tbl_vars(y))
+  by <- if (length(by_intersect) > 0 || !identical(type, "full") || !is.null(by)) {


Similarly here - I really don't like this style of if-else because it's easy to miss that the entire block gets assigned to some variable.

hadley · 2017-10-23T20:31:25Z

tests/testthat/test-joins.R

@@ -92,3 +92,8 @@ test_that("sql generated correctly for all sources", {

  expect_equal_tbls(xy)
 })
+
+test_that("full join is promoted to cross join for no overlapping variables", {
+  result <- df1 %>% full_join(df2) %>% collect()


I don't like making this the default because if you accidentally supply the wrong data frames you might create a query that takes hours to run. Could we make it explicit, i.e. by = character()?
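One way to picture the explicit opt-in is a small resolver that treats `by = character()` as a request for a cross join, while a missing `by` with no common columns still fails. This is a sketch under assumptions: `resolve_join_by` and its arguments are hypothetical names, and the merged code may structure the check differently.

```r
# Hypothetical sketch of the explicit opt-in; not the code merged in this PR.
resolve_join_by <- function(by, x_vars, y_vars) {
  if (is.character(by) && length(by) == 0) {
    # Explicit request: empty `by` means cross join (no join columns).
    return(list(x = character(), y = character()))
  }
  if (is.null(by)) {
    common <- intersect(x_vars, y_vars)
    if (length(common) == 0) {
      # Keep the existing safeguard against accidental Cartesian products.
      stop("by required, because the data sources have no common variables",
           call. = FALSE)
    }
    return(list(x = common, y = common))
  }
  # Named vector such as c("xl" = "xr"): names are left-hand columns,
  # values are right-hand columns; unnamed entries map to themselves.
  y <- unname(by)
  x <- names(by)
  if (is.null(x)) {
    x <- y
  } else {
    x[x == ""] <- y[x == ""]
  }
  list(x = x, y = y)
}

resolve_join_by(character(), "a", "b")
resolve_join_by(c("xl" = "xr"), c("xl", "v"), c("xr", "w"))
```

With this shape, `full_join(d1, d2)` on tables without shared columns still raises the error, and only `full_join(d1, d2, by = character())` would be promoted to a cross join.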

hadley · 2017-10-26T18:26:31Z

Thanks!

javierluraschi added 5 commits June 27, 2017 23:11

add support for cross joins

a6a96af

update news

0785614

add cross join test

f23f7ea

support cross joins from with multiple columns

d07fbf3

reference issue in news

7f34383

javierluraschi mentioned this pull request Jun 28, 2017

full_join does not support cartesian sparklyr/sparklyr#771

Closed

ianmcook reviewed Jun 28, 2017


javierluraschi added 2 commits June 28, 2017 12:22

fix cross join test

b8994ea

only promote to cross join when by-param is not specified

6e5daa9

hadley reviewed Oct 23, 2017


javierluraschi added 3 commits October 25, 2017 11:36

Merge remote-tracking branch 'upstream/master' into feature/cross-join

b21a941

address pr feedback

6b8fd1b

promote to cross only when by is empty character

13878f9

hadley merged commit 3316463 into tidyverse:master Oct 26, 2017
