Skip to content

Add support for CROSS JOIN through full_join over non-overlapping columns #24

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Oct 26, 2017

Conversation

javierluraschi
Copy link
Contributor

Fix tidyverse/dplyr#2924 by adding support for cross joins.

Currently,

d1_mem <- memdb_frame(a = 1:5)
d2_mem <- memdb_frame(b = 1:5)
d1_mem %>% full_join(d2_mem)

Triggers: Error: by required, because the data sources have no common variables.

This PR implements CROSS JOINS support for full_join over non-overlapping columns.

R/tbl-lazy.R Outdated
@@ -201,7 +201,14 @@ distinct_.tbl_lazy <- function(.data, ..., .dots = list(), .keep_all = FALSE) {
add_op_join <- function(x, y, type, by = NULL, copy = FALSE,
suffix = c(".x", ".y"),
auto_index = FALSE, ...) {
by <- common_by(by, x, y)
by_intersect <- intersect(tbl_vars(x), tbl_vars(y))
Copy link
Contributor

@ianmcook ianmcook Jun 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't handle the case of a full join where by is a named vector that maps the names of the join columns in the left table to the join columns in the right table, as in:

full_join(tbl_left, tbl_right, by = c("xl" = "xr"))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks; 6e5daa9 fixes it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the reference, no change here.

@ianmcook
Copy link
Contributor

Let's please discuss at tidyverse/dplyr#2924 whether this is the best approach for implementing cross joins.

@hadley
Copy link
Member

hadley commented Oct 23, 2017

I think it's ok to merge this as a short term work around for the absence of full_join().

R/sql-generic.R Outdated
stop("Unknown join type:", type, call. = FALSE)
)

select <- sql_join_vars(con, vars)

on <- sql_vector(
on <- if (length(by$x) + length(by$y) <= 0) NULL else sql_vector(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please pull this out into:

if (...) {
  on <- ...
} else {
  on <- ...
}

Or maybe even consider pulling out into own function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

R/tbl-lazy.R Outdated
@@ -201,7 +201,14 @@ distinct_.tbl_lazy <- function(.data, ..., .dots = list(), .keep_all = FALSE) {
add_op_join <- function(x, y, type, by = NULL, copy = FALSE,
suffix = c(".x", ".y"),
auto_index = FALSE, ...) {
by <- common_by(by, x, y)
by_intersect <- intersect(tbl_vars(x), tbl_vars(y))
by <- if (length(by_intersect) > 0 || !identical(type, "full") || !is.null(by)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly here - I really don't like this style of if-else because it's easy to miss that the entire block gets assigned to some variable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

@@ -92,3 +92,8 @@ test_that("sql generated correctly for all sources", {

expect_equal_tbls(xy)
})

test_that("full join is promoted to cross join for no overlapping variables", {
result <- df1 %>% full_join(df2) %>% collect()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like making this the default because if you accidentally supply the wrong data frames you might create a query that takes hours to run. Could we make it explicit, i.e. by = character()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

@hadley hadley merged commit 3316463 into tidyverse:master Oct 26, 2017
@hadley
Copy link
Member

hadley commented Oct 26, 2017

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants