Skip to content

[isthmus] subqueries/set predicates with field references outside of the subquery fail #382

Open
@mbwhite

Description

@mbwhite

For TPC-H query 17 there is an issue with scalar subqueries referencing fields from the parent query:

select
  sum(l.l_extendedprice) / 7.0 as avg_yearly
from
  "lineitem" l,
  "part" p
where
  p.p_partkey = l.l_partkey
  and p.p_brand = 'Brand#13'
  and p.p_container = 'JUMBO CAN'
  and l.l_quantity < (
    select
      0.2 * avg(l2.l_quantity)
    from
      "lineitem" l2
    where
# the following line throws an exception in Calcite
# java.lang.AssertionError: RexInputRef index 16 out of range 0..15
      l2.l_partkey = p.p_partkey
  )

The way this works in Calcite is that Calcite assigns a correlation variable to the filter relation $cor0 and then it references the field p.p_partkey via this correlation variable $cor0.p_partkey.

Now normally with the Calcite RelBuilder one would assign a correlation variable to a relation in the following way:

RelBuilder relBuilder;

final Holder<@Nullable RexCorrelVariable> v = Holder.empty();

relBuilder.scan(...).variable(v::set).filter(...  builder.field(v.get(), "p_partkey") ...).;

Now the challenge with substrait-java is that it has two distinct visitors: one for converting Substrait relations to Calcite relations SubstraitRelNodeConverter.java and one for converting Substrait expression to Calcite expressions ExpressionRexConverter.java. The expression conversion has currently no link bank to the relation it belongs to or the inputs of the relation.

Today, in Substrait the p_partkey field is being described as an outer reference field in the following snippet which shows the filter condition of the subquery. The first field is l2.l_partkey and the second is p.p_partkey:

"condition": {
  "scalarFunction": {
    "functionReference": 1,
    "outputType": {
      "bool": {
        "nullability": "NULLABILITY_REQUIRED"
      }
    },
    "arguments": [{
      "value": {
        "selection": {
          "directReference": {
            "structField": {
             "field": 1
            }
          },
          "rootReference": {
          }
        }
      }
    }, {
      "value": {
        "selection": {
          "directReference": {
            "structField": {
              "field": 16
            }
          },
          "outerReference": {
            "stepsOut": 1
          }
        }
      }
    }]
  }
}

These Substrait field references are turned into Calcite field references in the ExpressionRexConverter which today can not handle outer references:

public RexNode visit(FieldReference expr) throws RuntimeException {
if (expr.isSimpleRootReference()) {
var segment = expr.segments().get(0);
RexInputRef rexInputRef;
if (segment instanceof FieldReference.StructField f) {
rexInputRef =
new RexInputRef(f.offset(), typeConverter.toCalcite(typeFactory, expr.getType()));
} else {
throw new IllegalArgumentException("Unhandled type: " + segment);
}
return rexInputRef;
}
return visitFallback(expr);
}

Solution

When the filter relation is being processed, it needs to create a Calcite correlation variable and pass that down to the expression converter. They may be multiple ones of these therefore, a list is being used.

Note that the builder sequence needs to be split. As coded the original sequence is

relBuilder.scan(...).variable(v::set).filter(...  builder.field(v.get(), "p_partkey") ...).;

The v::set is called but after the expression has been visited. The solution here to split up the builder and the creation of the filterCondition - it's the creation of the filter condition that references the creation of the RexFieldAccess node.

Therefore, recode like this

RelBuilder r1 = relBuilder.push(input).variable(v::set);  // correl variable is created here
RexNode filterCondition = filter.getCondition().accept(expressionRexConverter); // expression is access here but the variable is now already set
RelNode r1.filter(....).build()

This gives the basic Calcite relation structure as being correct; the list of variables also needs to be passed to the filter node on create

 RelBuilder r3 = r2.filter(List.of(v.get().id),filterCondition);

This then creates the SQL as expected.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions