Description
For TPC-H query 17 there is an issue with scalar subqueries referencing fields from the parent query:
select
sum(l.l_extendedprice) / 7.0 as avg_yearly
from
"lineitem" l,
"part" p
where
p.p_partkey = l.l_partkey
and p.p_brand = 'Brand#13'
and p.p_container = 'JUMBO CAN'
and l.l_quantity < (
select
0.2 * avg(l2.l_quantity)
from
"lineitem" l2
where
# the following line throws an exception in Calcite
# java.lang.AssertionError: RexInputRef index 16 out of range 0..15
l2.l_partkey = p.p_partkey
)
The way this works in Calcite is that Calcite assigns a correlation variable
to the filter relation $cor0
and then it references the field p.p_partkey
via this correlation variable $cor0.p_partkey
.
Now normally with the Calcite RelBuilder one would assign a correlation variable to a relation in the following way:
RelBuilder relBuilder;
final Holder<@Nullable RexCorrelVariable> v = Holder.empty();
relBuilder.scan(...).variable(v::set).filter(... builder.field(v.get(), "p_partkey") ...).;
Now the challenge with substrait-java is that it has two distinct visitors: one for converting Substrait relations to Calcite relations SubstraitRelNodeConverter.java and one for converting Substrait expression to Calcite expressions ExpressionRexConverter.java. The expression conversion has currently no link bank to the relation it belongs to or the inputs of the relation.
Today, in Substrait the p_partkey
field is being described as an outer reference
field in the following snippet which shows the filter condition of the subquery. The first field is l2.l_partkey
and the second is p.p_partkey
:
"condition": {
"scalarFunction": {
"functionReference": 1,
"outputType": {
"bool": {
"nullability": "NULLABILITY_REQUIRED"
}
},
"arguments": [{
"value": {
"selection": {
"directReference": {
"structField": {
"field": 1
}
},
"rootReference": {
}
}
}
}, {
"value": {
"selection": {
"directReference": {
"structField": {
"field": 16
}
},
"outerReference": {
"stepsOut": 1
}
}
}
}]
}
}
These Substrait field references are turned into Calcite field references in the ExpressionRexConverter
which today can not handle outer references:
Solution
When the filter
relation is being processed, it needs to create a Calcite correlation variable and pass that down to the expression converter. They may be multiple ones of these therefore, a list is being used.
Note that the builder sequence needs to be split. As coded the original sequence is
relBuilder.scan(...).variable(v::set).filter(... builder.field(v.get(), "p_partkey") ...).;
The v::set
is called but after the expression has been visited. The solution here to split up the builder and the creation of the filterCondition
- it's the creation of the filter condition that references the creation of the RexFieldAccess
node.
Therefore, recode like this
RelBuilder r1 = relBuilder.push(input).variable(v::set); // correl variable is created here
RexNode filterCondition = filter.getCondition().accept(expressionRexConverter); // expression is access here but the variable is now already set
RelNode r1.filter(....).build()
This gives the basic Calcite relation structure as being correct; the list of variables also needs to be passed to the filter
node on create
RelBuilder r3 = r2.filter(List.of(v.get().id),filterCondition);
This then creates the SQL as expected.