Skip to content

[ISTHMUS] Out-of-spec plans returned by Isthmus #18

Open
@jvanstraten

Description

@jvanstraten

While testing the validator against Isthmus output for TPC-H, I found a number of issues where Isthmus disagrees with the spec as written on the website. The ones I found are:

  • The grouping set index column appended by the aggregate relation isn't removed from the output, resulting in excess columns both functionally and in the RelRoot column name list. Failing example: select l_returnflag, count(*) from lineitem group by l_returnflag. The easy solution is just to generate an emit clause for all aggregate relations.
  • Decimal literals are generated with 38 rather than 16 bytes. Failing example: select 0.1
  • Negative decimal literals are not sign extended all the way to the MSB. Failing example: select -0.1, which yields out-of-bounds value 25.5 for P=2 S=1.
  • For whatever reason, some decimal literals are just emitted as zero. Failing example: select 000000.000001.
  • The emitted function signatures don't follow the table here for the abbreviated names. For example, decimal is used instead of dec, and any1 is used in place of just any. Failing example: literally anything with a function, such as select 0.1 + 1.0, which generates add:opt_decimal_decimal instead of add:opt_dec_dec.
  • Virtual tables receive a schema representing a single column of structs containing the fields. Failing example: select 1 is given the schema STRUCT<STRUCT<i32>>. The type itself is also invalid because of this, because only one field name is attached when two would be needed for that type. It should be NSTRUCT<EXPR$0: i32> here.
  • Arguably this is not a spec violation (the spec never mentions how this should be done), but using anchor value 0 is IMO a really bad idea, because on the reference side it's indistinguishable from not referencing anything. For type variations it would actually be an issue, because "no variation" would be indistinguishable from the type variation with anchor 0, and if any optional references are ever added for the other anchor types it would become a problem for them too. It's also confusing when reading a plan where the reference is "optimized out" due to having the default value, and it's easy to forget to add a reference because nothing can check for this error. Simply starting the counter from 1 instead of 0 prevents all this and barely costs anything (specifically, it costs ~0.00000003% of the anchor code space, and a couple bytes in the binary serialization).

I haven't touched anything java for about a decade, so I'm probably not the right person to propose fixes for these things.

Here's a script to easily run Isthmus and generate a validation report for the result. Just run pip install . at https://github.com/jvanstraten/substrait-validator/tree/initial/py to install the validator, and chuck the script in the root of substrait-java. Call the script without arguments to generate reports for all TPC-H queries (it will try them all, but will only generate reports for successful runs), or with a SQL query as cmdline argument for a custom query based on the TPC-H database schema. The reports are written to the tpch-output directory.

tpch.py
#!/usr/bin/env python3
# SPDX-License-Identifier: Apache-2.0

import os
import sys
import shutil
import subprocess
import json
import substrait_validator

def format_html_code_block(text, lang):
    text = text.replace('&', '&amp')
    text = text.replace('<', '&lt;')
    text = text.replace('>', '&gt;')
    text = text.replace('"', '&quot;')
    text = text.replace("'", '&apos;')
    text = text.replace(" ", '&nbsp;')
    text = text.replace("\t", '&nbsp;&nbsp;&nbsp;&nbsp;')
    text = text.replace("\n", '<br/>\n')
    return f'<pre><code class="language-{lang}">{text}</code></pre>'

_EXTRA_HEAD = """
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.5.0/styles/default.min.css">
<script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.5.0/highlight.min.js"></script>
"""

if __name__ == '__main__':

    # Figure out environment.
    script_path = os.path.dirname(os.path.realpath(__file__))
    output_path = os.path.join(script_path, 'tpch-output')
    repo_path = script_path
    java_path = repo_path
    tpch_path = os.path.join(java_path, 'isthmus', 'src', 'test', 'resources', 'tpch')
    schema_path = os.path.join(tpch_path, 'schema.sql')
    query_path = os.path.join(tpch_path, 'queries')
    isthmus_path = os.path.join(java_path, 'isthmus', 'build', 'graal', 'isthmus')

    # Create output directory.
    if not os.path.isdir(output_path):
        os.makedirs(output_path)

    # Build isthmus if it hasn't already been built.
    if not os.path.isfile(isthmus_path):
        cur = os.curdir
        os.chdir(java_path)
        try:
            subprocess.run(['./gradlew', 'nativeImage'], check=True)
        finally:
            os.chdir(cur)

    # Load TPC-H schema files.
    isthmus_args = [isthmus_path]
    with open(schema_path, 'r') as f:
        for query in filter(bool, map(str.strip, f.read().split(';'))):
            isthmus_args.append('-c')
            isthmus_args.append(query)

    # Function to run Isthmus and the validator on some named query.
    def run_isthmus(name, query):
        try:

            # Convert to Substrait plan with Isthmus.
            result = subprocess.run(isthmus_args + [query], capture_output=True)
            if result.returncode != 0:
                raise Exception(result.stderr.decode('utf-8'))
            plan = result.stdout.decode('utf-8')

            # Convert to HTML validation result with validator.
            html = substrait_validator.plan_to_html(plan)

            # Unpack the HTML a bit so we can add stuff to it.
            html_gen_a, remain = html.split('</head>', maxsplit=1)
            html_gen_b, remain = remain.split('<body>', maxsplit=1)
            html_gen_b = f'</head>{html_gen_b}<body>'
            html_gen_c, html_gen_d = remain.split('</body>', maxsplit=1)
            html_gen_d = f'</body>{html_gen_d}'

            # Add our stuff to it.
            html = []
            html.append(html_gen_a)
            html.append(f'<title>{name}</title>')
            html.append(_EXTRA_HEAD)
            html.append(html_gen_b)
            html.append(f'<h1>{name} with Isthmus</h1>')
            html.append(f'<h2>SQL query</h2>')
            html.append(format_html_code_block(query, 'sql'))
            html.append(f'<h2>Plan JSON</h2>')
            html.append(format_html_code_block(plan, 'json'))
            html.append(f'<h2>Validation result</h2>')
            html.append(html_gen_c)
            html.append('<script>hljs.highlightAll();</script>')
            html.append(html_gen_d)
            html = '\n'.join(html)

            # Write file.
            html_fname = os.path.join(output_path, f'{name}.html')
            with open(html_fname, 'w') as f:
                f.write(html)

        except Exception as e:
            print(f'{type(e).__name__} for {name}: {e}')

    if len(sys.argv) <= 1:

        # Run isthmus and the validator for all queries.
        for query_fname in sorted(os.listdir(query_path)):
            name = query_fname.split('.')[0]
            with open(os.path.join(query_path, query_fname), 'r') as f:
                query = f.read()
            run_isthmus(name, query)

    else:

        # Interpret command line as a query.
        run_isthmus('cmdline', ' '.join(sys.argv[1:]))

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions