Skip to content

Support reading multiple parquet files via datafusion-cli #16303

@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

This is an idea that @robtandy brought up on the DataFusion sync call the other day and I think it would be pretty useful.

The usecase is "I want to read more than 1 but not an entire directory of parquet files from remote object store" -- I think in this case to look at some particular files

For example, let's say you want to read just these two files:

  • s3://clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet
  • s3://clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_2.parquet

There is currently no way to do so via SQL. You can either do the entire directory

> CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/' options (aws.region 'eu-central-1');
0 row(s) fetched.
Elapsed 2.928 seconds.

Or you can read each file separately

> CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet' options (aws.region 'eu-central-1');
0 row(s) fetched.
Elapsed 1.017 seconds.

Describe the solution you'd like

I would like to be able to read an arbitrary set of remote parquet files

It would also be awesome to support GLOB files (e.g. *) which has been requested before

Describe alternatives you've considered

I suggest we implement a TableFunction similar to the DuckDB read_parquet file ONLY in the datafusion-cli source

So to query the files listed above, this would look like

SELECT * FROM read_parquet([
  'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet', 
  'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_2.parquet'
]);

From the duckdb docs: https://duckdb.org/docs/stable/data/parquet/overview.html

-- read file1, file2, file3
SELECT *
FROM read_parquet(['file1.parquet', 'file2.parquet', 'file3.parquet']);
-- Support GLOB access
SELECT *
FROM read_parquet(['folder1/*.parquet', 'folder2/*.parquet']);

We already support the parquet_metadata function in datafusion-cli (docs)

SELECT path_in_schema, row_group_id, row_group_num_rows, stats_min, stats_max, total_compressed_size
FROM parquet_metadata('hits.parquet')
WHERE path_in_schema = '"WatchID"'
LIMIT 3;

+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
| path_in_schema | row_group_id | row_group_num_rows | stats_min           | stats_max           | total_compressed_size |
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
| "WatchID"      | 0            | 450560             | 4611687214012840539 | 9223369186199968220 | 3883759               |
| "WatchID"      | 1            | 612174             | 4611689135232456464 | 9223371478009085789 | 5176803               |
| "WatchID"      | 2            | 344064             | 4611692774829951781 | 9223363791697310021 | 3031680               |
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
3 rows in set. Query took 0.053 seconds.

Here is the code implementation:

impl TableFunctionImpl for ParquetMetadataFunc {

We can also look at the ClickBench S3 command that is similar: https://clickhouse.com/docs/integrations/s3

DESCRIBE TABLE s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_*.gz', 'TabSeparatedWithNames');

Open questions

What to do if the files are on different object stores (e.g. S3 and http):

SELECT * FROM read_parquet([
  'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet', 
  -- note a different object store 
  's3://public-datasets/hits_compatible/athena_partitioned/hits_2.parquet'
]);

At first I suggest we don't try and support this

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions