Description
Is your feature request related to a problem or challenge?
This is an idea that @robtandy brought up on the DataFusion sync call the other day, and I think it would be pretty useful.
The use case is "I want to read more than one parquet file, but not an entire directory, from a remote object store" -- in this case, to look at some particular files.
For example, let's say you want to read just these two files:
- s3://clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet
- s3://clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_2.parquet
There is currently no way to do so via SQL. You can either read the entire directory:
```sql
> CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/'
OPTIONS (aws.region 'eu-central-1');
0 row(s) fetched.
Elapsed 2.928 seconds.
```
Or you can read each file separately:
```sql
> CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet'
OPTIONS (aws.region 'eu-central-1');
0 row(s) fetched.
Elapsed 1.017 seconds.
```
Describe the solution you'd like
I would like to be able to read an arbitrary set of remote parquet files.
It would also be awesome to support glob patterns (e.g. `*`), which has been requested before.
Describe alternatives you've considered
I suggest we implement a `TableFunction` similar to the DuckDB `read_parquet` function, ONLY in datafusion-cli (source).
So to query the files listed above, this would look like:
```sql
SELECT * FROM read_parquet([
  'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet',
  'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_2.parquet'
]);
```
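For context, the engine side can already scan an explicit list of files through the DataFrame API, since `SessionContext::read_parquet` accepts a `Vec` of table paths; the proposal is essentially to expose the same capability to SQL. A minimal sketch (the file names are local placeholders; remote paths would additionally need an object store registered):

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // `read_parquet` accepts a Vec of paths, so an arbitrary set of files
    // already works from Rust; the proposed table function would expose
    // the same thing to SQL users.
    let df = ctx
        .read_parquet(
            vec!["hits_1.parquet", "hits_2.parquet"],
            ParquetReadOptions::default(),
        )
        .await?;
    df.show_limit(5).await?;
    Ok(())
}
```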
From the DuckDB docs: https://duckdb.org/docs/stable/data/parquet/overview.html
```sql
-- read file1, file2, file3
SELECT *
FROM read_parquet(['file1.parquet', 'file2.parquet', 'file3.parquet']);

-- Support GLOB access
SELECT *
FROM read_parquet(['folder1/*.parquet', 'folder2/*.parquet']);
```
We already support the `parquet_metadata` function in datafusion-cli (docs):
```sql
SELECT path_in_schema, row_group_id, row_group_num_rows, stats_min, stats_max, total_compressed_size
FROM parquet_metadata('hits.parquet')
WHERE path_in_schema = '"WatchID"'
LIMIT 3;
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
| path_in_schema | row_group_id | row_group_num_rows | stats_min           | stats_max           | total_compressed_size |
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
| "WatchID"      | 0            | 450560             | 4611687214012840539 | 9223369186199968220 | 3883759               |
| "WatchID"      | 1            | 612174             | 4611689135232456464 | 9223371478009085789 | 5176803               |
| "WatchID"      | 2            | 344064             | 4611692774829951781 | 9223363791697310021 | 3031680               |
+----------------+--------------+--------------------+---------------------+---------------------+-----------------------+
3 rows in set. Query took 0.053 seconds.
```
Here is the code implementation: `datafusion/datafusion-cli/src/functions.rs`, line 322 at commit 85f6621.
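To make the shape concrete, here is a rough sketch of what the `call` side might look like, modeled on `ParquetMetadataFunc` in that file. The struct name and argument handling are hypothetical, exact module paths and the `Expr::Literal` shape vary across DataFusion versions, and the actual table provider construction is left as a TODO since it needs schema inference:

```rust
use std::sync::Arc;

use datafusion::catalog::{TableFunctionImpl, TableProvider};
use datafusion::common::{plan_err, Result, ScalarValue};
use datafusion::logical_expr::Expr;

/// Hypothetical `read_parquet('path1', 'path2', ...)` table function.
/// (Accepting a `['path1', 'path2']` array literal, as in the DuckDB syntax
/// above, would additionally require unpacking the array expression.)
#[derive(Debug)]
struct ReadParquetFunc;

impl TableFunctionImpl for ReadParquetFunc {
    fn call(&self, args: &[Expr]) -> Result<Arc<dyn TableProvider>> {
        // Collect one string literal per argument as a file path / URL
        let mut paths = Vec::with_capacity(args.len());
        for arg in args {
            match arg {
                Expr::Literal(ScalarValue::Utf8(Some(s))) => paths.push(s.clone()),
                _ => return plan_err!("read_parquet expects string literal arguments"),
            }
        }
        // TODO: build a ListingTable (or similar provider) over `paths`,
        // inferring the schema from the first file, analogous to how
        // parquet_metadata reads footers in functions.rs
        plan_err!("not yet implemented: {} paths collected", paths.len())
    }
}
```

datafusion-cli would then register it the same way `parquet_metadata` is registered, e.g. `ctx.register_udtf("read_parquet", Arc::new(ReadParquetFunc))`.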
We can also look at the ClickHouse `s3` table function that ClickBench uses, which is similar: https://clickhouse.com/docs/integrations/s3
```sql
DESCRIBE TABLE s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_*.gz', 'TabSeparatedWithNames');
```
Open questions
What to do if the files are on different object stores (e.g. S3 and http):
```sql
SELECT * FROM read_parquet([
  'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet',
  -- note a different object store
  's3://public-datasets/hits_compatible/athena_partitioned/hits_2.parquet'
]);
```
At first, I suggest we don't try to support this.
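Part of why this is hard: each scheme + bucket is a separate `ObjectStore` that has to be registered with the `SessionContext` (with its own credentials) before a path can be resolved, so a mixed list implies resolving and registering several stores up front. A rough sketch of what a single per-store registration looks like today (bucket name and region are taken from the example above; the builder options are illustrative):

```rust
use std::sync::Arc;

use datafusion::error::Result;
use datafusion::prelude::SessionContext;
use object_store::aws::AmazonS3Builder;
use url::Url;

fn register_stores(ctx: &SessionContext) -> Result<()> {
    // Each distinct scheme + bucket needs its own ObjectStore instance;
    // https:// URLs would be served by a different store (object_store::http).
    let s3 = AmazonS3Builder::from_env() // picks up AWS_* credentials
        .with_bucket_name("public-datasets")
        .with_region("eu-central-1")
        .build()?;
    ctx.register_object_store(
        &Url::parse("s3://public-datasets").unwrap(),
        Arc::new(s3),
    );
    Ok(())
}
```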
Additional context
No response