datafusion: Allowing to expose sort order of parquet file to datafusion does not work
Describe the bug
I am trying to perform a range join on two parquet files with datafusion-cli, but it makes an unrealistic query plan.
select * from points, intervals where pt between first and last order by pt;
I try to express the sort order in both parquet files when registering the parquet files with datafusion-cli.
However, I am somehow stuck with expressing this sort order as the with order (columnname ordering)
clause seems to be unusable together with parquet files.
To Reproduce
The points table has 1 numeric column and the intervals table has two columns. With duckdb I generated parquet files for both:
copy (select generate_series as pt from (select * from generate_series(1, 100000000)) t) TO 'pts.parquet' (FORMAT PARQUET);
copy (select generate_series as first, first +2 as last from (select * from generate_series (2, 100, 5))) TO 'intervals.parquet' (FORMAT PARQUET);
(which I both ran through parquet-rewrite to add page level statistics).
parquet-rewrite --input pts.parquet --output pts_stats.parquet --writer-version 2.0 --statistics-enabled page
parquet-rewrite --input intervals.parquet --output intervals_stats.parquet --writer-version 2.0 --statistics-enabled page
I would expect that the sort order present in both files could help very much to query efficiently (I would pick a sort-merge-join, but without the sort of the files being necessary, as the data is already sorted, so just the merge part with scanning for the overlapping parts of the table).
Hence, I tried to register the parquet files with datafusion-cli and express that the files are already sorted.
This works:
CREATE EXTERNAL TABLE points
STORED AS PARQUET
LOCATION 'pts_stats.parquet';
But this does not:
CREATE EXTERNAL TABLE points
STORED AS PARQUET
LOCATION 'pts_stats.parquet'
WITH ORDER (pt ASC)
;
It errors with:
Error during planning: Provide a schema before specifying the order while creating a table.
If I try to specify the schema manually:
CREATE EXTERNAL TABLE points
(pt bigint)
STORED AS PARQUET
LOCATION 'pts_stats.parquet'
WITH ORDER (pt ASC)
;
It errors with:
Error during planning: Column definitions can not be specified for PARQUET files.
Which is not surprising.
Expected behavior
I would expect that the sort order can be expressed and that the query planner subsequently can execute a sort-merge-join, where the sort step would be unnecessary (as the data of both sides of the join is already sorted).
Additional context
No response
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 16 (12 by maintainers)
@metesynnada
Currently schemas are always resolved for Parquet files, so there shouldn’t be a need to specify the schema. I’m inclined to agree with @bmmeijers here, but you might have some context that says otherwise.
During parsing, can we just ignore the fact that the schema is inferred later? I’m not familiar with the implementation of
WITH ORDER
. Still looking at it.This is currently supported for CSV files but not Parquet. I think this would be good first issue for new contributors.