datafusion: Mismatch between schema and batches on a CREATE TABLE with a windowing query
Describe the bug
This doesn’t work but should:
DataFusion CLI v20.0.0
❯ create table temp as with orders as (
select 1 as o_custkey
)
SELECT RANK() OVER (PARTITION BY o_custkey)
FROM orders;
Error during planning: Mismatch between schema and batches
To Reproduce
See above
Expected behavior
This should not throw the mismatch error
Additional context
http://sqlfiddle.com/#!17/1d310/1
Note: if I slap round(…) around the window expression, it begins to work:
DataFusion CLI v20.0.0
❯ create table temp as with orders as (
select 1 as o_custkey
)
SELECT round(RANK() OVER (PARTITION BY o_custkey), 5)
FROM orders;
0 rows in set. Query took 0.012 seconds.
About this issue
- State: closed
- Created a year ago
- Comments: 15 (10 by maintainers)
You are right: the column names from the first UNION ALL branch are the driving ones.
This case is not correct; the column names have to be
count, n_regionkey
If I remove the ORDER BY, the result is even more surprising.
The bug is partially related to incorrect column name derivation in UNION ALL.
I will prepare a fix for UNION ALL first and then test other scenarios, such as non-deterministic column naming with and without ORDER BY.
I’m looking into this today!
@milevin I have looked into the code, and another, more natural workaround is to give the window expression an alias.
The code currently uses the alias if one is given; otherwise it shortens the name to prevent huge, unreadable names. @alamb To be honest, I'm not sure whether we should revert https://github.com/apache/arrow-datafusion/blob/26e1b20ea3362ea62cb713004a0636b8af6a16d7/datafusion/core/src/physical_plan/planner.rs#L1630
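For reference, the alias workaround applied to the repro from the issue description would presumably look something like the sketch below. The alias name `rnk` is only an illustrative choice, not something taken from the codebase, and this form has not been confirmed here against CLI v20.0.0.

```sql
-- Sketch of the alias workaround: same statement as the repro,
-- with an explicit alias on the window expression.
-- `rnk` is a made-up alias name; any valid identifier should do.
CREATE TABLE temp AS
WITH orders AS (
    SELECT 1 AS o_custkey
)
SELECT RANK() OVER (PARTITION BY o_custkey) AS rnk
FROM orders;
```

With an explicit alias, the output column name should no longer depend on how the planner derives or shortens the name of the window expression, which appears to be where the schema and the batches diverge.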
Great, thank you for looking into this! I also discovered a similar workaround (and added it into the Additional context).
We might have observed the same issue outside of windowing functions; I’ll see if I can create more repros.