datafusion: Mismatch between schema and batches on a CREATE TABLE with a windowing query

Describe the bug

This doesn’t work but should:

DataFusion CLI v20.0.0
❯ create table temp as with orders as (
  select 1 as o_custkey
)
SELECT RANK() OVER (PARTITION BY o_custkey)
FROM orders;
Error during planning: Mismatch between schema and batches

To Reproduce

See above

Expected behavior

This should not throw the mismatch error

Additional context

http://sqlfiddle.com/#!17/1d310/1

Note: if I slap round(…) around the window expression, it begins to work:

DataFusion CLI v20.0.0
❯ create table temp as with orders as (
  select 1 as o_custkey
)
SELECT round(RANK() OVER (PARTITION BY o_custkey), 5)
FROM orders;
0 rows in set. Query took 0.012 seconds.

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 15 (10 by maintainers)

Most upvoted comments

you are right, col names from first union all branch are the driving

This case is not correct, col names has to be count, n_regionkey

❯ WITH w1 AS (select 1 as x , max(10) as y), w2 AS (select 5 as n_regionkey)
select count(*) count, n_regionkey from w2 group by n_regionkey
union all
select x, y from w1 order by n_regionkey, count desc;
+-------+----+
| count | y  |
+-------+----+
| 1     | 5  |
| 1     | 10 |
+-------+----+

If I remove order by I’m getting even more surprising

❯ WITH w1 AS (select 1 as x , max(10) as y), w2 AS (select 5 as n_regionkey)
select count(*) count, n_regionkey from w2 group by n_regionkey
union all
select x, y from w1;
+---+----+
| x | y  |
+---+----+
| 1 | 10 |
| 1 | 5  |
+---+----+

The bug partially related to wrong col name derivation in UNION ALL

❯ select  1 a, 2 b union all select 3 c, 4 d
;
+---+---+
| c | d |
+---+---+
| 3 | 4 |
| 1 | 2 |
+---+---+

I will prepare a fix for UNION ALL first and then test out other scenarios, like not deterministic column naming with and without ORDER BY

I’m looking into this today!

@milevin I have looked into the code and another workaround, more natural is to give an alias

❯ create table temp1 as with orders as (
  select 1 as o_custkey
)
SELECT RANK() OVER (PARTITION BY o_custkey) as a
FROM orders;
0 rows in set. Query took 0.010 seconds.

The code currently uses alias if its given or shortened the name to prevent huge unreadable names. @alamb I’m not sure tbh if we should revert https://github.com/apache/arrow-datafusion/blob/26e1b20ea3362ea62cb713004a0636b8af6a16d7/datafusion/core/src/physical_plan/planner.rs#L1630

Great, thank you for looking into this! I also discovered a similar workaround (and added it into the Additional context).

We might have observed the same issue outside of windowing functions; I’ll see if I can create more repros.