prql: Same column name in multiple tables causes buggy behaviour
When multiple PRQL tables share an identical column name (e.g. id): joining them together causes the compiler to reuse the meaning of one id in multiple places (even where it’s incorrect).
This reduced example should illustrate it best:
table x = (
from x_orig # table with columns: id
derive [
x_id = id
]
select [x_id]
)
table y = (
from y_orig # table with columns: id, x_id
derive [
y_id = id
]
select [y_id, x_id]
)
from x
join y [x_id]
compiled SQL:
WITH x AS (
SELECT
id AS x_id
FROM
x_orig
),
y AS (
SELECT
x_id AS y_id, --------- this should instead be `id AS y_id`
x_id
FROM
y_orig
)
SELECT
x.*,
y.*,
x_id
FROM
x
JOIN y USING(x_id)
Note that if the PRQL derive statement of y_id = id is changed to y_id = y_orig.id, then the buggy output SQL line correctly becomes id AS y_id. But, we can’t expect users to always add these (unless this becomes an explicit requirement)
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 24 (14 by maintainers)
Commits related to this issue
- internal: Add `fold_query` to ASTFold I think to fix #820 we're going to need to know "where we are" in the folding — what tables we've seen before, what is currently visible; which I think means we ... — committed to PRQL/prql by max-sixty 2 years ago
- fix: Don't quote uppercase strings This is a temporary workaround for @mklopets & team, while we resolve #820. @aljazerzen -- not sure if you think this is an acceptable tradeoff? I'll wait for your... — committed to PRQL/prql by max-sixty 2 years ago
it looks like #908 greatly improved the situation but there’s still similar buggy behaviour:
currently compiles to
this is problematic e.g. if left joining
y–y.x_idisnullfor rows that don’t match there (this example code query isn’t perfect, but our real query is hundreds of lines long)This was truly fixed with
v0.3.0!Here’s one more example of something where working around this issue is very difficult (if not impossible):
If #820 wasn’t a thing, then there would be no need for the S-string, but if we do need it (due to this very same issue here), then this will compile to:
as a terrible workaround, one could do:
which compiles to
table_0.*.field; because there are no valid uses for.*.(AFAIK), one could theoretically replace every occurrence of.*.in the compiled SQL with.Yeah, just to give you an idea of how inefficient I’m being — here’s the current
dbgdiff!After sleeping on it, I think we should do something like the above suggestion — let us engage / test with these inner APIs without needing a full PRQL statement. That means making them easy to create, easy to serialize, etc. So a test can be:
…and then we can test it more directly