prql: Same column name in multiple tables causes buggy behaviour

When multiple PRQL tables share an identical column name (e.g. id): joining them together causes the compiler to reuse the meaning of one id in multiple places (even where it’s incorrect).

This reduced example should illustrate it best:

table x = (
  from x_orig # table with columns: id
  derive [
    x_id = id
  ]
  select [x_id]
)

table y = (
  from y_orig # table with columns: id, x_id
  derive [
    y_id = id
  ]
  select [y_id, x_id]
)

from x
join y [x_id]

compiled SQL:

WITH x AS (
  SELECT
    id AS x_id
  FROM
    x_orig
),
y AS (
  SELECT
    x_id AS y_id, --------- this should instead be `id AS y_id`
    x_id
  FROM
    y_orig
)
SELECT
  x.*,
  y.*,
  x_id
FROM
  x
  JOIN y USING(x_id)

Note that if the PRQL derive statement of y_id = id is changed to y_id = y_orig.id, then the buggy output SQL line correctly becomes id AS y_id. But, we can’t expect users to always add these (unless this becomes an explicit requirement)

About this issue

Original URL
State: closed
Created 2 years ago
Comments: 24 (14 by maintainers)

Commits related to this issue

internal: Add `fold_query` to ASTFold I think to fix #820 we're going to need to know "where we are" in the folding — what tables we've seen before, what is currently visible; which I think means we ... — committed to PRQL/prql by max-sixty 2 years ago
fix: Don't quote uppercase strings This is a temporary workaround for @mklopets & team, while we resolve #820. @aljazerzen -- not sure if you think this is an acceptable tradeoff? I'll wait for your... — committed to PRQL/prql by max-sixty 2 years ago

Most upvoted comments

it looks like #908 greatly improved the situation but there’s still similar buggy behaviour:

table y = (
  from x
  select [
    x_id = x.x_id,
  ]
)

from x
join y [something]

select [
  x_id = x.x_id # this should compile to selecting x.x_id, not y.x_id 
]

currently compiles to

WITH y AS (
  SELECT
    x_id
  FROM
    x
)
SELECT
  y.x_id <--------------------------- this is incorrect, should be `x.x_id`
FROM
  x
  JOIN y USING(something)

this is problematic e.g. if left joining y – y.x_id is null for rows that don’t match there (this example code query isn’t perfect, but our real query is hundreds of lines long)

mklopets on Aug 19, 2022

This was truly fixed with v0.3.0!

mklopets on Nov 30, 2022

Here’s one more example of something where working around this issue is very difficult (if not impossible):

from x

filter field > 0 # this causes an intermediary table_0 to be created

join y [something]

select [
  # if this has to use an S-string because of issue #820,
  # then this incorrectly outputs x.field instead of table_0.field,
  # which will error: missing FROM-clause entry for table "x"
  field = s"{x}.field"
]

If #820 wasn’t a thing, then there would be no need for the S-string, but if we do need it (due to this very same issue here), then this will compile to:

WITH table_0 AS (
  SELECT
    x.*
  FROM
    x
  WHERE
    field > 0
)
SELECT
  x.field AS field <---------------- this will error: missing FROM-clause entry for table "x"
FROM
  table_0
  JOIN y USING(something)

as a terrible workaround, one could do:

-  field = s"{x}.field"
+  field = s"{x.*}.field"

which compiles to table_0.*.field; because there are no valid uses for .*. (AFAIK), one could theoretically replace every occurrence of .*. in the compiled SQL with .

mklopets on Nov 27, 2022

I’ve also found PRQL -> SQL debugging approach quite lacking. I fall back on dbg! and prql-compiler debug command, which is quite good at explaining what gets resolved into what.

Yeah, just to give you an idea of how inefficient I’m being — here’s the current dbg diff!

diff --git a/prql-compiler/src/semantic/name_resolver.rs b/prql-compiler/src/semantic/name_resolver.rs
index 11c61bdd..9a9ebaee 100644
--- a/prql-compiler/src/semantic/name_resolver.rs
+++ b/prql-compiler/src/semantic/name_resolver.rs
@@ -97,6 +97,14 @@ fn fold_nodes(&mut self, items: Vec<Node>) -> Result<Vec<Node>> {
     }
 
     fn fold_node(&mut self, mut node: Node) -> Result<Node> {
+        if node
+            .item
+            .as_ident()
+            .map_or(false, |x| x == "foo" || x == "y")
+        {
+            dbg!(&node);
+        }
+
         let r = match node.item {
             Item::FuncCall(ref func_call) => {
                 // find declaration
@@ -104,16 +112,20 @@ fn fold_node(&mut self, mut node: Node) -> Result<Node> {
                     self.lookup_variable(&func_call.name, node.span)
                         .map_err(|e| Error::new(Reason::Simple(e)).with_span(node.span))?,
                 );
-
                 self.fold_function_call(node)?
             }
 
             Item::Ident(ref ident) => {
+                if node.item.as_ident().map_or(false, |x| x == "foo") {
+                    dbg!(&node);
+                }
                 node.declared_at = Some(
                     (self.lookup_variable(ident, node.span))
                         .map_err(|e| Error::new(Reason::Simple(e)).with_span(node.span))?,
                 );
-
+                if node.item.as_ident().map_or(false, |x| x == "foo") {
+                    dbg!(&node);
+                }
                 // convert ident to function without args
                 let decl = &self.context.declarations.0[node.declared_at.unwrap()].0;
                 if matches!(decl, Declaration::Function(_)) {
@@ -122,6 +134,9 @@ fn fold_node(&mut self, mut node: Node) -> Result<Node> {
                         args: vec![],
                         named_args: Default::default(),
                     });
+                    if node.item.as_ident().map_or(false, |x| x == "foo") {
+                        dbg!(&node);
+                    }
                     self.fold_function_call(node)?
                 } else {
                     node
@@ -263,6 +278,8 @@ fn fold_join_filter(&mut self, filter: JoinFilter) -> Result<JoinFilter> {
 
     fn fold_table(&mut self, mut table: Table) -> Result<Table> {
         // fold pipeline
+        dbg!("folding table", &table.name);
+        // Run a separate name resolution process for the table, to ensure
         table.pipeline = Box::new(self.fold_node(*table.pipeline)?);
 
         // declare table
@@ -507,6 +524,9 @@ fn validate_function_call(
     pub fn lookup_variable(&mut self, ident: &str, span: Option<Span>) -> Result<usize, String> {
         let (namespace, variable) = split_var_name(ident);
 
+        if ident == "foo" {
+            dbg!(&self.context.scope.variables);
+        }
         if let Some(decls) = self.context.scope.variables.get(ident) {
             // lookup the inverse index
 
@@ -514,7 +534,7 @@ pub fn lookup_variable(&mut self, ident: &str, span: Option<Span>) -> Result<usi
                 0 => unreachable!("inverse index contains empty lists?"),
 
                 // single match, great!
-                1 => Ok(decls.iter().next().cloned().unwrap()),
+                1 => Ok(dbg!(decls.iter().next().cloned().unwrap())),
 
                 // ambiguous
                 _ => {

After sleeping on it, I think we should do something like the above suggestion — let us engage / test with these inner APIs without needing a full PRQL statement. That means making them easy to create, easy to serialize, etc. So a test can be:

fn test_context() {
    context = Context::new();
    let names = resolve_names(<small AST>);
    assert_snapshot(names, @"<a couple of lines long>");
}

…and then we can test it more directly

max-sixty on Aug 3, 2022