datafusion: Error "entered unreachable code: NamedStructField should be rewritten in OperatorToFunction" after upgrade to 37

NOTE – Here is an example of how to make Expr::NamedStructField work in 37.1.0: https://github.com/apache/datafusion/pull/10183

Is your feature request related to a problem or challenge?

In 37.0.0 many of the built in functions have been migrated to UDFs as described on #8045 . The migration is completed in 38.0.0

One part of this change is that now certain Exprs must be rewritten into the appropriate functions. Most notably get_field that extracts a field from a Struct

Among other things this allows people to customize how Expr behaves: https://github.com/apache/datafusion/issues/7845#issuecomment-2066290235 or in slack to return NULLs for rows that don’t pass in maps

The rewrite happens automatically as part of the logical planner (in the Analyzer pass)

However if you bypass those passes it will not happen

Yeah you need to use the FunctionRewriter here (with the relevant rewriter registered) https://github.com/apache/arrow-datafusion/blob/0573f78c7e7a4d94c3204cee464b3860479e0afb/datafusion/optimizer/src/analyzer/function_rewrite.rs#L33

Example

An example from discord: link is:

  let schema = Schema::new(vec![
        Field::new("id", DataType::Utf8, true),
        Field::new(
            "props",
            DataType::Struct(Fields::from(vec![Field::new("a", DataType::Utf8, true)])),
            true,
        ),
    ]);

    println!("schema {:?}", schema);

    let df_schema = DFSchema::try_from(schema.clone()).unwrap();

    let plan = table_scan(Some("props_test"), &schema, None)?
        .filter(col("props").field("a").eq(lit("2021-02-02")))?
        .build()?;
    println!("logical plan {:?}", plan);
    let phys = DefaultPhysicalPlanner::default().create_physical_expr(&plan.expressions()[0], &df_schema, &SessionContext::new().state())?;
    println!("phys {:?}", phys);
    Ok(())

This returns an error “NamedStructField should be rewritten in OperatorToFunction”

Describe the solution you’d like

No response

Describe alternatives you’ve considered

One potential workaround is to call get_field directly rather than Expr::field

So instead of

    let plan = table_scan(Some("props_test"), &schema, None)?
        .filter(col("props").field("a").eq(lit("2021-02-02")))?
        .build()?;

call like

  let plan = table_scan(Some("props_test"), &schema, None)?
        .filter(get_field(col("props", "a")).eq(lit("2021-02-02")))?
        .build()?;

Additional context

@ion-elgreco is seeing the same issue in Delta-rs: https://github.com/apache/datafusion/issues/9904#issuecomment-2069359171

I tried it with 37.1.0 in delta-rs, but we still get this error: internal error: entered unreachable code: NamedStructField should be rewritten in OperatorToFunction, wasn’t this regression fixed?

@westonpace brings it up in discord link

Another report in discord: link

About this issue

Original URL
State: open
Created 2 months ago
Comments: 19 (16 by maintainers)

Most upvoted comments

I am thinking I’ll try and make a PR with such an API over the next day or two to see how it might look

alamb on Apr 23, 2024

done between the parse plan and the logical plan

I had also thought about deprecating Expr and use functions directly in parsing phase. I think it might be a good idea. The downside of the current impl is that one needs to register function rewrite rule to convert Expr to Function. I think Register is better not a neccessary step for default behavior. If we have functions in parsing phase, no additional step (register rewrite) is needed.

What’s the motivation for doing this at the logical level instead of doing this as part of the conversion from logical to physical?

One good reason is that we don’t need to care about physical expr if we converted it to functions in logical level. I think the early we optimize, the less duplicated things we leave. if we move this rewrite to logical-to-physical step, we need to map it to somewhat of physical-expr, which apparently is not a better idea.

I think the idea is people might want to override Expr::GetStructFields semantics and they way they would do so is to rewrite it into a different function. I think this is especially compelling for supporting JSON/JSONB for example

If we have functions after parsing, they can rewrite functions to their expected one through register their own rewrite rules, so I think it is also not a problem

Do we now add “rewritten logical plan” to the list?

“rewritten / simplified logical plan” is actually “optimized logical plan” to me, “optimized” is more general term.

I think we need a better API to do this for real (in 38.0.0 and going forward). I will think about this – maybe @jayzhan211 has some thoughts

I think we are talking about better API design of get_field right? I think we can take reference from others. Duckdb has struct_extract and map_extract

jayzhan211 on Apr 23, 2024

I don’t think there is any particular motivation (or any reason that the conversion needs to be done at either spot) 🤔

I think, for me, it’s just a general unease with having multiple ways of expressing the same thing. I feel like this can lead to “implicit layers” of the plan. For example, there is already some notion of “parse plan”, “unoptimized logical plan” and “optimized logical plan”, and “physical plan”. The middle two are both represented by Expr which can be subtle. Do we now add “rewritten logical plan” to the list? Or maybe “rewritten” and “simplified” are just very transient states between “unoptimized” and “optimized” and I am blowing things out of proportion.

Another way to tackle it could be to leave the concept of a GetIndexedField node at the parsing layer and pull it out of Expr (or deprecate). This would force the conversion to be done between the parse plan and the logical plan.

That being said, my needs are met (thanks again for your help), and perfect is the enemy of the good, so I’m happy to leave well enough alone.

westonpace on Apr 23, 2024

I think this can be controlled by the consumer – for example if you are walking Exprs in lancedb, you can control when you transform Expr::GetStructField into ScalarUDF and depending on where you do your analysis you only have to check for one

I agree (and thanks for the example). This works for us. All of our expr from the user start as &str. However, if we were receiving Expr directly then it wouldn’t work because we wouldn’t know which approach the user used. This is not a problem for us, we aren’t expecting the user to provide direct DF structs in any of our roadmap features, I’m just thinking through possibilities.

I think the idea is people might want to override Expr::GetStructFields semantics and they way they would do so is to rewrite it into a different function. I think this is especially compelling for supporting JSON/JSONB for example

What’s the motivation for doing this at the logical level instead of doing this as part of the conversion from logical to physical?

westonpace on Apr 22, 2024