datafusion: Error "entered unreachable code: NamedStructField should be rewritten in OperatorToFunction" after upgrade to 37
NOTE – Here is an example of how to make Expr::NamedStructField
work in 37.1.0: https://github.com/apache/datafusion/pull/10183
Is your feature request related to a problem or challenge?
In 37.0.0 many of the built in functions have been migrated to UDF
s as described on #8045 . The migration is completed in 38.0.0
One part of this change is that now certain Expr
s must be rewritten into the appropriate functions. Most notably get_field
that extracts a field from a Struct
Among other things this allows people to customize how Expr behaves: https://github.com/apache/datafusion/issues/7845#issuecomment-2066290235 or in slack to return NULLs for rows that don’t pass in maps
The rewrite happens automatically as part of the logical planner (in the Analyzer pass)
However if you bypass those passes it will not happen
Yeah you need to use the FunctionRewriter here (with the relevant rewriter registered) https://github.com/apache/arrow-datafusion/blob/0573f78c7e7a4d94c3204cee464b3860479e0afb/datafusion/optimizer/src/analyzer/function_rewrite.rs#L33
Example
An example from discord: link is:
let schema = Schema::new(vec![
Field::new("id", DataType::Utf8, true),
Field::new(
"props",
DataType::Struct(Fields::from(vec![Field::new("a", DataType::Utf8, true)])),
true,
),
]);
println!("schema {:?}", schema);
let df_schema = DFSchema::try_from(schema.clone()).unwrap();
let plan = table_scan(Some("props_test"), &schema, None)?
.filter(col("props").field("a").eq(lit("2021-02-02")))?
.build()?;
println!("logical plan {:?}", plan);
let phys = DefaultPhysicalPlanner::default().create_physical_expr(&plan.expressions()[0], &df_schema, &SessionContext::new().state())?;
println!("phys {:?}", phys);
Ok(())
This returns an error “NamedStructField should be rewritten in OperatorToFunction”
Describe the solution you’d like
No response
Describe alternatives you’ve considered
One potential workaround is to call get_field
directly rather than Expr::field
So instead of
let plan = table_scan(Some("props_test"), &schema, None)?
.filter(col("props").field("a").eq(lit("2021-02-02")))?
.build()?;
call like
let plan = table_scan(Some("props_test"), &schema, None)?
.filter(get_field(col("props", "a")).eq(lit("2021-02-02")))?
.build()?;
Additional context
@ion-elgreco is seeing the same issue in Delta-rs: https://github.com/apache/datafusion/issues/9904#issuecomment-2069359171
I tried it with 37.1.0 in delta-rs, but we still get this error: internal error: entered unreachable code: NamedStructField should be rewritten in OperatorToFunction, wasn’t this regression fixed?
@westonpace brings it up in discord link
Another report in discord: link
About this issue
- Original URL
- State: open
- Created 2 months ago
- Comments: 19 (16 by maintainers)
I am thinking I’ll try and make a PR with such an API over the next day or two to see how it might look
I had also thought about deprecating
Expr
and use functions directly in parsing phase. I think it might be a good idea. The downside of the current impl is that one needs to register function rewrite rule to convert Expr to Function. I think Register is better not a neccessary step for default behavior. If we have functions in parsing phase, no additional step (register rewrite) is needed.One good reason is that we don’t need to care about physical expr if we converted it to functions in logical level. I think the early we optimize, the less duplicated things we leave. if we move this rewrite to
logical-to-physical
step, we need to map it to somewhat of physical-expr, which apparently is not a better idea.If we have functions after parsing, they can rewrite functions to their expected one through register their own rewrite rules, so I think it is also not a problem
“rewritten / simplified logical plan” is actually “optimized logical plan” to me, “optimized” is more general term.
I think we are talking about better API design of
get_field
right? I think we can take reference from others. Duckdb has struct_extract and map_extractI think, for me, it’s just a general unease with having multiple ways of expressing the same thing. I feel like this can lead to “implicit layers” of the plan. For example, there is already some notion of “parse plan”, “unoptimized logical plan” and “optimized logical plan”, and “physical plan”. The middle two are both represented by
Expr
which can be subtle. Do we now add “rewritten logical plan” to the list? Or maybe “rewritten” and “simplified” are just very transient states between “unoptimized” and “optimized” and I am blowing things out of proportion.Another way to tackle it could be to leave the concept of a
GetIndexedField
node at the parsing layer and pull it out ofExpr
(or deprecate). This would force the conversion to be done between the parse plan and the logical plan.That being said, my needs are met (thanks again for your help), and perfect is the enemy of the good, so I’m happy to leave well enough alone.
I agree (and thanks for the example). This works for us. All of our expr from the user start as
&str
. However, if we were receivingExpr
directly then it wouldn’t work because we wouldn’t know which approach the user used. This is not a problem for us, we aren’t expecting the user to provide direct DF structs in any of our roadmap features, I’m just thinking through possibilities.What’s the motivation for doing this at the logical level instead of doing this as part of the conversion from logical to physical?