datafusion: `upper` (and other string functions) don't support String `Dictionary` types: Internal error: The "upper" function can only accept strings.
Describe the bug
Running upper(col)
where col is a dictionary results in an internal error:
Internal error: The "upper" function can only accept strings.. This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
To Reproduce Requires the arrow_cast function in https://github.com/apache/arrow-datafusion/pull/5166
DataFusion CLI v19.0.0
❯ select upper('foo');
+--------------------+
| upper(Utf8("foo")) |
+--------------------+
| FOO |
+--------------------+
1 row in set. Query took 0.004 seconds.
❯ select upper(arrow_cast('foo', 'Dictionary(Int32, Utf8)'));
Internal error: The "upper" function can only accept strings.. This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
Expected behavior
The test should pass and produce FOO
Additional context Reported by @sanderson at https://github.com/influxdata/docs-v2/pull/4773#issuecomment-1452256546
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 20 (20 by maintainers)
https://docs.rs/arrow-array/latest/arrow_array/array/struct.DictionaryArray.html#method.with_values might be useful here
This query now works in DataFusion so closing this ticket. Sorry for the confusion
Do you mean something in DataFusion?
Perhaps here: https://github.com/apache/arrow-datafusion/blob/00627785718d9d98998021bf44585f32c33af3ea/datafusion/physical-expr/src/functions.rs#L340
Are you suggesting we handle evaluating scalar functions on DictionaryArrays in the generic function evaluate function rather than doing something special for string functions?
So the current state of play is we have a function to obtain a type erased function that operates on ColumnarValue and produces a ColumnarValue for non-dictionary arrays. I’m not sure if this is currently specialized to the array type, but it definitely could be.
It should therefore be trivial to make this same function recurse for the dictionary values type, and then return a type-erased function that operates on dictionaries by using the type-erased values function and manipulating its output.
This would be simpler, require minimal additional codegen, and naturally generalise to all unary functions and argument types.
TLDR you shouldn’t need to write any additional code specialized on anything other than the dictionary arrays themselves
I presume this is what you intend to do, but I would recommend implementing this by evaluating the function on the dictionary values using the existing kernels, and then passing the result from this into the take kernel with the dictionary keys as the second argument. This should not only be less code, but will avoid a codegen explosion from parameterising generic code on both dictionary key types and value types
The code for these functions is in https://github.com/apache/arrow-datafusion/blob/main/datafusion/physical-expr/src/string_expressions.rs
As an initial step, I recommend creating some sqllogictests reproducing the issue. sqllogictest is explained here: https://github.com/apache/arrow-datafusion/tree/main/datafusion/core/tests/sqllogictests
Perhaps by extending what is in https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/sqllogictests/test_files/functions.slt