datafusion: `upper` (and other string functions) don't support String `Dictionary` types: Internal error: The "upper" function can only accept strings.

Describe the bug Running upper(col) where col is a dictionary results in an internal error:

Internal error: The "upper" function can only accept strings.. This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

To Reproduce Requires the arrow_cast function in https://github.com/apache/arrow-datafusion/pull/5166

DataFusion CLI v19.0.0
❯ select upper('foo');
+--------------------+
| upper(Utf8("foo")) |
+--------------------+
| FOO                |
+--------------------+
1 row in set. Query took 0.004 seconds.
❯ select upper(arrow_cast('foo', 'Dictionary(Int32, Utf8)'));
Internal error: The "upper" function can only accept strings.. This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

Expected behavior The test should pass and produce FOO

Additional context Reported by @sanderson at https://github.com/influxdata/docs-v2/pull/4773#issuecomment-1452256546

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 20 (20 by maintainers)

Most upvoted comments

This query now works in DataFusion so closing this ticket. Sorry for the confusion

So the current state of play is we have a function to obtain a type erased function that operates on ColumnarValue and produces a ColumnarValue for non-dictionary arrays

Do you mean something in DataFusion?

Perhaps here: https://github.com/apache/arrow-datafusion/blob/00627785718d9d98998021bf44585f32c33af3ea/datafusion/physical-expr/src/functions.rs#L340

Are you suggesting we handle evaluating scalar functions on DictionaryArrays in the generic function evaluate function rather than doing something special for string functions?

So the current state of play is we have a function to obtain a type erased function that operates on ColumnarValue and produces a ColumnarValue for non-dictionary arrays. I’m not sure if this is currently specialized to the array type, but it definitely could be.

It should therefore be trivial to make this same function recurse for the dictionary values type, and then return a type-erased function that operates on dictionaries by using the type-erased values function and manipulating its output.

This would be simpler, require minimal additional codegen, and naturally generalise to all unary functions and argument types.

TLDR you shouldn’t need to write any additional code specialized on anything other than the dictionary arrays themselves

I presume this is what you intend to do, but I would recommend implementing this by evaluating the function on the dictionary values using the existing kernels, and then passing the result from this into the take kernel with the dictionary keys as the second argument. This should not only be less code, but will avoid a codegen explosion from parameterising generic code on both dictionary key types and value types