datafusion: UDFs marked as volatile do not appear to evaluate multiple times for each output row.
Describe the bug
It seems that a UDF with no arguments is only called once, even if its signature is defined with Volatility::Volatile
and it is queried in the context of a table with multiple rows.
By this I mean, for example: select random_udf() from many_rows_table
There is a minimal repro here https://github.com/dadepo/df-repro
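For reference, the repro defines a zero-argument volatile UDF along these lines (a minimal sketch rather than the actual repro code; the create_udf signature varies between DataFusion versions, and the rand dependency is an assumption):

```rust
use std::sync::Arc;

use datafusion::arrow::array::Float64Array;
use datafusion::arrow::datatypes::DataType;
use datafusion::error::Result;
use datafusion::logical_expr::{create_udf, ColumnarValue, ScalarUDF, Volatility};

fn random_normal_udf() -> ScalarUDF {
    let fun = Arc::new(|_args: &[ColumnarValue]| -> Result<ColumnarValue> {
        // Draws a single value; since the UDF is marked Volatile, one would
        // expect it to be re-evaluated for every output row.
        let value = rand::random::<f64>();
        Ok(ColumnarValue::Array(Arc::new(Float64Array::from(vec![value]))))
    });
    create_udf(
        "random_normal",
        vec![],                      // no input arguments
        Arc::new(DataType::Float64), // return type
        Volatility::Volatile,
        fun,
    )
}
```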
When run, the output could be:
+-------+------+-----+-------+
| index | uint | int | float |
+-------+------+-----+-------+
| 1     | 2    | -2  | 1.0   |
| 2     | 3    | 3   | 3.3   |
| 3     |      |     |       |
+-------+------+-----+-------+
+---------------+
| random_normal |
+---------------+
| 25.0          |
| 25.0          |
| 25.0          |
+---------------+
In the run above, 25.0 is the result of the UDF, and the same value is repeated for every row.
To Reproduce
A minimal reproduction can be found here https://github.com/dadepo/df-repro
Expected behavior
The UDF should be evaluated once for each output row.
Additional context
No response
About this issue
- Original URL
- State: closed
- Created 6 months ago
- Comments: 19 (12 by maintainers)
@alamb @dadepo I can reproduce the reported issue. Because it is a separate issue, I created #9032 for it. I also created a PR https://github.com/apache/arrow-datafusion/pull/9031 to fix it.
cc @viirya as he fixed https://github.com/apache/arrow-datafusion/issues/8518
In fact, now that I write this, I wonder if the issue @dadepo is hitting is actually the same as https://github.com/apache/arrow-datafusion/issues/8518 (the fix for which is not yet released, but should be in the next few days; see #8863) 🤔
I will handle this issue 😄
I agree this is a bug. Thank you for the report @dadepo
Maybe @guojidan you have time to look into this issue as you have been looking at UDFs recently 🙏
I think the key thing a UDF writer needs is access to the number of output rows to produce. As long as they have this information, we can leave it to them to implement the volatile semantics internally.
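To illustrate the idea, here is a hypothetical sketch in which the implementation is handed the number of output rows; number_rows is an assumed parameter (not an API that existed at the time of this issue), and the rand dependency is an assumption:

```rust
use std::sync::Arc;

use datafusion::arrow::array::Float64Array;
use datafusion::error::Result;
use datafusion::logical_expr::ColumnarValue;

// Hypothetical: a zero-argument volatile UDF body that knows how many rows to produce.
fn invoke_with_row_count(_args: &[ColumnarValue], number_rows: usize) -> Result<ColumnarValue> {
    // One independent draw per output row, which is what volatile semantics require.
    let values = Float64Array::from_iter_values((0..number_rows).map(|_| rand::random::<f64>()));
    Ok(ColumnarValue::Array(Arc::new(values)))
}
```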
I feel that it is not reasonable to tie the semantics of a volatile scalar function to some special way of expanding its arguments, or to how DataFusion treats its output. The definition of a volatile scalar function doesn't include such a spec, so doing this could create behavior that is hard to understand from outside DataFusion.
The special treatment of arguments and output is specific to make_scalar_function, which is useful for special cases of built-in scalar functions because it saves implementation time. However, that assumption doesn't apply to all use cases, and that's the problem: users can easily misuse it. The good news is that we now have ScalarUDFImpl, and we are moving to deprecate the previous way of defining scalar UDFs. For this issue I think we just need to deprecate make_scalar_function and make its documentation clearer, as I proposed in #8878.
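For context, a function defined through the newer ScalarUDFImpl trait looks roughly like the sketch below (based on the trait as it stood around the DataFusion version of this issue, so the method set may differ in other releases; the rand dependency is an assumption, and note that a zero-argument invoke still has no way to know the row count):

```rust
use std::any::Any;
use std::sync::Arc;

use datafusion::arrow::array::Float64Array;
use datafusion::arrow::datatypes::DataType;
use datafusion::error::Result;
use datafusion::logical_expr::{ColumnarValue, ScalarUDFImpl, Signature, Volatility};

#[derive(Debug)]
struct RandomNormal {
    signature: Signature,
}

impl RandomNormal {
    fn new() -> Self {
        Self {
            // No arguments, declared volatile.
            signature: Signature::exact(vec![], Volatility::Volatile),
        }
    }
}

impl ScalarUDFImpl for RandomNormal {
    fn as_any(&self) -> &dyn Any {
        self
    }
    fn name(&self) -> &str {
        "random_normal"
    }
    fn signature(&self) -> &Signature {
        &self.signature
    }
    fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
        Ok(DataType::Float64)
    }
    fn invoke(&self, _args: &[ColumnarValue]) -> Result<ColumnarValue> {
        // With no arguments there is still no way to tell how many rows the
        // output should have, which is the crux of this issue.
        let value = rand::random::<f64>();
        Ok(ColumnarValue::Array(Arc::new(Float64Array::from(vec![value]))))
    }
}
```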
It is not about whether the function is volatile or not. For a ScalarUDF that is defined with no arguments, or whose inputs are all scalars, DataFusion assumes the output is a scalar (you also return a one-element array from the random_normal UDF, but even if you didn't, DataFusion would take the element at index 0 of the array to create a ScalarValue as the output). That's why random_normal returns the same value for all input rows in the batch: it is treated as a scalar UDF returning a single scalar value for all input rows.
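To make the described behavior concrete, here is an illustrative sketch (not the actual DataFusion internals) of how a one-element result ends up collapsed into a single ScalarValue that is then reused for every output row:

```rust
use std::sync::Arc;

use datafusion::arrow::array::{Array, Float64Array};
use datafusion::error::Result;
use datafusion::scalar::ScalarValue;

fn main() -> Result<()> {
    // Pretend this is the one-element array returned by the UDF for a batch.
    let udf_output: Arc<dyn Array> = Arc::new(Float64Array::from(vec![25.0]));
    // The element at index 0 is taken, regardless of how many rows the batch has...
    let scalar = ScalarValue::try_from_array(udf_output.as_ref(), 0)?;
    // ...and that single scalar (25.0 here) is then used for every output row.
    println!("{scalar:?}");
    Ok(())
}
```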
@alamb I updated the repro to include the native random function: https://github.com/dadepo/df-repro/commit/00feaed1069329680ffacc3ecd324fec9f2c2d45
Running that, I get output which indicates that the native random function works fine.
Could it also be the way UDFs work? From my understanding, when a UDF has no arguments, the arguments it is called with are set to a single Array, regardless of the number of rows in the table. Could this be a factor?