datafusion: UDFs marked as volatile do not appear to be re-evaluated for each output row.

Describe the bug

It seems a UDF with no arguments is only called once, even if its signature is defined with Volatility::Volatile and it is queried against a table with multiple rows.

By this I mean, for example: select random_udf() from many_rows_table

There is a minimal repro here https://github.com/dadepo/df-repro
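
For context, the UDF in the repro is registered along these lines. This is a sketch reconstructed from the description above, not the repo's exact code, and create_udf's exact signature has shifted between DataFusion versions:

```rust
use std::sync::Arc;

use datafusion::arrow::array::Float64Array;
use datafusion::arrow::datatypes::DataType;
use datafusion::error::Result;
use datafusion::logical_expr::{create_udf, ColumnarValue, Volatility};
use datafusion::prelude::SessionContext;

fn register_random_normal(ctx: &SessionContext) {
    // Zero-argument UDF marked Volatile; the expectation is that it gets
    // re-evaluated for every output row.
    let udf = create_udf(
        "random_normal",
        vec![],                      // no arguments
        Arc::new(DataType::Float64), // return type (Arc<DataType> in older releases)
        Volatility::Volatile,
        Arc::new(|_args: &[ColumnarValue]| -> Result<ColumnarValue> {
            // Stand-in for a real RNG call; the body returns a one-element array.
            Ok(ColumnarValue::Array(Arc::new(Float64Array::from(vec![25.0]))))
        }),
    );
    ctx.register_udf(udf);
}
```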

When the repro is run, the output could be:

+-------+------+-----+-------+
| index | uint | int | float |
+-------+------+-----+-------+
| 1     | 2    | -2  | 1.0   |
| 2     | 3    | 3   | 3.3   |
| 3     |      |     |       |
+-------+------+-----+-------+
+---------------+
| random_normal |
+---------------+
| 25.0          |
| 25.0          |
| 25.0          |
+---------------+

In the run above, 25.0 is the value returned by the UDF, and the same value is repeated for every row.

To Reproduce

A minimal reproduction can be found here https://github.com/dadepo/df-repro

Expected behavior

The UDF should be evaluated once for each output row.

Additional context

No response

About this issue

  • State: closed
  • Created 6 months ago
  • Comments: 19 (12 by maintainers)

Most upvoted comments

@alamb @dadepo I can reproduce the reported issue. Because it is a separate issue, I created #9032 for it. I also created a PR https://github.com/apache/arrow-datafusion/pull/9031 to fix it.

cc @viirya as he fixed https://github.com/apache/arrow-datafusion/issues/8518

In fact, now that I write this I wonder if the issue @dadepo is hitting is actually the same as https://github.com/apache/arrow-datafusion/issues/8518 (the fix for which is not yet released, but should be in the next few days: #8863) 🤔

I will handle this issue 😄

I agree this is a bug. Thank you for the report @dadepo

Maybe @guojidan you have time to look into this issue as you have been looking at UDFs recently 🙏

I think the key thing a UDF writer needs to have is access to the number of output rows to produce. As long as they have this information, we can leave it to them to implement the volatile semantics internally.
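
For illustration, here is a sketch of what that could look like if the engine handed the row count to the UDF body. The number_rows parameter is an assumption here; how (or whether) the count is exposed depends on the DataFusion version:

```rust
use std::sync::Arc;

use datafusion::arrow::array::Float64Array;
use datafusion::logical_expr::ColumnarValue;
use rand::Rng;

// Given the number of output rows (assumed to be supplied by the engine),
// the UDF body can honor volatile semantics itself by drawing one fresh
// value per row.
fn random_normal_batch(number_rows: usize) -> ColumnarValue {
    let mut rng = rand::thread_rng();
    let values: Float64Array = (0..number_rows).map(|_| Some(rng.gen::<f64>())).collect();
    ColumnarValue::Array(Arc::new(values))
}
```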

I feel that it is not so reasonable to bind the nature of a volatile scalar function to some special way of expanding its arguments or to how DataFusion treats its output. The definition of a volatile scalar function doesn't include such a spec, so by doing this we might create something that is hard to understand from outside DataFusion.

The special treatment of arguments/output is specific to make_scalar_function, which is useful for special cases of built-in scalar functions to save implementation time. However, this assumption doesn't apply generally to all use cases, and that's the problem, as users can easily misuse it.

The good news is that we have ScalarUDFImpl now and we are moving to deprecate the previous way of defining scalar UDFs. I think for this issue we just need to deprecate make_scalar_function and make its doc clearer, as I proposed in #8878.
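
For reference, a zero-argument volatile function expressed with ScalarUDFImpl looks roughly like the following sketch (method names follow the trait as it stood around the time of this issue). Note that by itself this style still does not give the body the batch's row count:

```rust
use std::any::Any;
use std::sync::Arc;

use datafusion::arrow::array::Float64Array;
use datafusion::arrow::datatypes::DataType;
use datafusion::error::Result;
use datafusion::logical_expr::{ColumnarValue, ScalarUDF, ScalarUDFImpl, Signature, Volatility};

#[derive(Debug)]
struct RandomNormal {
    signature: Signature,
}

impl RandomNormal {
    fn new() -> Self {
        Self {
            // Zero-argument, volatile signature.
            signature: Signature::exact(vec![], Volatility::Volatile),
        }
    }
}

impl ScalarUDFImpl for RandomNormal {
    fn as_any(&self) -> &dyn Any {
        self
    }
    fn name(&self) -> &str {
        "random_normal"
    }
    fn signature(&self) -> &Signature {
        &self.signature
    }
    fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
        Ok(DataType::Float64)
    }
    // In this invoke form the batch size is not passed in, so the body
    // below still has the "how many rows?" problem discussed above.
    fn invoke(&self, _args: &[ColumnarValue]) -> Result<ColumnarValue> {
        Ok(ColumnarValue::Array(Arc::new(Float64Array::from(vec![25.0]))))
    }
}

// Registration: ctx.register_udf(ScalarUDF::new_from_impl(RandomNormal::new()));
```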

It is not because of whether it is volatile or not.

For a ScalarUDF that is defined with no arguments, or whose inputs are all scalars, DataFusion will assume its output is a scalar (you also return a one-element array from the random_normal UDF, but even if you didn't, DataFusion would take the element at index 0 to create a ScalarValue as the output).

That's why random_normal returns the same value for all input rows in the batch: it is treated as a scalar UDF that returns one scalar value for all input rows.
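
In other words, the effect being described is that the single value produced by the UDF is broadcast to every row of the batch. A conceptual sketch of that effect (not DataFusion's actual internals):

```rust
use std::sync::Arc;

use datafusion::arrow::array::{ArrayRef, Float64Array};

// The one value the UDF produced (25.0 above) is repeated for every row,
// which is exactly what the output table shows.
fn broadcast_scalar_result(udf_result: f64, num_rows: usize) -> ArrayRef {
    Arc::new(Float64Array::from(vec![udf_result; num_rows]))
}
```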

@alamb I updated the repro to include the native random function: https://github.com/dadepo/df-repro/commit/00feaed1069329680ffacc3ecd324fec9f2c2d45

Running that, I get:

+---------------+--------------------+
| random_normal | native_random      |
+---------------+--------------------+
| 82.0          | 0.2617618435183655 |
| 82.0          | 0.3157177219693452 |
| 82.0          | 0.3350644908486391 |
+---------------+--------------------+

This indicates that things work fine with the native random function.

Could it also be the way UDFs work? From my understanding, when a UDF has no arguments, the arguments it is called with are set to:

 [
    NullArray(1),
]

i.e. one array, regardless of the number of rows in the table. Could this be a factor?
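
For what it's worth, a small debug helper along these lines (a sketch, assuming the Fn(&[ColumnarValue]) implementation style) makes the same point: the placeholder argument has length 1, so the body cannot recover the batch's real row count from its inputs.

```rust
use datafusion::logical_expr::ColumnarValue;

// Print what a UDF body actually receives. For a zero-argument UDF this
// reports a single array of length 1, independent of the table's row count.
fn inspect_udf_args(args: &[ColumnarValue]) {
    for (i, arg) in args.iter().enumerate() {
        match arg {
            ColumnarValue::Array(a) => {
                println!("arg[{i}]: array with {} rows, type {:?}", a.len(), a.data_type());
            }
            ColumnarValue::Scalar(s) => println!("arg[{i}]: scalar {s:?}"),
        }
    }
}
```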