velox: Thoughts on fixing semantic conflict issues for common functions

Description

As we know, for Presto functions under the prestosql folder, when their semantics conflict with Spark's, we generally fix the conflict by re-implementing the function under the sparksql folder.

Recently, we found semantic conflicts in some common functions, like cast. For example, Spark allows a decimal string such as "123.5" to be cast to int, but Velox currently produces null for that case. This could be fixed by introducing a config that lets the user choose the expected behavior; I note Velox already has a config called kCastIntByTruncate that fixes a similar issue. But if multiple configs are introduced, the code becomes complex and hard to maintain. Another thought is to separate the implementations, as we did with the sparksql and prestosql folders, at the cost of some duplicated code.
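To make the decimal-string case concrete, below is a minimal sketch (not Velox's actual API; the function name and flag are hypothetical) of the config-flag approach: a single boolean, similar in spirit to kCastIntByTruncate, switches between a Presto-like strict parse and a Spark-like truncating parse.

```cpp
#include <cassert>
#include <cctype>
#include <optional>
#include <string>

// Hypothetical sketch: cast a string to int32, with a config flag that
// controls whether a decimal string like "123.5" is accepted (Spark-like,
// truncate the fraction) or rejected (Presto-like, produce null).
std::optional<int32_t> castStringToInt(
    const std::string& input,
    bool allowDecimalString) {
  size_t pos = 0;
  long value;
  try {
    value = std::stol(input, &pos);
  } catch (...) {
    return std::nullopt; // Not a number at all.
  }
  if (pos == input.size()) {
    return static_cast<int32_t>(value); // Plain integer string.
  }
  // Trailing characters: only acceptable if they form a fractional part
  // and the Spark-style flag is set.
  if (allowDecimalString && input[pos] == '.') {
    for (size_t i = pos + 1; i < input.size(); ++i) {
      if (!std::isdigit(static_cast<unsigned char>(input[i]))) {
        return std::nullopt;
      }
    }
    return static_cast<int32_t>(value); // Truncate toward zero.
  }
  return std::nullopt; // Presto-like: "123.5" is not a valid int.
}
```

The concern raised above is that each such behavioral difference would add another flag and another branch like this one.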

@mbasmanova @pedroerp, do you have any idea? Thanks!

About this issue

  • State: open
  • Created a year ago
  • Comments: 15 (15 by maintainers)

Most upvoted comments

Below is a table summarizing the differences between the Spark and Presto CAST implementations we noticed. Some of them have already been solved via configs, like cast_to_int_by_truncate and cast_string_to_date_is_iso_8601.

| Conversion | Case | Presto result | Spark result |
| --- | --- | --- | --- |
| floating-point -> integral | SELECT cast(12345.67 as bigint); | 12346 | 12345 |
| | SELECT cast(127.8 as tinyint); | out of range | 127 |
| | SELECT cast(129.9 as tinyint); | out of range | -127 |
| | SELECT cast(infinity() as bigint); | out of range | 9223372036854775807 |
| decimal -> integral | SELECT cast(2.56 decimal(6, 2) as integer); | 3 | 2 |
| | SELECT cast(decimal '300.001' as tinyint); | out of range | 44 |
| string -> integral | SELECT cast('12345.67' as int); | invalid | 12345 |
| string -> decimal | select cast('  1.23  ' as decimal(38, 0)); | invalid | 123 |
| string -> date | select cast('2015' as date); | invalid | 2015-01-01 |
| | and some other string patterns supported by Spark | | |
| timestamp -> string | select cast(cast('2015-03-18 12:03:17' as timestamp) as varchar); | 2015-03-18 12:03:17.000 | 2015-03-18 12:03:17 |
| string -> timestamp | select cast('2015-03-18T12:03:17' as timestamp) | invalid | 2015-03-18 12:03:17 |
| | select cast('0001' as timestamp) | invalid | 0001-01-01 00:00:00 |
| | and some other string patterns supported by Spark | | |
| string -> integral/floating-point/decimal/date/timestamp | SELECT cast('\t\n\u001F 123\u000B' as int); | invalid | 123 |
| | and cast from string to other types, which also requires removing leading/trailing UTF-8 whitespace before the cast | | |

The link below lists many cases demonstrating Presto's cast behavior. I will do a corresponding check for Spark. https://github.com/facebookincubator/velox/blob/main/velox/docs/functions/presto/conversion.rst

@pedroerp This might be a good idea. Maybe someone can prototype this to see what it looks like.

@mbasmanova @PHILO-HE given the many corner-case differences in cast behavior between these engines, I wonder if we should provide a nicer way for engines to specialize the cast code instead of relying on flags. As with scalar functions, maybe we could provide a baseline cast implementation (the base CastExpr class) following, say, Presto semantics, and let engines override its specific conversion methods?
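A rough sketch of that idea (class and method names are hypothetical, not Velox's real API): a base class implements Presto-style conversions, and a Spark subclass overrides only the conversions whose semantics differ.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Baseline cast following Presto semantics: a decimal string is not a
// valid integer, so the cast yields null (nullopt here).
class CastExprBase {
 public:
  virtual ~CastExprBase() = default;

  virtual std::optional<int64_t> castStringToBigint(
      const std::string& s) const {
    size_t pos = 0;
    try {
      int64_t v = std::stoll(s, &pos);
      return pos == s.size() ? std::optional<int64_t>(v) : std::nullopt;
    } catch (...) {
      return std::nullopt;
    }
  }
};

// Spark specialization: override just this conversion to accept a
// decimal string by truncating the fractional part. (Simplified: real
// Spark also validates the fractional digits and trims whitespace.)
class SparkCastExpr : public CastExprBase {
 public:
  std::optional<int64_t> castStringToBigint(
      const std::string& s) const override {
    size_t pos = 0;
    try {
      int64_t v = std::stoll(s, &pos);
      if (pos == s.size() || s[pos] == '.') {
        return v;
      }
      return std::nullopt;
    } catch (...) {
      return std::nullopt;
    }
  }
};
```

Each engine would then instantiate its own subclass, keeping all flag-free Presto behavior in the base class and only the divergent conversions in the overrides.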

Hi @mbasmanova, it seems there is no Spark doc describing the semantics of its cast function. I will try to summarize one. Thanks!