velox: Thoughts on fixing semantic conflict issues for common functions
Description
As we know, for presto function under prestosql folder, if there are some semantic conflict issues for spark, we generally fix them by re-implementing another function under sparksql folder.
Recently, we found there are some semantic conflict issues in some common functions, like cast, e.g., spark allows decimal string, like "123.5", to be cast to int type, but currently velox produces null for such case. This issue can be fixed by introducing a config to let user to control the expected behavior. I note velox has a config called kCastIntByTruncate to fix a similar issue. But if multiple configs are introduced, it makes the code complex and hard to maintain. Another thought is, we can separate the implementations, like what we did under sparksql & prestosql folders. Such approach can produce a few repeated code.
@mbasmanova @pedroerp, do you have any idea? Thanks!
About this issue
- Original URL
- State: open
- Created a year ago
- Comments: 15 (15 by maintainers)
Commits related to this issue
- Add Spark cast expr based on cast hooks (#7377) Summary: There are several corner cases of semantic differences on cast between Presto and Spark. According to the implementation of `registerFunctionC... — committed to facebookincubator/velox by rui-mo 6 months ago
- Add Spark cast expr based on cast hooks (#7377) Summary: There are several corner cases of semantic differences on cast between Presto and Spark. According to the implementation of `registerFunctionC... — committed to liujiayi771/velox by rui-mo 6 months ago
Below is a table summarizing the differences between Spark and Presto CAST implementations we noticed. Some of them have been solved via config, like
cast_to_int_by_truncateandcast_string_to_date_is_iso_8601.There are a lot of cases listed in the below link for demonstrating presto’s cast behavior. Will do a check for spark. https://github.com/facebookincubator/velox/blob/main/velox/docs/functions/presto/conversion.rst
Find a new semantic conflict https://github.com/facebookincubator/velox/pull/5307#discussion_r1307094037
@pedroerp This might be a good idea. Maybe someone can prototype this to see what it looks like.
@mbasmanova @PHILO-HE given the many corner case differences between the cast behavior between these engines, I wonder if we should provide a nicer way for engines to specialize the cast code instead of relying on flags. Like scalar functions, maybe we could provide a baseline cast implementation (the base CastExpr class) following, say, Presto semantic, and let engines overwrite its specific conversion methods?
Hi @mbasmanova, it seems there is no spark doc to describe the semantics for cast functions. I will try to summarize one. Thanks!