velox: Equality comparison for complex types doesn't match Presto semantics
Bug description
The results of comparing complex types in Velox do not match Presto when complex type values contain nulls.
Here are some examples using an array(bigint) type:
Velox result matches Presto:
presto:di> select array[1, null, 3] = array[1, 2, 3];
_col0
-------
NULL
presto:di> select array[1, null, 3] = array[1, 2];
_col0
-------
false
presto:di> select array[1, null, 3] = array[1, null, 3];
_col0
-------
NULL
Velox returns NULL while Presto returns false:
presto:di> select array[1, null, 3] = array[1, null, 2];
_col0
-------
false
Velox unit test to reproduce:
TEST_F(ExprTest, xxx) {
auto data = makeRowVector({
makeArrayVectorFromJson<int32_t>({
"[1, null, 3]",
"[1, null, 3]",
"[1, null, 3]",
"[1, null, 3]",
}),
makeArrayVectorFromJson<int32_t>({
"[1, 2, 3]",
"[1, 2]",
"[1, null, 3]",
"[1, null, 2]",
}),
});
auto r = evaluate("c0 = c1", data);
LOG(ERROR) << r->toString();
LOG(ERROR) << r->toString(0, 10);
}
0: null
1: false
2: null
3: null
Presto Java implementation of equality for arrays: com/facebook/presto/operator/scalar/ArrayEqualOperator.java
About this issue
- Original URL
- State: closed
- Created 8 months ago
- Reactions: 1
- Comments: 23 (20 by maintainers)
Commits related to this issue
- Rename NoStop null handling mode to NullAsValue Summary: This is the first diff of a sequence of diffs implementing the proposed solution in https://github.com/facebookincubator/velox/issues/7536#iss... — committed to laithsakka/velox by laithsakka 8 months ago
- Rename NoStop null handling mode to NullAsValue Summary: This is the first diff of a sequence of diffs implementing the proposed solution in https://github.com/facebookincubator/velox/issues/7536#iss... — committed to laithsakka/velox by laithsakka 8 months ago
- Rename NoStop null handling mode to NullAsValue (#7599) Summary: This is the first diff of a sequence of diffs implementing the proposed solution in https://github.com/facebookincubator/velox/issues... — committed to laithsakka/velox by laithsakka 7 months ago
- Rename NoStop null handling mode to NullAsValue (#7599) Summary: Pull Request resolved: https://github.com/facebookincubator/velox/pull/7599 This is the first diff of a sequence of diffs implementin... — committed to facebookincubator/velox by laithsakka 7 months ago
- Fix presto null handling in comparison operations. Summary: Velox used to depend to use the StopAtNull mode to implement presto semantics of comparisons. However as discussed in https://github.com/fa... — committed to laithsakka/velox by laithsakka 7 months ago
- Fix presto null handling in comparison operations. (#7703) Summary: Velox used to depend to use the StopAtNull mode to implement presto semantics of comparisons. However as discussed in https://git... — committed to laithsakka/velox by laithsakka 7 months ago
- Fix presto null handling in comparison operations. (#7703) Summary: Velox used to depend to use the StopAtNull mode to implement presto semantics of comparisons. However as discussed in https://git... — committed to laithsakka/velox by laithsakka 7 months ago
- Fix presto null handling in comparison operations. (#7703) Summary: Velox used to depend to use the StopAtNull mode to implement presto semantics of comparisons. However as discussed in https://git... — committed to laithsakka/velox by laithsakka 6 months ago
- Fix presto null handling in comparison operations. (#7703) Summary: Velox used to depend to use the StopAtNull mode to implement presto semantics of comparisons. However as discussed in https://git... — committed to laithsakka/velox by laithsakka 6 months ago
- Fix presto null handling in comparison operations. (#7703) Summary: Velox used to depend to use the StopAtNull mode to implement presto semantics of comparisons. However as discussed in https://git... — committed to laithsakka/velox by laithsakka 6 months ago
- Fix presto null handling in comparison operations. (#7703) Summary: Velox used to depend to use the StopAtNull mode to implement presto semantics of comparisons. However as discussed in https://git... — committed to laithsakka/velox by laithsakka 6 months ago
- Fix presto null handling in comparison operations. (#7703) Summary: Velox used to depend to use the StopAtNull mode to implement presto semantics of comparisons. However as discussed in https://git... — committed to laithsakka/velox by laithsakka 6 months ago
- Fix presto null handling in comparison operations. (#7703) Summary: Pull Request resolved: https://github.com/facebookincubator/velox/pull/7703 Velox used to depend to use the StopAtNull mode to imp... — committed to facebookincubator/velox by laithsakka 6 months ago
- Fix presto null handling in comparison operations. (#7703) Summary: Pull Request resolved: https://github.com/facebookincubator/velox/pull/7703 Velox used to depend to use the StopAtNull mode to imp... — committed to liujiayi771/velox by laithsakka 6 months ago
@mbasmanova Spark has the same NULL (not nested null) comparison semantics as Presto. And Spark provides a null-safe equal operator (
<=>
) https://spark.apache.org/docs/3.3.1/sql-ref-null-semantics.htmlIn SQL, NULL indicates unknown value. When evaluating an expression with NULL, it may not be possible to arrive at a concrete answer, in which case the result is NULL (unknown).
For example, ‘NULL AND true’ is NULL because the answer can be true if unknown value is true or the answer can be false if unknown value is false.
On the other hand, ‘NULL AND false’ is false because whether unknown value is true or false, the answer is still false.
This principle applies to any expression. The answer is NULL if there are 2 possible values of unknown value that produce different results.
Hence,
<list of values>
) is NULL if list is not empty and false otherwise.<list of values>
) is true if list contains the value, false if list doesn’t contain the value and doesn’t contain NULL, NULL if list doesn’t contain the value but contains a NULL.Some examples,
Sorting a list of scalar values with NULLs produces a sorted list of non-NULL values with NULLs placed either at the start of the end of the list (based on NULLS FIRST / LAST clause).
Let’s consider comparison of arrays. Equality comparison returns false if array sizes don’t match. If sizes match and there is an index at which elements of the arrays compare false, the result is false. If all elements compare true, the result is true. If no elements compare false, but some compare NULL, the result is NULL.
Examples:
Less than comparison for arrays goes like this:
Examples:
But there is a catch. Presto doesn’t return NULL from <, >, <=, >=, but throws.
Yet, Presto returns NULL for =
See https://github.com/prestodb/presto/issues/20965 issue in PrestoDB.
Now, let’s consider sorting array values (not sorting elements within an array, but sorting lists of arrays). Sorting requires a less-than operator that returns either true or false. The less-than operator we have may also return NULL, hence, it cannot be used for sorting.
For example, what would be the result of sorting a list of 3 arrays: [[1, null, 2], [1, null, 3], [2, 3]]? [1, null, 2] is clearly less than [2, 3]. [1, null, 3] is also less than [2, 3]. Hence, [2, 3] goes last. Now, which of [1, null, 2], and [1, null, 3] goes first? This cannot be decided because it depends on what is that unknown value (NULL). It can be [1, 1, 2] vs. [1, 3, 3] or [1, 3, 2] vs. [1, 1, 3]. It is not possible to produce a sorted list of arrays if some of the arrays cannot be compared (e.g. comparator returns NULL).
For scalar values, a solution to this problem is to specify whether NULL values come first or last in the sorted list. This works because there is a single scalar value that doesn’t compare to other values (NULL). This solution doesn’t work for arrays because there are many values that do not compare to each other and therefore we cannot simply say that every value that compares as NULL goes first or last. In the example above, we can’t define whether the result should be [[1, null, 2], [1, null, 3], [2, 3]] or [[1, null, 3], [1, null, 2], [2, 3]] and having non-deterministic sort is… not very useful.
A possible solution is to say that sorting arrays that contain nulls is not supported and throw an exception if one or more arrays in the list contain nulls. The complete semantics than would be the following:
@mbasmanova I believe so.
I am not sure about this, I will investigate it.
@mbasmanova It seems that the null-handing mode is
NoStop
in Spark, while Presto applies a mix ofNoStop
andStopAtNull
modes.