velox: Equality comparison for complex types doesn't match Presto semantics

Bug description

The results of comparing complex types in Velox do not match Presto when complex type values contain nulls.

Here are some examples using an array(bigint) type:

Velox result matches Presto:

presto:di> select array[1, null, 3] = array[1, 2, 3];
 _col0
-------
 NULL 

presto:di> select array[1, null, 3] = array[1, 2];
 _col0
-------
 false

presto:di> select array[1, null, 3] = array[1, null, 3];
 _col0
-------
 NULL

Velox returns NULL while Presto returns false:

presto:di> select array[1, null, 3] = array[1, null, 2];
 _col0
-------
 false

Velox unit test to reproduce:

TEST_F(ExprTest, xxx) {
  auto data = makeRowVector({
      makeArrayVectorFromJson<int32_t>({
          "[1, null, 3]",
          "[1, null, 3]",
          "[1, null, 3]",
          "[1, null, 3]",
      }),
      makeArrayVectorFromJson<int32_t>({
          "[1, 2, 3]",
          "[1, 2]",
          "[1, null, 3]",
          "[1, null, 2]",
      }),
  });

  auto r = evaluate("c0 = c1", data);
  LOG(ERROR) << r->toString();
  LOG(ERROR) << r->toString(0, 10);
}

0: null
1: false
2: null
3: null

Presto Java implementation of equality for arrays: com/facebook/presto/operator/scalar/ArrayEqualOperator.java

CC: @bikramSingh91 @laithsakka @kgpai @kagamiori

About this issue

  • Original URL
  • State: closed
  • Created 8 months ago
  • Reactions: 1
  • Comments: 23 (20 by maintainers)

Commits related to this issue

Most upvoted comments

@mbasmanova Spark has the same NULL (not nested null) comparison semantics as Presto. And Spark provides a null-safe equal operator (<=>) https://spark.apache.org/docs/3.3.1/sql-ref-null-semantics.html

Left Operand Right Operand > >= = < <= <=>
NULL Any value NULL NULL NULL NULL NULL False
Any value NULL NULL NULL NULL NULL NULL False
NULL NULL NULL NULL NULL NULL NULL True

In SQL, NULL indicates unknown value. When evaluating an expression with NULL, it may not be possible to arrive at a concrete answer, in which case the result is NULL (unknown).

For example, ‘NULL AND true’ is NULL because the answer can be true if unknown value is true or the answer can be false if unknown value is false.

SELECT true AND true; -- true
SELECT false AND true; -- false

SELECT NULL AND true; -- NULL 

On the other hand, ‘NULL AND false’ is false because whether unknown value is true or false, the answer is still false.

SELECT true AND false; -- false
SELECT false AND false; -- false

SELECT NULL AND false; -- false

This principle applies to any expression. The answer is NULL if there are 2 possible values of unknown value that produce different results.

Hence,

  • comparing (=, <, <=, >=) NULL scalar to any other value produces NULL.
  • AND-ing NULL with true is NULL. AND-ing NULL with false is false.
  • OR-ing NULL with true is true. OR-ing NULL with false is NULL.
  • NULL IN (<list of values>) is NULL if list is not empty and false otherwise.
  • non-NULL IN (<list of values>) is true if list contains the value, false if list doesn’t contain the value and doesn’t contain NULL, NULL if list doesn’t contain the value but contains a NULL.

Some examples,

SELECT null = 5; -- NULL

SELECT null < 5; -- NULL

SELECT null <= 5; -- NULL

SELECT null > 5; -- NULL

SELECT null >= 5; -- NULL

SELECT NULL and true; -- NULL

SELECT NULL and false; -- false

SELECT NULL or true; -- true

SELECT NULL or false; -- NULL

SELECT NULL in (1, 2, 3) -- NULL

SELECT 1 in (1, NULL, 3); -- true

SELECT 2 in (1, NULL, 3); -- NULL

Sorting a list of scalar values with NULLs produces a sorted list of non-NULL values with NULLs placed either at the start of the end of the list (based on NULLS FIRST / LAST clause).

Let’s consider comparison of arrays. Equality comparison returns false if array sizes don’t match. If sizes match and there is an index at which elements of the arrays compare false, the result is false. If all elements compare true, the result is true. If no elements compare false, but some compare NULL, the result is NULL.

boolean equals(a: array(T), b: array(T)):

  if (a.size != b.size) {
    return false;
  }

  boolean hasNull = false;
  for i : range(0, a.size) {
    r = equals(a[i], b[i])

    if (r IS NULL) {
      hasNull = true;
    } else if (r IS FALSE) {
      return false
    }
  }

  return hasNull ? NULL : true

Examples:

SELECT array[1, 2, 3] = array[1, null, 3]; -- NULL

SELECT array[1, null, 4] = array[1, null, 3]; -- false

Less than comparison for arrays goes like this:

boolean lessThan(a: array(T), b: array(T)):
  n = min(a.size, b.size)

  for i : range(0, n):
    r = lessThan(a[i], b[i])

    if (r IS NULL) {
      return NULL
    } 

    if (r >= 0) {
      return false
    }

  return a.size < b.size

Examples:

SELECT array[1, 2] < array[1, 3, null]; -- true

SELECT array[1, null] < array[1, 3, null]; -- NULL

SELECT array[1, null, null] < array[1, 3, null]; -- NULL

But there is a catch. Presto doesn’t return NULL from <, >, <=, >=, but throws.

presto:di> SELECT array[1, 2] < array[1, 2, null];
 _col0
-------
 true

presto:di> select array[1, 2, null] < array[1, 2, null];
Query 20231115_134916_30771_ymfsk failed: ARRAY comparison not supported for arrays with null elements

presto:di> SELECT array[1, null, null] < array[1, 3];
Query 20231115_135317_31020_ymfsk failed: ARRAY comparison not supported for arrays with null elements

Yet, Presto returns NULL for =

presto:di> SELECT array[1, 2, null] = array[1, 2, null];
 _col0
-------
 NULL

presto:di> SELECT array[1, 2, null] < array[1, 2, null];
Query 20231115_142116_31619_ymfsk failed: ARRAY comparison not supported for arrays with null elements

See https://github.com/prestodb/presto/issues/20965 issue in PrestoDB.

Now, let’s consider sorting array values (not sorting elements within an array, but sorting lists of arrays). Sorting requires a less-than operator that returns either true or false. The less-than operator we have may also return NULL, hence, it cannot be used for sorting.

For example, what would be the result of sorting a list of 3 arrays: [[1, null, 2], [1, null, 3], [2, 3]]? [1, null, 2] is clearly less than [2, 3]. [1, null, 3] is also less than [2, 3]. Hence, [2, 3] goes last. Now, which of [1, null, 2], and [1, null, 3] goes first? This cannot be decided because it depends on what is that unknown value (NULL). It can be [1, 1, 2] vs. [1, 3, 3] or [1, 3, 2] vs. [1, 1, 3]. It is not possible to produce a sorted list of arrays if some of the arrays cannot be compared (e.g. comparator returns NULL).

For scalar values, a solution to this problem is to specify whether NULL values come first or last in the sorted list. This works because there is a single scalar value that doesn’t compare to other values (NULL). This solution doesn’t work for arrays because there are many values that do not compare to each other and therefore we cannot simply say that every value that compares as NULL goes first or last. In the example above, we can’t define whether the result should be [[1, null, 2], [1, null, 3], [2, 3]] or [[1, null, 3], [1, null, 2], [2, 3]] and having non-deterministic sort is… not very useful.

A possible solution is to say that sorting arrays that contain nulls is not supported and throw an exception if one or more arrays in the list contain nulls. The complete semantics than would be the following:

  • Comparing two arrays (=, <, >, <=, >=) returns true, false, or NULL (see details above).
  • Sorting lists of arrays is supported as long as none of the arrays contains a null (an array can be null, but cannot contain a null). Sorting uses < comparison defined above.

@duanmeng Thank you for sharing Spark results for array comparison. Is this the case that Spark never returns a NULL (if neither arrays are NULL)?

@mbasmanova I believe so.

It seems that current Velox behavior doesn’t match Spark semantics either. I also see that Spark doesn’t register equalto function for complex types.

I am not sure about this, I will investigate it.

@mbasmanova It seems that the null-handing mode is NoStop in Spark, while Presto applies a mix of NoStop and StopAtNull modes.

scala> spark.sql("select array(1, null, 3) = array(1, 2, 3)").show()
+------------------------------------+                                          
|(array(1, NULL, 3) = array(1, 2, 3))|
+------------------------------------+
|                               false|
+------------------------------------+

scala> spark.sql("select array(1, null, 3) = array(1, 2)").show()
+---------------------------------+                                             
|(array(1, NULL, 3) = array(1, 2))|
+---------------------------------+
|                            false|
+---------------------------------+

scala> spark.sql("select array(1, null, 3) = array(1, null, 3)").show()
+---------------------------------------+                                       
|(array(1, NULL, 3) = array(1, NULL, 3))|
+---------------------------------------+
|                                   true|
+---------------------------------------+

scala> spark.sql("select array(1, null, 3) = array(1, null, 2)").show()
+---------------------------------------+
|(array(1, NULL, 3) = array(1, NULL, 2))|
+---------------------------------------+
|                                  false|
+---------------------------------------+