dask: read_parquet filter bug with nulls

Describe the issue: When using dd.read_parquet on a column with nulls, filtering on null equality doesn't work, and != against another value also ends up removing the nulls.

Minimal Complete Verifiable Example:

dd.read_parquet("test.parquet", filters=[("a", "==", np.nan)]).compute() # empty
dd.read_parquet("test.parquet", filters=[("a", "!=", np.nan)]).compute() # works
dd.read_parquet("test.parquet", filters=[("a", "!=", 1)]).compute() # empty
dd.read_parquet("test.parquet", filters=[("a", "==", None)]).compute() # empty
dd.read_parquet("test.parquet", filters=[("b", "!=", 2)]).compute() # 13 rows instead of 14
dd.read_parquet("test.parquet", filters=[("c", "!=", "a")]).compute() # empty
dd.read_parquet("test.parquet", filters=[("c", "!=", None)]).compute() # empty
dd.read_parquet("test.parquet", filters=[("c", "!=", "")]).compute() # works

For table creation:

import pandas as pd
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

from dask import dataframe as dd

df = pd.DataFrame()
df["a"] = [1, None]*5 + [None]*5
df["b"] = np.arange(14).tolist() + [None]
df["c"] = ["a", None]*2 + [None]*11
pa_table = pa.Table.from_pandas(df)
pq.write_table(pa_table, "test.parquet", row_group_size=10)

Anything else we need to know?:

Environment:

  • Dask version: 2023.1.1
  • Python version: 3.9
  • Operating System: Ubuntu 20.04
  • Install method (conda, pip, source): pip

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 22 (22 by maintainers)

Most upvoted comments

That is, it would be nice if the internal convert_single_predicate helper could apply the necessary is_null method when val is np.nan or None.

~It's up to the dask developers to decide (since this helper lives here), but~ personally I am not fully convinced this is a good idea. To be honest, I think this is where the simple list-of-tuples (col, op, val) notation falls short. With plain numpy you also can't check for NaN directly with such a construct (one could do col != col, but then the value is not a scalar, or something like col < -inf), and that is exactly why np.isnan exists. Allowing ("col", "==", np.nan) in the filters, i.e. an expression that doesn't actually work that way outside of this context, might be confusing/misleading.
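The NaN comparison behavior referred to here can be checked with plain numpy (a minimal illustration, not dask-specific):

```python
import numpy as np

a = np.array([1.0, np.nan, 2.0])

# NaN never compares equal to anything, including itself:
print(a == np.nan)   # [False False False]
print(a != a)        # [False  True False] -- the "col != col" trick
print(np.isnan(a))   # [False  True False] -- the dedicated predicate
```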

Also, as a sidenote: be aware that Parquet actually has NaN in addition to null. So with the direct pyarrow.dataset interface, one can do both is_null and is_nan.