duckdb: When reading parquet, filter using IN () on two values much slower than equals on single value
What happens?
When reading a parquet dataset (no specific partitioning, so no pruning) via parquet_scan(), applying a filter via column IN (value1, value2) is much slower than filtering on a single value via column = value.
It is significantly slower than running the query twice (once per distinct value).
There is also some inconsistent behavior between duckdb.query() and conn = duckdb.connect(); conn.execute(): the former runs in parallel when using an IN clause while the latter does not. Adding an ORDER BY to the latter re-enables parallelism.
Running in parallel, or using something like CREATE TEMP TABLE AS ..., restores good performance, but at the cost of a serious increase in machine resources (e.g. the testcase runs in 1 s for a single value on a single core, and in 1.5 s with two values using 24 cores; running the query twice on a single core would take ~2 s).
To be clear, I'm raising this for the much slower performance of IN () compared to =, not duckdb.query vs conn.execute.
To Reproduce
This testcase produces ~1.5 GB of data; with larger directories the effect is magnified approximately linearly.
import os

import numpy as np
import pandas as pd
import duckdb

conn = duckdb.connect()
a = np.random.randint(0, 100, (100_000_000, 10))
df = pd.DataFrame(a, columns=[f"c{i}" for i in range(1, 11)])
# manually partitioning; an example could be a high-cardinality partitioning column
# that has multiple values in the same slice (or a dataset that is consistently appended to)
os.makedirs("/tmp/pq", exist_ok=True)
for i in range(10, 101, 10):
    df.query(f"c1 >= {i-10} and c1 < {i}").to_parquet(f"/tmp/pq/{i}.parquet")
# reading a single value the performance is about the same and does not need / use parallelism
%timeit -n3 -r2 duckdb.query("select * from parquet_scan('/tmp/pq/*.parquet') where c1 = 2").df()
%timeit -n3 -r2 conn.execute("select * from parquet_scan('/tmp/pq/*.parquet') where c1 = 2").df()
# with IN () the performance is drastically different; even the fast path (duckdb.query) takes a lot more resources
%timeit -n3 -r2 duckdb.query("select * from parquet_scan('/tmp/pq/*.parquet') where c1 in (2, 25)").df()
%timeit -n3 -r2 conn.execute("select * from parquet_scan('/tmp/pq/*.parquet') where c1 in (2, 25)").df()
# adding ORDER BY allows parallelism in conn.execute(); it's faster but takes the whole machine
%timeit -n3 -r2 conn.execute("select * from parquet_scan('/tmp/pq/*.parquet') where c1 in (2, 25) order by c1").df()
# the mismatch does NOT reproduce when creating a temp table, but it does go parallel and ends up using the whole machine
%timeit -n1 -r1 duckdb.query("create temp table t_duckdb as select * from parquet_scan('/tmp/pq/*.parquet') where c1 in (2, 25)")
%timeit -n1 -r1 conn.execute("create temp table t_conn as select * from parquet_scan('/tmp/pq/*.parquet') where c1 in (2, 25)")
OS:
Linux
DuckDB Version:
0.4.1-dev1175
DuckDB Client:
Python
Full Name:
Mauro Pagano
Affiliation:
Independent consultant
Have you tried this on the latest master branch?
- I agree
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- I agree
About this issue
- Original URL
- State: open
- Created 2 years ago
- Comments: 15 (6 by maintainers)
Bump to keep open
Is this problem planned to be solved? I hope it can be solved as soon as possible
I think it’s somewhat realistic for us to add AND filter semantics to the parquet reader. Of course there is a pathological case where there are thousands of those filters in an IN list or AND-ed together, but we can deal with that later.
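For intuition, pushing an IN-list filter down to Parquet row-group statistics could look roughly like the min/max pruning below. This is a plain-Python sketch of the general technique, not DuckDB's actual implementation; the statistics values are made up.

```python
# hypothetical row-group min/max statistics for column c1
row_groups = [
    {"id": 0, "min": 0, "max": 9},
    {"id": 1, "min": 10, "max": 29},
    {"id": 2, "min": 30, "max": 49},
]

def may_match(stats, in_values):
    # a row group can be skipped only if NO value in the IN list
    # falls inside its [min, max] range
    return any(stats["min"] <= v <= stats["max"] for v in in_values)

# for c1 IN (2, 25) only the first two groups can contain matches
kept = [g["id"] for g in row_groups if may_match(g, [2, 25])]
```

The pathological case mentioned above shows up here too: with thousands of IN-list values, may_match degrades toward scanning every group anyway.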
I'm having the same issue in #9251; it would be nice if this were fixed, as it's one of the most basic and common filtering operations. I'm willing to work on a fix if somebody can point me in the right direction.