duckdb: When reading parquet, filter using IN () on two values much slower than equals on single value
What happens?
When reading a parquet dataset (no specific partitioning, so no pruning) via parquet_scan(), applying a filter via column IN (value1, value2) is much slower than filtering on a single value via column = value.
It is significantly slower than running the query twice (once per distinct value).
There is also some inconsistent behavior between duckdb.query() and conn = duckdb.connect(); conn.execute(): the former runs in parallel when using an IN clause while the latter does not. Adding an ORDER BY to the latter re-enables parallelism.
Running in parallel, or using something like CREATE TEMP TABLE AS ..., restores good performance, but at the cost of a serious increase in machine resources (e.g. the testcase runs in 1 s for a single value on a single core, and in 1.5 s with two values using 24 cores; running the query twice on a single core would take ~2 s).
To be clear, I'm raising this for the much slower performance of IN () compared to =, not duckdb.query vs conn.execute.
To Reproduce
This testcase produces ~1.5 GB of data; with larger directories the effect is magnified approximately linearly.
import os

import numpy as np
import pandas as pd
import duckdb

conn = duckdb.connect()
a = np.random.randint(0, 100, (100_000_000, 10))
df = pd.DataFrame(a, columns=[f"c{i}" for i in range(1, 11)])
# manually partitioning; an example could be a high-cardinality partitioning column
# that has multiple values in the same slice (or a dataset that is consistently appended to)
os.makedirs("/tmp/pq", exist_ok=True)
for i in range(10, 101, 10):
    df.query(f"c1 >= {i-10} and c1 < {i}").to_parquet(f"/tmp/pq/{i}.parquet")
# reading a single value the performance is about the same and does not need / use parallelism
%timeit -n3 -r2 duckdb.query("select * from parquet_scan('/tmp/pq/*.parquet') where c1 = 2").df()
%timeit -n3 -r2 conn.execute("select * from parquet_scan('/tmp/pq/*.parquet') where c1 = 2").df()
# with IN () the performance is drastically different; even the fast path (duckdb.query) takes a lot more resources
%timeit -n3 -r2 duckdb.query("select * from parquet_scan('/tmp/pq/*.parquet') where c1 in (2, 25)").df()
%timeit -n3 -r2 conn.execute("select * from parquet_scan('/tmp/pq/*.parquet') where c1 in (2, 25)").df()
# adding ORDER BY allows parallelism in conn.execute(); it's faster but takes the whole machine
%timeit -n3 -r2 conn.execute("select * from parquet_scan('/tmp/pq/*.parquet') where c1 in (2, 25) order by c1").df()
# the mismatch does NOT reproduce when creating a temp table, but it does go parallel and ends up using the whole machine
%timeit -n1 -r1 duckdb.query("create temp table t_duckdb as select * from parquet_scan('/tmp/pq/*.parquet') where c1 in (2, 25)")
%timeit -n1 -r1 conn.execute("create temp table t_conn as select * from parquet_scan('/tmp/pq/*.parquet') where c1 in (2, 25)")
OS:
Linux
DuckDB Version:
0.4.1-dev1175
DuckDB Client:
Python
Full Name:
Mauro Pagano
Affiliation:
Independent consultant
Have you tried this on the latest master branch?
- I agree
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- I agree
About this issue
- Original URL
- State: open
- Created 2 years ago
- Comments: 15 (6 by maintainers)
Bump to keep open
Is this problem planned to be solved? I hope it can be solved as soon as possible
I think it’s somewhat realistic for us to add AND filter semantics to the parquet reader. Of course there is a pathological case where there are thousands of those filters in an IN list or AND-ed together, but we can deal with that later.
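For intuition, pushing an IN-list filter down to Parquet row-group statistics could look roughly like the min/max pruning below. This is a plain-Python sketch of the general technique, not DuckDB's actual implementation; the statistics values are made up.

```python
# hypothetical row-group min/max statistics for column c1
row_groups = [
    {"id": 0, "min": 0, "max": 9},
    {"id": 1, "min": 10, "max": 29},
    {"id": 2, "min": 30, "max": 49},
]

def may_match(stats, in_values):
    # a row group can be skipped only if NO value in the IN list
    # falls inside its [min, max] range
    return any(stats["min"] <= v <= stats["max"] for v in in_values)

# for c1 IN (2, 25) only the first two groups can contain matches
kept = [g["id"] for g in row_groups if may_match(g, [2, 25])]
```

The pathological case mentioned above shows up here too: with thousands of IN-list values, may_match degrades toward scanning every group anyway.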
I'm having the same issue in #9251; it would be nice if this were fixed, as it's one of the most basic and common filtering operations. I'm willing to work on a fix if somebody can point me in the right direction.