duckdb: Very slow join on small table (bad choice of join algorithm?)
What happens?
Very slow join on a small table.
This is extracted from a larger query on a multi-billion-row table, though the performance behavior can be observed at any scale.
To Reproduce
See attached zip: slow.zip
SELECT t1.col1
FROM read_parquet('slow3.parquet') AS t1
LEFT JOIN read_parquet('slow3.parquet') AS t2
ON t1.col1 = t2.col1
-- The following condition is always false on the reproducer dataset.
-- On the real world dataset, it is sometimes false.
AND t1.col1 = 'missing'
GROUP BY t1.col1
Timing (small/large):
DuckDB: 9 s / 35 s
chdb: 15 ms / 20 ms
NB: I don't know how to write the query above in Polars, but here are timings for a query without `AND t1.col1 = 'missing'`:
DuckDB: 10 s / 40 s
chdb: 1.5 s / 23 s
Polars (code below): 7 s / OOM
import polars as pl

df = pl.scan_parquet("slow3.parquet")
df.join(df, on=["col1"], how="left").select(pl.col.col1).group_by(pl.col.col1).agg().collect()
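As an aside (not part of the original report): which join operator DuckDB picks for the reproducer can be checked from the CLI with EXPLAIN ANALYZE, which prints the physical plan with per-operator timings:

EXPLAIN ANALYZE
SELECT t1.col1
FROM read_parquet('slow3.parquet') AS t1
LEFT JOIN read_parquet('slow3.parquet') AS t2
ON t1.col1 = t2.col1
AND t1.col1 = 'missing'
GROUP BY t1.col1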
Note
The queries and timings above are a simplified example. My real world query is a bit more complicated:
SELECT t1.col1
FROM ... AS t1
LEFT JOIN ... AS t2
ON t1.col1 = t2.col1
AND t1.col2 IN (... small list of const strings ...)
AND t2.col3 IN (... another small list of const strings ...)
GROUP BY t1.col1
OS:
macOS, Linux
DuckDB Version:
0.9.2
DuckDB Client:
CLI
Full Name:
Jonas Haag
Affiliation:
QuantCo
Have you tried this on the latest main branch?
I have tested with a release build (and could not test with a main build)
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- Yes, I have
About this issue
- Original URL
- State: open
- Created 6 months ago
- Reactions: 1
- Comments: 16 (11 by maintainers)
Please, can we stop discussing Polars here? I don’t want to make it more difficult for the maintainers to read through this issue 🙏
Ok, I think I found a workaround that gives you the same results. You just need to add the filter `t2.col1 = 'missing'`. This won't affect the result, since the join condition `t1.col1 = t2.col1` provides the transitivity: `t1.col1 = 'missing'` together with `t1.col1 = t2.col1` implies `t2.col1 = 'missing'`. There is a significant speedup, since t2 is filtered down so that only matching tuples are compared during the join operation, instead of all combinations of tuples. The query below gives you the same results and executes faster on the given data.
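The rewritten query from the original comment wasn't preserved in this extract; a minimal reconstruction, applying the suggested filter to the reproducer query, would be:

SELECT t1.col1
FROM read_parquet('slow3.parquet') AS t1
LEFT JOIN read_parquet('slow3.parquet') AS t2
ON t1.col1 = t2.col1
AND t1.col1 = 'missing'
-- Redundant by transitivity, but it references only t2, so it can be
-- pushed down into t2's scan before the join:
AND t2.col1 = 'missing'
GROUP BY t1.col1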
If you want to make it even faster, removing `t1.col1 = 'missing'` from the query above allows the optimizer to plan a hash group by operator for the join. Let me know if this works for you; I saw a speedup from about 9 seconds to 0.01 seconds. I'll look into implementing similar functionality in DuckDB. It should be doable, but I can't guarantee when it'll be merged.
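A sketch of that faster variant (my reconstruction, not a query from the thread). By the same transitivity argument, the ON clause below matches exactly the same row pairs as before, so the result is unchanged:

SELECT t1.col1
FROM read_parquet('slow3.parquet') AS t1
LEFT JOIN read_parquet('slow3.parquet') AS t2
ON t1.col1 = t2.col1
AND t2.col1 = 'missing'
GROUP BY t1.col1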
Hi @jonashaag, thanks for reporting this, I could reproduce it.
Sorry for the noise in the discussion, I cleaned up the thread now.