duckdb: Very slow join on small table (bad choice of join algorithm?)

What happens?

Very slow join on a small table.

This is extracted from a larger query on a multi-billion-row table, but the performance behavior can be observed at any scale.

To Reproduce

See attached zip: slow.zip
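
The zip isn’t inlined here; as a hedged stand-in (the exact contents of slow3.parquet are an assumption, the only stated property being that 'missing' never occurs in col1), a single-column Parquet file generated like this in DuckDB should show the same behavior:

-- Hypothetical reproducer data: one VARCHAR column, no 'missing' values.
COPY (
    SELECT 'value_' || CAST(range AS VARCHAR) AS col1
    FROM range(100000)
) TO 'slow3.parquet' (FORMAT PARQUET);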

SELECT t1.col1
FROM read_parquet('slow3.parquet') AS t1
LEFT JOIN read_parquet('slow3.parquet') AS t2
    ON t1.col1 = t2.col1
    -- The following condition is always false on the reproducer dataset.
    -- On the real world dataset, it is sometimes false.
    AND t1.col1 = 'missing'
GROUP BY t1.col1

Timings (small / large dataset):

  • DuckDB: 9 s / 35 s
  • chdb: 15 ms / 20 ms

NB: I don’t know how to write the query above in Polars, but here are timings for a query without AND t1.col1 = 'missing':

  • DuckDB: 10 s / 40 s
  • chdb: 1.5 s / 23 s
  • Polars (code below): 7 s / OOM

import polars as pl

df = pl.scan_parquet("slow3.parquet")
# Self-join on col1, then group by col1 with no aggregates (mirrors the SQL above).
df.join(df, on=["col1"], how="left").select(pl.col("col1")).group_by(pl.col("col1")).agg().collect()

Note

The queries and timings above are a simplified example. My real world query is a bit more complicated:

SELECT t1.col1
FROM ... AS t1
LEFT JOIN ... AS t2
    ON t1.col1 = t2.col1
    AND t1.col2 IN (... small list of const strings ...)
    AND t2.col3 IN (... another small list of const strings ...)
GROUP BY t1.col1

OS:

macOS, Linux

DuckDB Version:

0.9.2

DuckDB Client:

CLI

Full Name:

Jonas Haag

Affiliation:

QuantCo

Have you tried this on the latest main branch?

I have tested with a release build (and could not test with a main build)

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • Yes, I have

About this issue

  • State: open
  • Created 6 months ago
  • Reactions: 1
  • Comments: 16 (11 by maintainers)

Most upvoted comments

Please, can we stop discussing Polars here? I don’t want to make it more difficult for the maintainers to read through this issue 🙏

Ok, I think I found a workaround that gives you the same results: you just need to add the filter t2.col1 = 'missing'. This doesn’t affect the result, since the filter t1.col1 = t2.col1 provides transitivity. There is a significant speedup because t2 is filtered down, so only matching tuples are compared during the join instead of all combinations of tuples. The query below gives the same results and executes faster on the given data:

SELECT t1.col1
FROM read_parquet('slow3.parquet') AS t1
LEFT JOIN read_parquet('slow3.parquet') AS t2
    ON t1.col1 = t2.col1
    AND t1.col1 = 'missing'
    AND t2.col1 = 'missing'
GROUP BY t1.col1;

If you want to make it even faster, removing the t1.col1 = 'missing' condition from the query above allows the optimizer to plan a hash group-by operator for the join:

SELECT t1.col1
FROM read_parquet('slow3.parquet') AS t1
LEFT JOIN read_parquet('slow3.parquet') AS t2
    ON t1.col1 = t2.col1
    AND t2.col1 = 'missing'
GROUP BY t1.col1;

Let me know if this works for you; I saw a speedup from about 9 seconds to 0.01 seconds. I’ll look into implementing similar functionality in DuckDB. It should be doable, but I can’t guarantee when it’ll be merged.
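
To verify that the rewrite actually changes the plan, you can prefix the query with EXPLAIN (standard DuckDB, not shown in the thread); with the added t2.col1 = 'missing' condition, the constant filter should show up on the scan of t2 instead of being evaluated for every candidate pair:

EXPLAIN
SELECT t1.col1
FROM read_parquet('slow3.parquet') AS t1
LEFT JOIN read_parquet('slow3.parquet') AS t2
    ON t1.col1 = t2.col1
    AND t2.col1 = 'missing'
GROUP BY t1.col1;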

Hi @jonashaag, thanks for reporting this; I could reproduce it.

Sorry for the noise in the discussion, I cleaned up the thread now.