datafusion: Reported very slow performance compared to DuckDB in ibis-project
Describe the bug
As reported by @cpcloud in https://github.com/ibis-project/ibis/pull/7703
The most relevant portions:
DataFusion DataFusion never ran out of memory and had a memory profile similar to DuckDB: single digit GBs peak memory.
However, it was still extremely slow compared to DuckDB, about 9-10 minutes to run the whole workload.
Similarly to Polars I compared both the Ibis implementation and a hand-written SQL version (built from the generated Ibis code). Both had the same performance
I also looked at perf top while the DataFusion workload was running and saw this:
To Reproduce
TBD (first thing would be to get a datafusion only reproducer)
Looks like the query, from https://github.com/ibis-project/ibis/pull/7703 is
SELECT
month,
ext,
COUNT(DISTINCT project_name) AS project_count
FROM (
SELECT
project_name,
DATE_TRUNC('month', uploaded_on) AS month,
NULLIF(
REPLACE(
REPLACE(
REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_MATCH(path, CONCAT('(', '\.([a-z0-9]+)$', ')'))[2],
'cxx|cpp|cc|c|hpp|h',
'C/C++',
'g'
),
'^f.*$',
'Fortran',
'g'
),
'rs',
'Rust'
),
'go',
'Go'
),
'asm',
'Assembly'
),
''
) AS ext
FROM pypi
WHERE COALESCE(
ARRAY_LENGTH(
REGEXP_MATCH(path, '\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$')
) > 0,
FALSE
)
AND NOT COALESCE(ARRAY_LENGTH(REGEXP_MATCH(path, '(^|/)test(|s|ing)')) > 0, FALSE)
AND NOT STRPOS(path, '/site-packages/') > 0
)
WHERE ext IS NOT NULL
GROUP BY month, ext
ORDER BY month DESC, project_count DESC
Expected behavior
No response
Additional context
No response
About this issue
- Original URL
- State: closed
- Created 7 months ago
- Reactions: 1
- Comments: 15 (15 by maintainers)
test with simple regexp_match query with
index-0.parquet
array-datafusion
took 47.847 seconds.duckdb
took 3.029 secondsHappy to try the query again once the next release is out!
@viirya found some very low hanging fruit: ~https://github.com/apache/arrow-datafusion/pull/8538~ https://github.com/apache/arrow-datafusion/pull/8631
I looked into the regular expression matching code in DataFusion – there is a lot of room for improvement:
It translates each argument into an array (even when the argument is a constant). Thus DataFusion is effectively compiling the regular expression for each row (not even each batch) which is unsurprisingly quite expensive
This is very fixable but the way the functions are wired in will take some finagling I think
A similar problem (recompiling the regex again and again) I found some time ago in the clickbench benchmark as well (query 28): https://github.com/JayjeetAtGithub/datafusion-duckdb-benchmark/blob/main/clickbench/queries-datafusion.sql#L29
@comphead recently added this to the contributor guide: https://arrow.apache.org/datafusion/library-user-guide/profiling.html#building-a-flamegraph
It is #8631 actually. 😄
Thanks for creating the issue!
The example can be simplified a bit.
It should be sufficient to see the performance difference with DuckDB by:
SELECT
.