datafusion: Reported very slow performance compared to DuckDB in ibis-project

Describe the bug

As reported by @cpcloud in https://github.com/ibis-project/ibis/pull/7703

The most relevant portions:

DataFusion DataFusion never ran out of memory and had a memory profile similar to DuckDB: single digit GBs peak memory.

However, it was still extremely slow compared to DuckDB, about 9-10 minutes to run the whole workload.

Similarly to Polars I compared both the Ibis implementation and a hand-written SQL version (built from the generated Ibis code). Both had the same performance

I also looked at perf top while the DataFusion workload was running and saw this:

289368886-a33bf90f-8877-43b0-8a4d-328f9721f264

To Reproduce

TBD (first thing would be to get a datafusion only reproducer)

Looks like the query, from https://github.com/ibis-project/ibis/pull/7703 is

SELECT
  month,
  ext,
  COUNT(DISTINCT project_name) AS project_count
FROM (
  SELECT
    project_name,
    DATE_TRUNC('month', uploaded_on) AS month,
    NULLIF(
      REPLACE(
        REPLACE(
          REPLACE(
            REGEXP_REPLACE(
              REGEXP_REPLACE(
                REGEXP_MATCH(path, CONCAT('(', '\.([a-z0-9]+)$', ')'))[2],
                'cxx|cpp|cc|c|hpp|h',
                'C/C++',
                'g'
              ),
              '^f.*$',
              'Fortran',
              'g'
            ),
            'rs',
            'Rust'
          ),
          'go',
          'Go'
        ),
        'asm',
        'Assembly'
      ),
      ''
    ) AS ext
  FROM pypi
  WHERE COALESCE(
      ARRAY_LENGTH(
        REGEXP_MATCH(path, '\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$')
      ) > 0,
      FALSE
    )
    AND NOT COALESCE(ARRAY_LENGTH(REGEXP_MATCH(path, '(^|/)test(|s|ing)')) > 0, FALSE)
    AND NOT STRPOS(path, '/site-packages/') > 0
)
WHERE ext IS NOT NULL
GROUP BY month, ext
ORDER BY month DESC, project_count DESC

Expected behavior

No response

Additional context

No response

About this issue

  • Original URL
  • State: closed
  • Created 7 months ago
  • Reactions: 1
  • Comments: 15 (15 by maintainers)

Most upvoted comments

test with simple regexp_match query with index-0.parquet array-datafusion took 47.847 seconds.

SELECT COUNT(*) FROM '*.parquet' WHERE
    ARRAY_LENGTH(
      REGEXP_MATCH(path, '\\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$')
    ) > 0;


DataFusion CLI v33.0.0
+----------+
| COUNT(*) |
+----------+
| 5834398  |
+----------+
1 row in set. Query took 47.847 seconds.

../datafusion-cli/target/release/datafusion-cli -f d0.sql  225.03s user 5.77s system 477% cpu 48.353 total

duckdb took 3.029 seconds

SELECT COUNT(*) FROM '*.parquet' WHERE
    ARRAY_LENGTH(
      REGEXP_EXTRACT_ALL(path, '\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$')
    ) > 0;


┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│      5834398 │
└──────────────┘
../../duckdb/build/release/duckdb -s "`cat d1.sql`"  15.70s user 1.78s system 576% cpu 3.029 total

Happy to try the query again once the next release is out!

I looked into the regular expression matching code in DataFusion – there is a lot of room for improvement:

It translates each argument into an array (even when the argument is a constant). Thus DataFusion is effectively compiling the regular expression for each row (not even each batch) which is unsurprisingly quite expensive

This is very fixable but the way the functions are wired in will take some finagling I think

A similar problem (recompiling the regex again and again) I found some time ago in the clickbench benchmark as well (query 28): https://github.com/JayjeetAtGithub/datafusion-duckdb-benchmark/blob/main/clickbench/queries-datafusion.sql#L29

Hi, could you share the method to draw the flamegraph?

@comphead recently added this to the contributor guide: https://arrow.apache.org/datafusion/library-user-guide/profiling.html#building-a-flamegraph

It is #8631 actually. 😄

Thanks for creating the issue!

The example can be simplified a bit.

It should be sufficient to see the performance difference with DuckDB by:

  1. Using any single file from the dataset (they’re all big enough to take some noticeable amount of time)
  2. Removing the aggregation and using only the inner SELECT.