duckdb: DuckDB aggregations over an Arrow table are slower than over a Pandas DataFrame

What happens?

DuckDB aggregations over an Arrow table are slower than over a Pandas DataFrame.

To Reproduce

import pandas as pd
import duckdb

# Materialize the same Parquet file twice: once as an Arrow table, once as a Pandas DataFrame.
lineitem = duckdb.query("SELECT * FROM 'lineitemsf1.snappy.parquet'").to_arrow_table()
lineitem_df = duckdb.query("SELECT * FROM 'lineitemsf1.snappy.parquet'").to_df()

# Separate connection configured with 4 threads; the module-level duckdb.query()
# calls below use the default connection (reported as 1 thread in the timings).
con = duckdb.connect()
con.execute('PRAGMA threads=4')

# DuckDB resolves lineitem / lineitem_df from the Python scope via replacement scans.
ungrouped_aggregate = '''SELECT SUM(l_extendedprice), MIN(l_extendedprice), MAX(l_extendedprice), AVG(l_extendedprice) FROM lineitem'''
ungrouped_aggregate_df = '''SELECT SUM(l_extendedprice), MIN(l_extendedprice), MAX(l_extendedprice), AVG(l_extendedprice) FROM lineitem_df'''


%timeit -n 20 -r 10 duckdb.query(ungrouped_aggregate).to_arrow_table()
549 ms ± 52.8 ms per loop (mean ± std. dev. of 10 runs, 20 loops each) with 1 thread

%timeit -n 20 -r 10 con.query(ungrouped_aggregate).to_arrow_table()
569 ms ± 81.2 ms per loop (mean ± std. dev. of 10 runs, 20 loops each) with 4 threads

%timeit -n 20 -r 10 duckdb.query(ungrouped_aggregate_df).to_df()
17.2 ms ± 2.34 ms per loop (mean ± std. dev. of 10 runs, 20 loops each) with 1 thread

%timeit -n 20 -r 10 con.query(ungrouped_aggregate_df).to_df()
20.5 ms ± 1.21 ms per loop (mean ± std. dev. of 10 runs, 20 loops each) with 4 threads
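
For context on the diagnosis further down in the thread, the batch layout of the Arrow table can be inspected like this (a small sketch; the exact chunk count depends on the data):

print(lineitem.num_rows)                               # total rows in the table
print(lineitem.column('l_extendedprice').num_chunks)   # number of underlying Arrow chunks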

Environment (please complete the following information):

  • OS: Windows
  • DuckDB Version: 0.3.1
  • DuckDB Client: Python

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 16 (9 by maintainers)

Most upvoted comments

Many thanks for the quick replies @jonkeane and @westonpace. Unfortunately, we can’t use the C++ code, since we want to minimize linking. I’ll push a change today for both Python and R where we expose only to_arrow_table and to_record_batch, both with a chunk_size parameter that defaults to 10^6.

We can later make it more flexible with the TableBatchReader once it’s accessible through R/Python.
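
A usage sketch of the API described above (the parameter spelling, default, and method name are taken from this comment and may not match the released signature exactly):

import duckdb

rel = duckdb.query("SELECT * FROM 'lineitemsf1.snappy.parquet'")

# Materialize the result as an Arrow table with ~10^6 rows per batch
# instead of many 1024-row batches (chunk_size as proposed above).
tbl = rel.to_arrow_table(chunk_size=1_000_000)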

Actually, after staring at things on the Arrow side for a bit, I am ready to suggest an alternative workaround.

A big portion of the small-batch overhead is in the scanner. While we are investigating methods to reduce small-batch overhead in the execution plans, I don’t know of anyone looking into such a change in the scanner (when working with files we are rarely working with small batches).

My recommendation then would be to use a TableBatchReader, but we don’t currently have that exposed in pyarrow (I filed ARROW-15823 to address this). I’m not entirely sure how much C++ code is allowed in arrow_array_stream.cpp, but if you can use the C++ TableBatchReader then you should be able to do this today.

Otherwise you would need to wait for ARROW-15823 for a more Python-only solution.
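
As a rough Python-side approximation of the table-backed batch reader idea, the table can be re-chunked and wrapped in a record-batch stream using APIs pyarrow does expose; this is a sketch of the concept (the helper name is made up here), not the C++ TableBatchReader itself:

import pyarrow as pa

def table_batch_stream(table: pa.Table, max_chunksize: int = 1_000_000) -> pa.RecordBatchReader:
    # Merge the many small chunks first, then stream the table back out
    # as larger record batches behind a reader interface.
    batches = table.combine_chunks().to_batches(max_chunksize=max_chunksize)
    return pa.RecordBatchReader.from_batches(table.schema, batches)

reader = table_batch_stream(lineitem)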

So I instrumented the code and found the issue. The problem is that when the Arrow table is created from lineitem, DuckDB actually produces a chunked Arrow table with batches of 1024 elements. These batches appear to be too small for Arrow, which adds a large cost when reading them back. You can eliminate the difference either by reading the Parquet file directly with Arrow or by calling the combine_chunks() method.
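
A minimal sketch of those two workarounds (file name taken from the reproduction above; note that combine_chunks() copies the data into one contiguous chunk per column):

import pyarrow.parquet as pq

# Option 1: read the Parquet file directly with Arrow, which typically yields
# one large chunk per row group instead of 1024-row batches.
lineitem_direct = pq.read_table('lineitemsf1.snappy.parquet')

# Option 2: keep the DuckDB-produced table but merge its many small
# chunks into a single chunk per column.
lineitem_combined = lineitem.combine_chunks()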

I think I’ll add an optional parameter to to_arrow_table() where the user can state the number of vectors they want per batch, similar to what we have on fetch_arrow_chunk().

Unless @Mytherin or @hannes has other thoughts?

I can confirm this is happening.

Two things to notice:

  • In our Arrow integration, for doubles, we currently have to go over the data once to transform NaNs into NULLs, which impacts performance a bit, although not enough to justify the difference.
  • The Arrow scans are only parallel for batched tables, so the table in your example won’t be executed in parallel.

However, I’m not sure these reasons are sufficient to explain the time difference, so I will profile this later this week 😃