duckdb: union_by_name is vastly slower/fails over httpfs
What happens?
A query as simple as `select max(date) from read_parquet('s3://bucket/**/*.parquet', hive_partitioning=true, union_by_name=true)` becomes unusably slow, seemingly as the set of files to scan increases.
- I am enabling `union_by_name` preemptively, because I know that the set of columns being written to these files may increase over time, which will lead to historically written files lacking newer columns.
- By “unusably slow” I mean, firstly, that it takes a long time, but also that it begins producing `duckdb.IOException: IO Error: Connection error for HTTP HEAD to <bucket http url>`.
- In my particular example `date` is a hive partition key, which should mean (I think) that there’s not even any data to scan; the partition values are already in the file paths (see the sketch after this list).
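To illustrate that last point, here is a minimal sketch (not from the report) of recovering the maximum partition value from the object paths alone, so that no parquet data needs to be read. The bucket name and `date=` layout come from the reproduction below; the use of `glob()` and the path parsing are assumptions about one possible workaround, not an endorsed fix.

```python
import duckdb

con = duckdb.connect()
# Assumes the httpfs extension and S3 credentials are set up.
con.execute("install httpfs")
con.execute("load httpfs")

# glob() only lists the matching object keys; no parquet file is opened.
files = con.sql("select file from glob('s3://bucket/**/*.parquet')").fetchall()

# Hypothetical key layout: .../date=2023-01-01/part0.parquet
dates = [f.split("date=")[1].split("/")[0] for (f,) in files]
print(max(dates))
```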
To Reproduce
S3 bucket with structure like:
date=2023-01-01/part0.parquet
date=2023-01-02/part0.parquet
...
date=2023-06-20/part0.parquet
Each file can be as simple as (the parquet equivalent of): `date,id\n{date},1`
For 300 dates' worth of this, running locally: `select max(date) from read_parquet('s3://bucket/**/*.parquet', hive_partitioning=true, union_by_name=true)` takes ~4 minutes, whereas the same query with `union_by_name=false` is essentially instant.
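A minimal sketch of generating that layout, assuming local files written with DuckDB's `COPY ... TO` (the report itself targets `s3://` over httpfs, which is where the slowdown shows up; the file here carries only `id`, with `date` supplied by the partition path):

```python
import datetime
import pathlib
import time

import duckdb

root = pathlib.Path("bucket")
start = datetime.date(2023, 1, 1)

# 300 daily partitions, one tiny file each, as in the report.
for i in range(300):
    d = start + datetime.timedelta(days=i)
    part_dir = root / f"date={d.isoformat()}"
    part_dir.mkdir(parents=True, exist_ok=True)
    path = (part_dir / "part0.parquet").as_posix()
    duckdb.sql(f"copy (select 1 as id) to '{path}' (format parquet)")

# Compare the two modes; over httpfs the union_by_name=true case is the slow one.
for ubn in ("false", "true"):
    t0 = time.time()
    duckdb.sql(
        "select max(date) from read_parquet('bucket/**/*.parquet', "
        f"hive_partitioning=true, union_by_name={ubn})"
    ).fetchall()
    print(f"union_by_name={ubn}: {time.time() - t0:.2f}s")
```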
OS:
aarch64
DuckDB Version:
0.8.0/0.8.1
DuckDB Client:
Python
Full Name:
Dan Cardin
Affiliation:
Known
Have you tried this on the latest master branch?
- I agree
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- I agree
About this issue
- State: open
- Created a year ago
- Comments: 16
At least for my original use case, it will need to support hive partitioning. But once it does, I think I'll be happy to swap and close this issue, assuming `union_by_name` being slow cannot otherwise be avoided; since this schema seems like an incompatible alternative.

We actually merged a commit adding support for the `schema` parameter to `read_parquet` yesterday - see https://github.com/duckdb/duckdb/pull/9123
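For context, a rough sketch of what using that `schema` option instead of `union_by_name` might look like. The shape of the option (a MAP from parquet field_id to a name/type/default_value struct), the field_ids, and the column types are my assumptions about the merged PR rather than a confirmed API, and per the comment above it would also need to support hive partitioning for this use case.

```python
import duckdb

# Illustrative only: declare the full schema up front so files that lack a
# column get `default_value` instead of requiring a per-file schema scan.
# Assumes the parquet files carry field_ids; syntax may differ from the PR.
duckdb.sql("""
    select *
    from read_parquet(
        's3://bucket/**/*.parquet',
        schema = map {
            1: {name: 'id', type: 'INTEGER', default_value: NULL}
        }
    )
""")
```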
That doesn't align with the internals and wouldn't really work in practice. We need to scan files to know if the column is even in there - and we need to know the type of the column as well. With `union_by_name`, the type can only be found by looking through all files. If one file has the column as a `VARCHAR`, the column is promoted to `VARCHAR`, even if the others have the column as `INTEGER`.

That does seem odd indeed and should probably be investigated.
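A small sketch of the type promotion described in that comment, using two locally written files (the file names and values here are made up for illustration):

```python
import duckdb

# The same column is an INTEGER in one file and a VARCHAR in the other.
duckdb.sql("copy (select 1 as id, 42 as value) to 'a.parquet' (format parquet)")
duckdb.sql("copy (select 2 as id, 'hello' as value) to 'b.parquet' (format parquet)")

# union_by_name has to open both files to discover this; `value` comes back
# promoted to VARCHAR rather than INTEGER.
rel = duckdb.sql(
    "select * from read_parquet(['a.parquet', 'b.parquet'], union_by_name=true)"
)
print(rel.types)       # expect the `value` column to be VARCHAR
print(rel.fetchall())  # e.g. [(1, '42'), (2, 'hello')]
```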