duckdb: union_by_name is vastly slower/fails over httpfs

What happens?

A query as simple as select max(date) from read_parquet('s3://bucket/**/*.parquet', hive_partitioning=true, union_by_name=true) becomes unusably slow as the set of files to scan increases.

  • I am enabling union_by_name preemptively, because I know that the set of columns being written to these files may grow over time, which will leave historically written files lacking newer columns.

  • By “unusably slow” I mean that it not only takes a long time, it also begins producing duckdb.IOException: IO Error: Connection error for HTTP HEAD to <bucket http url>

  • In my particular example, date is a hive partition key, which should mean (I think) that there is not even any data to scan; the value is visible in the file paths alone (see the sketch below).
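A minimal sketch of that intuition, as a possible workaround, assuming the Python client: since date only appears in the directory names, the maximum can in principle be recovered from the object listing alone, without reading any parquet metadata. The glob() table function and regexp_extract() exist in DuckDB, but the output column name, the bucket URL, and the regex over the date=YYYY-MM-DD layout below are assumptions based on the structure described above.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
# Assumed: S3 credentials/region are already configured via the usual s3_* settings.

# Derive max(date) from the listing alone: glob() only lists objects,
# so no parquet footers are fetched per file.
max_date = con.execute(r"""
    SELECT max(CAST(regexp_extract(file, 'date=(\d{4}-\d{2}-\d{2})', 1) AS DATE))
    FROM glob('s3://bucket/**/*.parquet')
""").fetchone()[0]
print(max_date)
```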

To Reproduce

S3 bucket with structure like:

date=2023-01-01/part0.parquet
date=2023-01-02/part0.parquet
...
date=2023-06-20/part0.parquet

Each file can be as simple as the parquet equivalent of: date,id\n{date},1

For 300 dates' worth of this, running locally against the bucket, select max(date) from read_parquet('s3://bucket/**/*.parquet', hive_partitioning=true, union_by_name=true) takes ~4 minutes, whereas the same query with union_by_name=false is essentially instant.
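A rough reproduction sketch, assuming the Python client: it writes 300 single-row files into the date=YYYY-MM-DD/part0.parquet layout on disk (keeping date only in the path, since it is the partition key); the directory can then be synced to a bucket out of band before timing the two variants. The bucket name is a placeholder.

```python
import duckdb, os, time
from datetime import date, timedelta

con = duckdb.connect()

# Build 300 single-row files in the date=YYYY-MM-DD/part0.parquet layout.
start = date(2023, 1, 1)
for i in range(300):
    d = start + timedelta(days=i)
    path = f"data/date={d.isoformat()}"
    os.makedirs(path, exist_ok=True)
    con.execute(f"COPY (SELECT 1 AS id) TO '{path}/part0.parquet' (FORMAT PARQUET)")

# Sync ./data to s3://bucket/ out of band (e.g. aws s3 sync), then compare:
con.execute("INSTALL httpfs; LOAD httpfs;")
for union_by_name in ("true", "false"):
    t0 = time.time()
    con.execute(
        "SELECT max(date) FROM read_parquet('s3://bucket/**/*.parquet', "
        f"hive_partitioning=true, union_by_name={union_by_name})"
    ).fetchall()
    print(f"union_by_name={union_by_name}: {time.time() - t0:.1f}s")
```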

OS:

aarch64

DuckDB Version:

0.8.0/0.8.1

DuckDB Client:

Python

Full Name:

Dan Cardin

Affiliation:

Known

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree

Most upvoted comments

At least for my original use case, it will need to support hive partitioning. But once it does, I think I'll be happy to swap over and close this issue, assuming the slowness of union_by_name cannot otherwise be avoided, since as it stands the schema parameter seems like an alternative that isn't compatible with my setup.

We actually merged a commit adding support for the schema parameter to read_parquet yesterday - see https://github.com/duckdb/duckdb/pull/9123

(Obviously I don't know the internals here, but) it feels like if I'm referencing specific columns rather than *, then the schema should be inferable by reading just enough files to find the columns in question?

That doesn't align with the internals and wouldn't really work in practice. We need to scan files to know whether the column is even in there, and we need to know the type of the column as well. With union_by_name, the type can only be determined by looking through all of the files: if one file has the column as a VARCHAR, the column is promoted to VARCHAR, even if the others have it as an INTEGER.
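A small local illustration of that promotion behavior (the file names here are just placeholders for the example):

```python
import duckdb

con = duckdb.connect()

# Two files that disagree on the type of 'id': INTEGER in one, VARCHAR in the other.
con.execute("COPY (SELECT 1 AS id) TO 'a.parquet' (FORMAT PARQUET)")
con.execute("COPY (SELECT 'x' AS id) TO 'b.parquet' (FORMAT PARQUET)")

# union_by_name has to open both files to discover the mismatch; the unified
# 'id' column comes back promoted to VARCHAR.
print(con.execute(
    "DESCRIBE SELECT * FROM read_parquet(['a.parquet', 'b.parquet'], union_by_name=true)"
).fetchall())
```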

Even if I were willing to accept the slowdown, I seem to be unable to get it to work in practice. It eventually times out (seemingly via the http_timeout setting), which feels odd, because no individual file is anywhere near exceeding that timeout on its own.

That does seem odd indeed and should probably be investigated.
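For anyone hitting the same timeout while this is investigated, one thing to try is raising the httpfs timeout. This is only a sketch of a possible mitigation; the unit of http_timeout has been documented differently across versions, so check the docs for the release you are on.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")

# Raise the HTTP timeout used by httpfs. Check your version's documentation
# for the unit (it has been listed as seconds in some releases and
# milliseconds in others), and adjust the value accordingly.
con.execute("SET http_timeout=120000")

con.execute(
    "SELECT max(date) FROM read_parquet('s3://bucket/**/*.parquet', "
    "hive_partitioning=true, union_by_name=true)"
).fetchall()
```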