duckdb: globbing multiple parquet files fails on httpfs filesystem [R]

What happens?

The documented syntax for working with multiple parquet files seems to work only with local files. I can use the httpfs extension to successfully access a single parquet file (🎉 🎉 ), but not multiple files, let alone hive-partitioned ones. (Note that this works just fine with the arrow backend, e.g. via S3 resources, since arrow doesn’t support generic https; a sketch is shown below.) This issue refers only to the duckdb-native approach.
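
(For reference, a minimal sketch of the arrow route. The bucket name, endpoint, and anonymous access are assumptions inferred from the HTTPS URL above, not documented settings for this host:)

library(arrow)
# assumption: the MinIO server at data.ecoforecast.org exposes the data as
# an S3 bucket named "neon4cast-scores" that allows anonymous reads
bucket <- s3_bucket("neon4cast-scores",
                    endpoint_override = "data.ecoforecast.org",
                    anonymous = TRUE)
ds <- open_dataset(bucket$path("parquet/aquatics"))  # hive partitions auto-detected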

To Reproduce

This works:

library(DBI)
library(duckdb)
conn <- DBI::dbConnect(duckdb(),
                       ":memory:",
                       config = list("memory_limit" = "12GB",
                                     "temp_directory" = "/tmp"))
DBI::dbExecute(conn, "INSTALL 'httpfs';")
DBI::dbExecute(conn, "LOAD 'httpfs';")

query <- glue::glue("SELECT * FROM parquet_scan('",
                    "https://data.ecoforecast.org/neon4cast-scores/parquet/",
                    "aquatics/2022/aquatics-2022-08-29-persistenceRW.parquet",
                    "') LIMIT 10")
df <- DBI::dbGetQuery(conn, query)
df

Amazing!! But ideally I’d like to get the full (hive-partitioned) dataset. The documentation I’ve been able to find is a bit thin here, but from #4211 it seemed that something like this might work. No luck, though:

query <- glue::glue("SELECT * FROM parquet_scan('",
                    "https://data.ecoforecast.org/neon4cast-scores/parquet/",
                    "*/*/*.parquet'", ", HIVE_PARTITIONING=1",
                    ") LIMIT 10")
df <- DBI::dbGetQuery(conn, query)
df

In fact, I can’t even stack two parquet files together with the usual syntax:

query <- glue::glue("SELECT * FROM parquet_scan('[",
                    "https://data.ecoforecast.org/neon4cast-scores/parquet/",
                    "aquatics/2022/aquatics-2022-08-29-persistenceRW.parquet",
                    ", ",
                    "https://data.ecoforecast.org/neon4cast-scores/parquet/",
                    "aquatics/2022/aquatics-2022-08-28-persistenceRW.parquet",
                    "]') LIMIT 10")
df <- DBI::dbGetQuery(conn, query)
df

OS:

Linux (Ubuntu 20.04)

DuckDB Version:

GitHub version

DuckDB Client:

R

Full Name:

Carl Boettiger

Affiliation:

UC Berkeley

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 19 (13 by maintainers)

Most upvoted comments

Thanks for clarifying! Yes, supporting a list of URLs would still be great.

In this case, this is also a public S3 bucket. I think duckdb has support for s3://-based URLs(?) (which would provide the mechanism to list contents), but I’m struggling to find documentation of the appropriate syntax (especially for alternative hosts – e.g. this is a MinIO server, so we would need an endpoint_override / AWS_DEFAULT_ENDPOINT option).

We do support alternative hosts! The configuration parameters are documented in the config docs, but the following is a basic usage example from one of our tests:

SET s3_secret_access_key='minio_duckdb_user_password';
SET s3_access_key_id='minio_duckdb_user';
SET s3_region='eu-west-1';
SET s3_endpoint='duckdb-minio.com:9000';
SET s3_use_ssl=false;

COPY lineitem TO 's3://test-bucket/multipart/export_large.parquet' (FORMAT 'parquet');
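
(From R, the same settings can be applied through DBI before querying. A sketch reusing the test values above; the credentials are placeholders, and a public bucket may not need the key settings at all:)

# placeholder MinIO credentials and endpoint from the test above; substitute your own
DBI::dbExecute(conn, "SET s3_access_key_id='minio_duckdb_user';")
DBI::dbExecute(conn, "SET s3_secret_access_key='minio_duckdb_user_password';")
DBI::dbExecute(conn, "SET s3_region='eu-west-1';")
DBI::dbExecute(conn, "SET s3_endpoint='duckdb-minio.com:9000';")
DBI::dbExecute(conn, "SET s3_use_ssl=false;")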

@cboettig haha yes was just typing this 😁 no worries!

My apologies! My example was missing the single quotes around each URL. The list format works just as you say. That’s awesome. Again, sorry for the noise. (The above should have been:)


query <- glue::glue("SELECT * FROM parquet_scan([",
                    "'https://data.ecoforecast.org/neon4cast-scores/parquet/",
                    "aquatics/2022/aquatics-2022-08-29-persistenceRW.parquet'",
                    ", ",
                    "'https://data.ecoforecast.org/neon4cast-scores/parquet/",
                    "aquatics/2022/aquatics-2022-08-28-persistenceRW.parquet'",
                    "]) LIMIT 10")

Thanks, good call! Yes, I confirm this works well on a public S3 bucket with globbing (https://gist.github.com/cboettig/b1840f20996139d422ab497613dfdbde). 🎉 Wonderful. (Though I’m also still super happy about the pure HTTP access; it would be great if that could be extended to the list format… should I open a separate issue for that?)
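
(For reference, a sketch of the S3 globbing along the lines of that gist. The bucket name, endpoint, and URL style are assumptions inferred from the HTTPS URL, not confirmed settings for this host:)

# assumption: https://data.ecoforecast.org/neon4cast-scores/... maps to
# endpoint data.ecoforecast.org with a path-style bucket "neon4cast-scores"
DBI::dbExecute(conn, "SET s3_endpoint='data.ecoforecast.org';")
DBI::dbExecute(conn, "SET s3_url_style='path';")
query <- paste0("SELECT * FROM parquet_scan(",
                "'s3://neon4cast-scores/parquet/aquatics/*/*.parquet',",
                " HIVE_PARTITIONING=1) LIMIT 10")
df <- DBI::dbGetQuery(conn, query)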

(As an aside, I’ve noticed that duckdb is stricter about schema matching across parquet files than arrow is on the same data: it looks like arrow will permit some parquet files to be missing a given column and will treat those rows as NA, but duckdb will throw an error. A possible workaround is sketched below.)
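
(A sketch of one workaround, assuming the duckdb build in use supports the union_by_name flag for parquet reads; it unifies schemas by column name and fills columns missing from some files with NULL:)

# assumption: union_by_name is available in this duckdb version
query <- paste0("SELECT * FROM parquet_scan(",
                "'s3://neon4cast-scores/parquet/aquatics/*/*.parquet',",
                " HIVE_PARTITIONING=1, UNION_BY_NAME=1) LIMIT 10")
df <- DBI::dbGetQuery(conn, query)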