duckdb: globbing multiple parquet files fails on httpfs filesystem [R]
What happens?
The documented syntax for working with multiple parquet files seems to work only with local files. I can use the httpfs extension to successfully access a single parquet file (🎉 🎉), but not multiple files, let alone hive-partitioned files. (Note this works just fine with the arrow backend, e.g. with S3 resources, since arrow doesn't support generic https; this issue refers only to the duckdb-native approach.)
To Reproduce
This works:
library(DBI)
library(duckdb)

conn <- DBI::dbConnect(duckdb(),
                       ":memory:",
                       config = list("memory_limit" = "12GB",
                                     "temp_directory" = "/tmp"))
DBI::dbExecute(conn, "INSTALL 'httpfs';")
DBI::dbExecute(conn, "LOAD 'httpfs';")
query <- glue::glue("SELECT * FROM parquet_scan('",
                    "https://data.ecoforecast.org/neon4cast-scores/parquet/",
                    "aquatics/2022/aquatics-2022-08-29-persistenceRW.parquet",
                    "') LIMIT 10")
df <- DBI::dbGetQuery(conn, query)
df
Amazing!! But ideally I'd like to get the full (hive-partitioned) dataset. The documentation I've been able to find is a bit thin here, but from #4211 it seemed something like this might work; no luck:
query <- glue::glue("SELECT * FROM parquet_scan('",
                    "https://data.ecoforecast.org/neon4cast-scores/parquet/",
                    "*/*/*.parquet'", ", HIVE_PARTITIONING=1",
                    ") LIMIT 10")
df <- DBI::dbGetQuery(conn, query)
df
In fact, I can't even stack two parquet files together with the usual list syntax:
query <- glue::glue("SELECT * FROM parquet_scan('[",
                    "https://data.ecoforecast.org/neon4cast-scores/parquet/",
                    "aquatics/2022/aquatics-2022-08-29-persistenceRW.parquet",
                    ", ",
                    "https://data.ecoforecast.org/neon4cast-scores/parquet/",
                    "aquatics/2022/aquatics-2022-08-28-persistenceRW.parquet",
                    "]') LIMIT 10")
df <- DBI::dbGetQuery(conn, query)
df
OS:
Linux (ubuntu 20.04)
DuckDB Version:
github version
DuckDB Client:
R
Full Name:
Carl Boettiger
Affiliation:
UC Berkeley
Have you tried this on the latest master branch?
- I agree
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- I agree
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 19 (13 by maintainers)
We do support alternative hosts! The configuration parameters are documented in the config docs, but the following is a basic usage example from one of our tests:
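(The test snippet referenced above isn't reproduced in this extract. As a hedged sketch only, these are the httpfs S3 settings DuckDB exposes for pointing at an alternative host; the host name and credentials below are placeholders, not values from the actual test.)

```r
library(DBI)
library(duckdb)

conn <- DBI::dbConnect(duckdb())
DBI::dbExecute(conn, "INSTALL 'httpfs'; LOAD 'httpfs';")

# Sketch only: point DuckDB's httpfs S3 layer at a non-AWS host.
# Setting names are DuckDB configuration options; values are placeholders.
DBI::dbExecute(conn, "SET s3_endpoint='minio.example.com:9000';")
DBI::dbExecute(conn, "SET s3_url_style='path';")  # path-style URLs, common for MinIO-like hosts
DBI::dbExecute(conn, "SET s3_use_ssl=false;")
DBI::dbExecute(conn, "SET s3_access_key_id='placeholder-key';")
DBI::dbExecute(conn, "SET s3_secret_access_key='placeholder-secret';")
```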
@cboettig haha yes was just typing this 😁 no worries!
My apologies! My example had missed the single quotes around each path. The list format works just as you say. That's awesome. Again, sorry for the noise. (The above should have been:)
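(The corrected snippet isn't shown in this extract; the following is a reconstruction of the list-format query, applying the fix described above, i.e. single quotes around each path inside the brackets rather than around the whole list:)

```r
# Reconstruction of the corrected query: each URL in the list is wrapped
# in its own single quotes, and the brackets sit outside the quotes.
query <- paste0(
  "SELECT * FROM parquet_scan([",
  "'https://data.ecoforecast.org/neon4cast-scores/parquet/",
  "aquatics/2022/aquatics-2022-08-29-persistenceRW.parquet', ",
  "'https://data.ecoforecast.org/neon4cast-scores/parquet/",
  "aquatics/2022/aquatics-2022-08-28-persistenceRW.parquet'",
  "]) LIMIT 10")
df <- DBI::dbGetQuery(conn, query)
```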
Thanks, good call! Yes, I confirm this works well on a public S3 bucket with globbing (https://gist.github.com/cboettig/b1840f20996139d422ab497613dfdbde). 🎉 Wonderful. (Though I'm also still super happy about the pure HTTP access; it would be great if that could be extended to the list format. Should I open a separate issue for that?)
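(A hedged sketch of the working S3 glob; the bucket name and endpoint here are inferred from the https URL in the original report, not confirmed. See the linked gist for the actual working code.)

```r
# Hypothetical sketch: glob over the hive-partitioned dataset via s3://.
# Bucket/endpoint are inferred from the https URL, not confirmed.
DBI::dbExecute(conn, "SET s3_endpoint='data.ecoforecast.org';")
DBI::dbExecute(conn, "SET s3_url_style='path';")
query <- paste0(
  "SELECT * FROM parquet_scan(",
  "'s3://neon4cast-scores/parquet/*/*/*.parquet',",
  " HIVE_PARTITIONING=1) LIMIT 10")
df <- DBI::dbGetQuery(conn, query)
```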
(As an aside, I've noticed that duckdb is stricter about schema matching across parquet files than arrow is on the same data: arrow will permit some parquet files to be missing a given column and treat those rows as NA, while duckdb will throw an error.)