delta-rs: Writing DeltaTable to Azure with partitioning fails and goes into endless loop w/o error message
Environment
Delta-rs version: 0.12.0
Binding: Python
Environment:
- Cloud provider: Azure Storage Gen2
- OS: macOS Sonoma (14.0) on Apple M1 Pro
- Other: Python 3.10.11
Bug
What happened:
I have an Arrow dataset (~120m records, ~2.8 GB as parquet files) with the following schema that I want to write to ADLS Gen2, partitioned by the columns exchange_code and year:
schema = pa.schema(
    [
        pa.field("date", pa.date32()),
        pa.field("open", pa.float64()),
        pa.field("high", pa.float64()),
        pa.field("low", pa.float64()),
        pa.field("close", pa.float64()),
        pa.field("adjusted_close", pa.float64()),
        pa.field("volume", pa.int64()),
        pa.field("exchange_code", pa.string()),
        pa.field("security_code", pa.string()),
        pa.field("year", pa.int32()),
        pa.field("created_at", pa.timestamp("us")),
    ]
)
It works successfully when:
- writing to the local filesystem (takes about ~1 min) using deltalake.write_deltalake
- writing to Azure without any partitioning using deltalake.write_deltalake and the storage_options arg
- using pyarrow.dataset.write_dataset with partitions and adlfs.AzureBlobFileSystem as the filesystem arg (see the sketch below)
- writing only the first 1-5m records to Azure using deltalake.write_deltalake and the above-mentioned partitioning

But not when:
- writing the full dataset to Azure using deltalake.write_deltalake with the above-mentioned partitioning.
There is no error message; the call just runs forever without creating any directories or files. In some runs it created 2-3 partition directories at the beginning, but then nothing happened thereafter.
Column exchange_code has 15 distinct values and year has ~10-80 distinct values depending on the exchange.
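For reference, a minimal sketch of the pyarrow.dataset.write_dataset workaround mentioned above (the target path and exact parameter values are illustrative, not the ones actually used):

import adlfs
import pyarrow.dataset as ds

# fsspec filesystem for ADLS Gen2 using the same service-principal credentials
fs = adlfs.AzureBlobFileSystem(
    account_name=account_name,
    client_id=client,
    client_secret=secret,
    tenant_id=tenant,
)

# Write plain hive-partitioned parquet (no Delta log) directly through adlfs
ds.write_dataset(
    data=dataset,
    base_dir="temp/testParquet",  # "<container>/<path>"; illustrative target
    format="parquet",
    partitioning=["exchange_code", "year"],
    partitioning_flavor="hive",
    filesystem=fs,
)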
What you expected to happen: DeltaTable to be created successfully.
How to reproduce it:
Here is my code:
import os
import polars as pl
import pyarrow.dataset as ds
from deltalake import DeltaTable, write_deltalake
import pyarrow as pa
account_name = os.getenv("AZURE_STORAGE_ACCOUNT_NAME")
client = os.getenv("AZURE_CLIENT_ID")
tenant = os.getenv("AZURE_TENANT_ID")
secret = os.getenv("AZURE_CLIENT_SECRET")
storage_options = {"account_name": account_name, "client_id": client, "client_secret": secret, "tenant_id": tenant}
schema = pa.schema(
    [
        pa.field("date", pa.date32()),
        pa.field("open", pa.float64()),
        pa.field("high", pa.float64()),
        pa.field("low", pa.float64()),
        pa.field("close", pa.float64()),
        pa.field("adjusted_close", pa.float64()),
        pa.field("volume", pa.int64()),
        pa.field("exchange_code", pa.string()),
        pa.field("security_code", pa.string()),
        pa.field("year", pa.int32()),
        pa.field("created_at", pa.timestamp("us")),
    ]
)
dataset = ds.dataset(source="arrow_dataset", format="parquet", schema=schema)
write_deltalake(
    data=dataset.to_batches(),
    table_or_uri="abfs://temp/testDelta",
    storage_options=storage_options,
    schema=schema,
    partition_by=["exchange_code", "year"],
)
If the dataset is needed, please let me know.
More details:
I have tried different values for the args max_rows_per_file, max_rows_per_group, min_rows_per_group, and max_open_files, but with no success.
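One of the attempted variations would look roughly like the following (the specific values are illustrative, not the exact ones tried); as far as I understand, these kwargs are forwarded to the underlying pyarrow dataset writer:

write_deltalake(
    data=dataset.to_batches(),
    table_or_uri="abfs://temp/testDelta",
    storage_options=storage_options,
    schema=schema,
    partition_by=["exchange_code", "year"],
    max_open_files=1024,            # limit concurrently open files
    max_rows_per_file=5_000_000,    # cap file size by row count
    min_rows_per_group=64 * 1024,   # lower bound for row-group size
    max_rows_per_group=128 * 1024,  # upper bound for row-group size
)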
About this issue
- State: closed
- Created 8 months ago
- Comments: 22 (7 by maintainers)
Commits related to this issue
- fix: add buffer flushing to filesystem writes (#1911) # Description Current implementation of `ObjectOutputStream` does not invoke flush when writing out files to Azure storage which seem to cause in... — committed to delta-io/delta-rs by r3stl355 7 months ago
- fix: add buffer flushing to filesystem writes (#1911) # Description Current implementation of `ObjectOutputStream` does not invoke flush when writing out files to Azure storage which seem to cause in... — committed to ion-elgreco/delta-rs by r3stl355 7 months ago
Quick update in case someone wants to follow up (I don’t know how much time I’ll have in the next few days). After lots of back and forth between different environments, we now have a single parquet file that consistently gets stuck on an attempt to write to Azure storage. The file is not large, about 200 MB, and I’m only creating 23 partition folders, with each file in a single partition about 15 MB. On the first run it created one file and one partition folder and got stuck. This is using pyarrow v9; I’ll try with other versions. The parquet file is created from NYC taxi data as suggested by @ion-elgreco, so I can put it up for download somewhere if anyone is interested in trying it out, together with the code.
When it’s stuck, the terminal becomes unresponsive to Ctrl+C. Is that a sign that the main thread is stuck waiting for a lock or on IO?
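One low-effort way to check where it is stuck (a suggestion, not something tried in this thread): the standard-library faulthandler module can periodically dump every Python thread’s stack while the write is hanging. It only shows Python-level frames, so a hang inside the native delta-rs layer will just appear as the blocked write_deltalake call, but that still distinguishes a blocked main thread from other waits.

import faulthandler
import sys

# Dump all Python thread tracebacks to stderr every 60 seconds
faulthandler.dump_traceback_later(60, repeat=True, file=sys.stderr)

write_deltalake(...)  # the call that hangs

# Stop the periodic dumps once the call returns
faulthandler.cancel_dump_traceback_later()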
@ion-elgreco Thanks for following up on this issue.
I’m on vacation and don’t have access to our repo, so I can’t copy-paste the exact code that’s working, but it should be close to the following.
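(The exact snippet is not included here; below is only a rough sketch of that approach, not the actual working code. It assumes write_deltalake accepts a pyarrow filesystem via a filesystem argument; paths and variable names are illustrative.)

import adlfs
import pyarrow.fs as pa_fs
from deltalake import write_deltalake

# Wrap the fsspec-compatible adlfs filesystem so pyarrow (and write_deltalake) can use it
abfs = adlfs.AzureBlobFileSystem(
    account_name=account_name,
    client_id=client,
    client_secret=secret,
    tenant_id=tenant,
)
filesystem = pa_fs.PyFileSystem(pa_fs.FSSpecHandler(abfs))

write_deltalake(
    data=dataset.to_batches(),
    table_or_uri="temp/testDelta",  # "<container>/<path>" as seen by adlfs; illustrative
    schema=schema,
    partition_by=["exchange_code", "year"],
    filesystem=filesystem,
)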
I’m using adlfs as an fsspec-compatible filesystem interface. Hope that helps. I’m back in two weeks and will then be able to provide the working code.