delta-rs: Writing DeltaTable to Azure with partitioning fails and goes into an endless loop without an error message

Environment

Delta-rs version: 0.12.0

Binding: Python

Environment:

  • Cloud provider: Azure Storage Gen2
  • OS: Apple M1 Pro with macOS Sonoma (14.0)
  • Other: Python 3.10.11

Bug

What happened: I have an Arrow dataset (~120m records, ~2.8 GB as parquet files) with the following schema that I want to write to ADLS Gen2, partitioned by the columns exchange_code and year:

schema = pa.schema(
    [
        pa.field("date", pa.date32()),
        pa.field("open", pa.float64()),
        pa.field("high", pa.float64()),
        pa.field("low", pa.float64()),
        pa.field("close", pa.float64()),
        pa.field("adjusted_close", pa.float64()),
        pa.field("volume", pa.int64()),
        pa.field("exchange_code", pa.string()),
        pa.field("security_code", pa.string()),
        pa.field("year", pa.int32()),
        pa.field("created_at", pa.timestamp("us")),
    ]
)

It works successfully when:

  • writing to the local filesystem (takes about 1 min) using deltalake.write_deltalake
  • writing to Azure without any partitioning using deltalake.write_deltalake and the storage_options arg
  • using pyarrow.dataset.write_dataset with partitions and adlfs.AzureBlobFileSystem as the filesystem arg
  • writing only the first 1-5m records to Azure using deltalake.write_deltalake with the above-mentioned partitioning

But it does not work when:

  • writing the full dataset to Azure using deltalake.write_deltalake with the above-mentioned partitioning.

There is no error message; the write just runs forever without creating any directories or files. In some cases it created 2-3 partition directories at the beginning, but then nothing happened thereafter.

Column exchange_code has 15 distinct values, and year has ~10-80 distinct values depending on the exchange.
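
For reference, a quick sketch (against the same local dataset used in the repro below) to estimate how many partition directories the write would create:

import pyarrow.dataset as ds

dataset = ds.dataset("arrow_dataset", format="parquet")

# Distinct (exchange_code, year) combinations = number of partition directories
combos = dataset.to_table(columns=["exchange_code", "year"])
n_partitions = len(combos.group_by(["exchange_code", "year"]).aggregate([]))
print(n_partitions)  # 15 exchanges x ~10-80 years each => up to a few hundred dirs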

What you expected to happen: DeltaTable to be created successfully.

How to reproduce it:

Here is my code:

import os
import polars as pl
import pyarrow.dataset as ds
from deltalake import DeltaTable, write_deltalake
import pyarrow as pa

account_name = os.getenv("AZURE_STORAGE_ACCOUNT_NAME")
client = os.getenv("AZURE_CLIENT_ID")
tenant = os.getenv("AZURE_TENANT_ID")
secret = os.getenv("AZURE_CLIENT_SECRET")

storage_options = {"account_name": account_name, "client_id": client, "client_secret": secret, "tenant_id": tenant}

schema = pa.schema(
    [
        pa.field("date", pa.date32()),
        pa.field("open", pa.float64()),
        pa.field("high", pa.float64()),
        pa.field("low", pa.float64()),
        pa.field("close", pa.float64()),
        pa.field("adjusted_close", pa.float64()),
        pa.field("volume", pa.int64()),
        pa.field("exchange_code", pa.string()),
        pa.field("security_code", pa.string()),
        pa.field("year", pa.int32()),
        pa.field("created_at", pa.timestamp("us")),
    ]
)

dataset = ds.dataset(source="arrow_dataset", format="parquet", schema=schema)

write_deltalake(
    data=dataset.to_batches(),
    table_or_uri="abfs://temp/testDelta",
    storage_options=storage_options,
    schema=schema,
    partition_by=["exchange_code", "year"],
)

If the dataset is needed, please let me know.

More details: I have tried different values for the arguments max_rows_per_file, max_rows_per_group, min_rows_per_group, and max_open_files, but with no success.
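
For illustration, this is roughly how those arguments were passed (the specific values below are placeholders, not the exact combinations I tried):

write_deltalake(
    data=dataset.to_batches(),
    table_or_uri="abfs://temp/testDelta",
    storage_options=storage_options,
    schema=schema,
    partition_by=["exchange_code", "year"],
    max_open_files=64,            # also tried larger and smaller values
    max_rows_per_file=1_000_000,
    min_rows_per_group=100_000,
    max_rows_per_group=500_000,
)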

About this issue

  • State: closed
  • Created 8 months ago
  • Comments: 22 (7 by maintainers)

Most upvoted comments

Quick update in case someone wants to follow up (I don’t know how much time I’ll have in the next few days). After lots of back and forth between different environments, we now have a single parquet file that consistently gets stuck when attempting to write to Azure storage. The file is not large, about 200 MB, and it only creates 23 partition folders, with each file within a partition being about 15 MB. On a first run, it created one file and one partition folder and then got stuck. This is using pyarrow v9. I’ll try with other versions.

The parquet file is created from NYC taxi data, as suggested by @ion-elgreco, so I can put it up for download somewhere, together with the code, if anyone is interested in trying it out.

When it’s stuck, the terminal becomes non-responsive to Ctrl+C. Is that a sign that the main thread is stuck waiting on a lock or I/O?
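
For anyone who wants to try this in the meantime, a rough sketch of the repro, assuming the NYC taxi parquet file is named yellow_tripdata.parquet and using a derived day-of-month column as the partition key (both are my assumptions; the real file and code will be shared separately):

import pyarrow.compute as pc
import pyarrow.parquet as pq
from deltalake import write_deltalake

# Hypothetical file name; a single parquet file of roughly 200 MB.
table = pq.read_table("yellow_tripdata.parquet")

# Derive a low-cardinality partition column (roughly 20-30 distinct values).
table = table.append_column("pickup_day", pc.day(table["tpep_pickup_datetime"]))

write_deltalake(
    table_or_uri="abfs://temp/taxi_delta",
    data=table,
    partition_by=["pickup_day"],
    storage_options=storage_options,  # same Azure credentials dict as in the original repro
)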

@roeap it seems the issue was already flagged by @stefnba.

Since it works fine with PyArrow directly, I would guess it’s the delta filesystem handler. @stefnba, can you confirm how you use PyArrow directly? Did you create your own filesystem handler with the PyArrow filesystem?
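
For anyone who wants to test that theory, a minimal sketch of what such a handler could look like, assuming this deltalake version exposes write_deltalake’s optional filesystem argument and that the table path resolves against the wrapped filesystem (both are assumptions here):

from adlfs import AzureBlobFileSystem
from pyarrow.fs import PyFileSystem, FSSpecHandler
from deltalake import write_deltalake

# Wrap the fsspec-based adlfs filesystem in pyarrow's bridge.
abfs = AzureBlobFileSystem(account_name="my_account_name", anon=False)
wrapped_fs = PyFileSystem(FSSpecHandler(abfs))

write_deltalake(
    table_or_uri="temp/testDelta",          # container/path, assumed to resolve via the wrapped filesystem
    data=dataset.to_batches(),              # same dataset and schema as in the repro above
    schema=schema,
    partition_by=["exchange_code", "year"],
    filesystem=wrapped_fs,                  # bypasses delta-rs's own filesystem handler for the data write
)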

@ion-elgreco Thanks for following up on this issue.

I’m on vacation and don’t have access to our repo, so I can’t copy-paste the exact code that’s working, but it should be close to the following.

I’m using adlfs as an fsspec-compatible filesystem interface.

import pyarrow.dataset as ds
from adlfs import AzureBlobFileSystem

filesystem = AzureBlobFileSystem(account_name="my_account_name", anon=False)

# read local
input_ds = ds.dataset("path/to/local/ds")

# write to Azure using adlfs filesystem
ds.write_dataset(
    data=input_ds,
    base_dir="container/blob",
    filesystem=filesystem,
    partitioning_flavor="hive",
    partitioning=["col1", "col2"],
)

Hope that helps. I’m back in two weeks and will then be able to provide the working code.