delta-rs: Python write_deltalake to S3 fails to write due to "invalid json"

Environment

Delta-rs version: 0.6.2

Binding: Python

Environment: Ubuntu 22.04, Python 3.10, deltalake==0.6.2, Running against non-AWS S3 (Swift)


Bug

What happened: DeltaLake write fails.

My test code to write ‘df’ (a pandas dataframe) to an S3 location:

    storage_options = {
        "AWS_ACCESS_KEY_ID": ACCESS_KEY,
        "AWS_SECRET_ACCESS_KEY": SECRET_KEY,
        "AWS_ENDPOINT_URL": ENDPOINT_URL,
        "AWS_REGION": "us-east-1",
    }
    write_deltalake('s3://joshuarobinson/test_deltalake/', df, storage_options=storage_options)

fails with the following error:

Traceback (most recent call last):
  File "/delta_write.py", line 19, in <module>
    write_deltalake('s3://joshuarobinson/test_deltalake/', df, storage_options=storage_options)
  File "/usr/local/lib/python3.10/site-packages/deltalake/writer.py", line 156, in write_deltalake
    table = try_get_deltatable(table_or_uri)
  File "/usr/local/lib/python3.10/site-packages/deltalake/writer.py", line 332, in try_get_deltatable
    return DeltaTable(table_uri)
  File "/usr/local/lib/python3.10/site-packages/deltalake/table.py", line 91, in __init__
    self._table = RawDeltaTable(
deltalake.PyDeltaTableError: Failed to load checkpoint: Invalid JSON in checkpoint: expected value at line 1 column 1

Note that the destination path is empty, i.e., I’m writing a brand-new table:

$ s5cmd ls s3://joshuarobinson/test_deltalake/
ERROR "ls s3://joshuarobinson/test_deltalake/": no object found

Also tried:

  1. I tested with all four values of “mode” and got the same result.
  2. I also tried manually building a pyarrow filesystem and passing it in, but that did not work either.

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 22

Most upvoted comments

@shazamkash Did you read the error message from the write?

PyDeltaTableError: Failed to read delta log object: Generic DeltaS3ObjectStore error: Atomic rename requires a LockClient for S3 backends. Either configure the LockClient, or set AWS_S3_ALLOW_UNSAFE_RENAME=true to opt out of support for concurrent writers.

The writer tried to make the table, but couldn’t complete the commit. That is why there is a tmp file. This error message is intentional.

If you add AWS_S3_ALLOW_UNSAFE_RENAME=true (either as an environment variable or in storage_options), it should write successfully.
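A minimal sketch of that fix, assuming single-writer access (the credential values and endpoint below are placeholders, and the helper function name is hypothetical):

```python
from typing import Dict

def make_storage_options(access_key: str, secret_key: str,
                         endpoint_url: str, region: str = "us-east-1") -> Dict[str, str]:
    """Build storage_options for write_deltalake against a non-AWS S3 endpoint.

    AWS_S3_ALLOW_UNSAFE_RENAME opts out of the LockClient requirement; this is
    only safe when a single writer touches the table at a time.
    """
    return {
        "AWS_ACCESS_KEY_ID": access_key,
        "AWS_SECRET_ACCESS_KEY": secret_key,
        "AWS_ENDPOINT_URL": endpoint_url,
        "AWS_REGION": region,
        "AWS_S3_ALLOW_UNSAFE_RENAME": "true",
    }

opts = make_storage_options("my-access-key", "my-secret-key", "https://s3.example.com")
# write_deltalake("s3://joshuarobinson/test_deltalake/", df, storage_options=opts)
```

Alternatively, export `AWS_S3_ALLOW_UNSAFE_RENAME=true` as an environment variable before the write.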

I currently get a SignatureDoesNotMatch error when providing credentials.

When doing:

    delta_table = DeltaTable(table_uri=uri, storage_options=self.auth.storage_options)

    writer.write_deltalake(
        data=dataframe,
        table_or_uri=delta_table,
        mode="append",
        overwrite_schema=True,
    )

@joshuarobinson I have the same issue.

Looking at the code, it currently expects the table to already exist:

write_deltalake performs:

    if isinstance(table_or_uri, str):
        if "://" in table_or_uri:
            table_uri = table_or_uri
        else:
            # Non-existant local paths are only accepted as fully-qualified URIs
            table_uri = "file://" + str(Path(table_or_uri).absolute())
        table = try_get_deltatable(table_or_uri)
    else:
        table = table_or_uri
        table_uri = table._table.table_uri()

When try_get_deltatable is called, it then calls DeltaTable.

It seems that DeltaTable needs storage_options to initialise against the remote store, but write_deltalake currently does not pass them through, even when you supply them.

Strange behaviour indeed.
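Since storage_options is dropped before the internal DeltaTable call, one possible workaround (an assumption, not confirmed in this thread) is to supply credentials as environment variables, which the S3 backend also reads; the helper name and values below are placeholders:

```python
import os

def export_s3_env(access_key: str, secret_key: str, endpoint_url: str,
                  region: str = "us-east-1") -> None:
    """Workaround sketch (assumption): export standard AWS_* environment
    variables so the internal DeltaTable constructor, which never receives
    storage_options, can still authenticate against the S3 endpoint."""
    os.environ["AWS_ACCESS_KEY_ID"] = access_key
    os.environ["AWS_SECRET_ACCESS_KEY"] = secret_key
    os.environ["AWS_ENDPOINT_URL"] = endpoint_url
    os.environ["AWS_REGION"] = region

export_s3_env("my-access-key", "my-secret-key", "https://s3.example.com")
# write_deltalake("s3://joshuarobinson/test_deltalake/", df)  # picks up the env credentials
```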