iceberg: GlueTableOperations/DynamoDbTableOperations can delete current metadata file after incorrect exception handling
Apache Iceberg version
1.1.0 (latest release)
Query engine
Spark
Please describe the bug 🐞
We recently encountered an issue whereby GlueTableOperations, while performing an Iceberg commit on behalf of GlueCatalog, can incorrectly interpret a successful commit as a failure, and delete the now-current table metadata file as part of cleanup. This leaves the Iceberg table inaccessible as the “current metadata pointer” now points to a deleted metadata file. We were able to correct this via an engineer manually calling Glue APIs to correct the pointer to the previous metadata file, but this represents an availability risk to our data lake service.
The root cause appears to be the AWS client’s default of 3 attempts per API call, combined with the fact that Iceberg only inspects the exception thrown by the final attempt, as shown here:
org.apache.iceberg.exceptions.CommitFailedException: Cannot commit catalog_name.database_name.table_name because Glue detected concurrent update
Caused by: software.amazon.awssdk.services.glue.model.ConcurrentModificationException: Update table failed due to concurrent modifications. (Service: Glue, Status Code: 400, Request ID: <removed>)
Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 1 failure: Unable to execute HTTP request: Read timed out
Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 2 failure: Service returned error code ServiceUnavailableException (Service: Glue, Status Code: 500, Request ID: <removed>)
We were very quickly able to determine that no other writers were running against this table during the incident, which means the ConcurrentModificationException had to have come from one of the client’s own prior attempts, which updated the catalog despite appearing to fail. If Iceberg had received the standard timeout exception, its exception handling would have correctly called checkCommitStatus and determined the commit was actually successful. However, because it only saw the ConcurrentModificationException from the final attempt, it treated the commit as failed and performed cleanup it should not have. Notably, it would also have exhibited this incorrect behavior if the ServiceUnavailableException had been the last attempt.
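For illustration, here is a condensed, hypothetical sketch of the exception-handling pattern at issue; it is not the actual GlueTableOperations source, and `GlueCommitSketch`, `doCommit`, and the `checkCommitStatus` stub are illustrative stand-ins:

```java
// Condensed, hypothetical sketch of the failure mode; this is not the actual
// GlueTableOperations source. GlueCommitSketch, doCommit, and the
// checkCommitStatus stub are illustrative stand-ins.
import software.amazon.awssdk.services.glue.GlueClient;
import software.amazon.awssdk.services.glue.model.ConcurrentModificationException;
import software.amazon.awssdk.services.glue.model.UpdateTableRequest;

class GlueCommitSketch {
  enum CommitStatus { SUCCESS, FAILURE, UNKNOWN }

  CommitStatus doCommit(GlueClient glue, UpdateTableRequest request, String newMetadataLocation) {
    try {
      // The SDK retries internally (3 attempts by default). If attempt 1 succeeds
      // server-side but its response is lost to a read timeout, attempt 3 fails with
      // ConcurrentModificationException because the table version already advanced.
      glue.updateTable(request);
      return CommitStatus.SUCCESS;
    } catch (ConcurrentModificationException e) {
      // Only the final attempt's exception is visible here, so it is treated as a
      // real concurrent writer: the commit is marked FAILURE and the caller cleans
      // up (deletes) newMetadataLocation, the file Glue now actually points at.
      return CommitStatus.FAILURE;
    } catch (RuntimeException e) {
      // Ambiguous failures (e.g. a timeout on the final attempt) take this path,
      // where re-checking the catalog would reveal the commit actually landed.
      return checkCommitStatus(newMetadataLocation);
    }
  }

  private CommitStatus checkCommitStatus(String newMetadataLocation) {
    // Stand-in: the real logic re-reads the catalog and compares its current
    // metadata location against newMetadataLocation.
    return CommitStatus.UNKNOWN;
  }
}
```

The key point is that the ConcurrentModificationException branch never re-verifies the commit, even though a retried request can race against its own earlier, silently successful attempt.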
As expected, Iceberg then attempts to refresh its metadata and retry the commit. Unfortunately, it has just deleted the very object the metadata pointer directs to, resulting in:
org.apache.spark.SparkException: Writing job aborted
Caused by: org.apache.iceberg.exceptions.NotFoundException: Location does not exist: s3://fake-bucket-name/database_name.db/table_name/metadata/06814-ec5ff66c-af38-492c-ba38-55610536d9a7.metadata.json
Caused by: software.amazon.awssdk.services.s3.model.NoSuchKeyException: The specified key does not exist. (Service: S3, Status Code: 404, Request ID: <removed>, Extended Request ID: <removed>)
While investigating, I noticed the same sequence of events would also cause DynamoDbTableOperations, which uses an AWS client configured in the same way, to take the same incorrect action, with the same outcome of the table becoming inaccessible.
Note: Removed some solution-specific information from error logs.
About this issue
- State: closed
- Created a year ago
- Reactions: 1
- Comments: 16 (12 by maintainers)
@c0d3monk So I consulted the Engine Support Page, and Flink 1.13 support is considered End of Life; after asking around, an official backport to an old Iceberg version is probably not in the cards. Based on this, my recommendation would be one of these 3 options:
Apologies for the inconvenience.
@ryanyuan @c0d3monk So, if the issue actually occurs, the way to recover the table is as stated in the overview: a manual Glue `UpdateTable` API call that sets the `metadata_location` property back to the value of `previous_metadata_location`. But that is not so much a workaround as recovery once it happens. Since the missing-metadata failure condition is easy to detect in code once it happens (by catching NotFoundException), it would technically be possible to automate this fix and then retry the job, though you would have to be careful not to change anything else in the Glue table metadata in the process.
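For reference, a minimal sketch of that manual repair, assuming the AWS SDK v2 for Java; the class name and the database/table names are placeholders, and the parameter keys match the ones Iceberg's Glue catalog writes (`metadata_location` / `previous_metadata_location`):

```java
// Minimal repair sketch (placeholder class and table names). It copies the
// existing Glue table definition and rewrites only the metadata_location
// parameter to the value stored in previous_metadata_location.
import java.util.HashMap;
import java.util.Map;

import software.amazon.awssdk.services.glue.GlueClient;
import software.amazon.awssdk.services.glue.model.GetTableRequest;
import software.amazon.awssdk.services.glue.model.Table;
import software.amazon.awssdk.services.glue.model.TableInput;
import software.amazon.awssdk.services.glue.model.UpdateTableRequest;

public class RestoreMetadataPointer {
  public static void main(String[] args) {
    String database = "database_name";
    String tableName = "table_name";

    try (GlueClient glue = GlueClient.create()) {
      Table table = glue.getTable(GetTableRequest.builder()
              .databaseName(database)
              .name(tableName)
              .build())
          .table();

      Map<String, String> params = new HashMap<>(table.parameters());
      String previous = params.get("previous_metadata_location");
      if (previous == null) {
        throw new IllegalStateException("No previous_metadata_location to roll back to");
      }
      // Point the table back at the previous (still existing) metadata file,
      // leaving every other table attribute untouched.
      params.put("metadata_location", previous);

      TableInput input = TableInput.builder()
          .name(table.name())
          .tableType(table.tableType())
          .parameters(params)
          .storageDescriptor(table.storageDescriptor())
          .build();

      glue.updateTable(UpdateTableRequest.builder()
          .databaseName(database)
          .tableInput(input)
          .build());
    }
  }
}
```

The important detail is that the TableInput copies the existing table definition and rewrites only the one parameter, so no other Glue table attributes change.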
As for a workaround that avoids ending up in this situation in the first place, some potential options besides my pending upstream fix are to:
- Set `s3.delete-enabled` to `false` so the step that actually corrupts the table becomes a no-op. If you do this, though, you will orphan any files the system attempts to delete/expire, so make sure you don’t have that option set in whatever context you use to run `DeleteOrphanFiles`.
- Use a modified `S3FileIO` whose `deleteFile` no-ops for metadata files, then specify that modified version of `S3FileIO` for your `io-impl` parameter (a rough sketch follows at the end of this comment).

I have pulled together a rough prototype of this solution and should have a proper PR out within the week.
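For the second option, a rough sketch of what such a modified `S3FileIO` could look like; `MetadataPreservingS3FileIO` and its package are hypothetical names, and this assumes subclassing Iceberg's `S3FileIO` is acceptable in your environment:

```java
package com.example;

import org.apache.iceberg.aws.s3.S3FileIO;

// Hypothetical FileIO that refuses to delete Iceberg metadata JSON files, so a
// commit wrongly judged as failed cannot remove the file the catalog points at.
public class MetadataPreservingS3FileIO extends S3FileIO {
  @Override
  public void deleteFile(String path) {
    if (path.endsWith(".metadata.json")) {
      // Leave metadata files in place; they can be cleaned up later once the
      // table state is known to be safe.
      return;
    }
    super.deleteFile(path);
  }
}
```

You would then point the catalog's `io-impl` property at this class, for example `spark.sql.catalog.<catalog>.io-impl=com.example.MetadataPreservingS3FileIO` in a Spark context.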