iceberg: GlueTableOperations/DynamoDbTableOperations can delete current metadata file after incorrect exception handling
Apache Iceberg version
1.1.0 (latest release)
Query engine
Spark
Please describe the bug 🐞
We recently encountered an issue whereby GlueTableOperations, while performing an Iceberg commit on behalf of GlueCatalog, can incorrectly interpret a successful commit as a failure, and delete the now-current table metadata file as part of cleanup. This leaves the Iceberg table inaccessible as the “current metadata pointer” now points to a deleted metadata file. We were able to correct this via an engineer manually calling Glue APIs to correct the pointer to the previous metadata file, but this represents an availability risk to our data lake service.
The root cause appears to be the AWS client’s default of 3 attempts per API call, combined with the fact that Iceberg only inspects the exception thrown by the final attempt, as shown here:
org.apache.iceberg.exceptions.CommitFailedException: Cannot commit catalog_name.database_name.table_name because Glue detected concurrent update
Caused by: software.amazon.awssdk.services.glue.model.ConcurrentModificationException: Update table failed due to concurrent modifications. (Service: Glue, Status Code: 400, Request ID: <removed>)
Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 1 failure: Unable to execute HTTP request: Read timed out
Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 2 failure: Service returned error code ServiceUnavailableException (Service: Glue, Status Code: 500, Request ID: <removed>)
We were very quickly able to determine that no other writers were running against this table during the incident, which means the ConcurrentModificationException had to have come from one of the client’s own prior attempts, which updated the catalog despite appearing to fail. If Iceberg had received the standard timeout exception, its exception handling would have correctly called checkCommitStatus and determined the commit was actually successful. However, because it only saw the ConcurrentModificationException from the final attempt, it treated the commit as failed and performed cleanup it should not have. Notably, it would also have exhibited this incorrect behavior if the ServiceUnavailableException had been the last attempt.
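For illustration, here is a condensed, hypothetical sketch of the exception-handling pattern at issue; it is not the actual GlueTableOperations source, and `GlueCommitSketch`, `doCommit`, and the `checkCommitStatus` stub are illustrative stand-ins:

```java
// Condensed, hypothetical sketch of the failure mode; this is not the actual
// GlueTableOperations source. GlueCommitSketch, doCommit, and the
// checkCommitStatus stub are illustrative stand-ins.
import software.amazon.awssdk.services.glue.GlueClient;
import software.amazon.awssdk.services.glue.model.ConcurrentModificationException;
import software.amazon.awssdk.services.glue.model.UpdateTableRequest;

class GlueCommitSketch {
  enum CommitStatus { SUCCESS, FAILURE, UNKNOWN }

  CommitStatus doCommit(GlueClient glue, UpdateTableRequest request, String newMetadataLocation) {
    try {
      // The SDK retries internally (3 attempts by default). If attempt 1 succeeds
      // server-side but its response is lost to a read timeout, attempt 3 fails with
      // ConcurrentModificationException because the table version already advanced.
      glue.updateTable(request);
      return CommitStatus.SUCCESS;
    } catch (ConcurrentModificationException e) {
      // Only the final attempt's exception is visible here, so it is treated as a
      // real concurrent writer: the commit is marked FAILURE and the caller cleans
      // up (deletes) newMetadataLocation, the file Glue now actually points at.
      return CommitStatus.FAILURE;
    } catch (RuntimeException e) {
      // Ambiguous failures (e.g. a timeout on the final attempt) take this path,
      // where re-checking the catalog would reveal the commit actually landed.
      return checkCommitStatus(newMetadataLocation);
    }
  }

  private CommitStatus checkCommitStatus(String newMetadataLocation) {
    // Stand-in: the real logic re-reads the catalog and compares its current
    // metadata location against newMetadataLocation.
    return CommitStatus.UNKNOWN;
  }
}
```

The key point is that the ConcurrentModificationException branch never re-verifies the commit, even though a retried request can race against its own earlier, silently successful attempt.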
As expected, Iceberg then attempts to refresh its metadata and retry the commit. Unfortunately, it has just deleted the very object the metadata pointer directs to, resulting in:
org.apache.spark.SparkException: Writing job aborted
Caused by: org.apache.iceberg.exceptions.NotFoundException: Location does not exist: s3://fake-bucket-name/database_name.db/table_name/metadata/06814-ec5ff66c-af38-492c-ba38-55610536d9a7.metadata.json
Caused by: software.amazon.awssdk.services.s3.model.NoSuchKeyException: The specified key does not exist. (Service: S3, Status Code: 404, Request ID: <removed>, Extended Request ID: <removed>)
While investigating, I noticed the same sequence of events would also cause DynamoDbTableOperations, which uses an AWS client configured in the same way, to take the same incorrect action, with the same outcome of the table becoming inaccessible.
Note: Removed some solution-specific information from error logs.
About this issue
- State: closed
- Created a year ago
- Reactions: 1
- Comments: 16 (12 by maintainers)
@c0d3monk So I consulted the Engine Support Page, and Flink 1.13 support is considered End of Life; after asking around, an official backport to an old Iceberg version is probably not in the cards. Based on this, my recommendation would be one of these 3 options:
Apologies for the inconvenience.
@ryanyuan @c0d3monk So, if the issue actually occurs, the way to recover the table is as stated in the overview: a manual Glue `UpdateTable` API call that sets the `metadata_location` property back to the value of `previous_metadata_location`. But that is not so much a workaround as recovery once it happens. Since the missing-metadata failure condition is easy to detect in code once it happens (by catching NotFoundException), it would technically be possible to automate this fix and then retry the job, though you would have to be careful not to change anything else in the Glue table metadata in the process.
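For reference, a minimal sketch of that manual repair, assuming the AWS SDK v2 for Java; the class name and the database/table names are placeholders, and the parameter keys match the ones Iceberg's Glue catalog writes (`metadata_location` / `previous_metadata_location`):

```java
// Minimal repair sketch (placeholder class and table names). It copies the
// existing Glue table definition and rewrites only the metadata_location
// parameter to the value stored in previous_metadata_location.
import java.util.HashMap;
import java.util.Map;

import software.amazon.awssdk.services.glue.GlueClient;
import software.amazon.awssdk.services.glue.model.GetTableRequest;
import software.amazon.awssdk.services.glue.model.Table;
import software.amazon.awssdk.services.glue.model.TableInput;
import software.amazon.awssdk.services.glue.model.UpdateTableRequest;

public class RestoreMetadataPointer {
  public static void main(String[] args) {
    String database = "database_name";
    String tableName = "table_name";

    try (GlueClient glue = GlueClient.create()) {
      Table table = glue.getTable(GetTableRequest.builder()
              .databaseName(database)
              .name(tableName)
              .build())
          .table();

      Map<String, String> params = new HashMap<>(table.parameters());
      String previous = params.get("previous_metadata_location");
      if (previous == null) {
        throw new IllegalStateException("No previous_metadata_location to roll back to");
      }
      // Point the table back at the previous (still existing) metadata file,
      // leaving every other table attribute untouched.
      params.put("metadata_location", previous);

      TableInput input = TableInput.builder()
          .name(table.name())
          .tableType(table.tableType())
          .parameters(params)
          .storageDescriptor(table.storageDescriptor())
          .build();

      glue.updateTable(UpdateTableRequest.builder()
          .databaseName(database)
          .tableInput(input)
          .build());
    }
  }
}
```

The important detail is that the TableInput copies the existing table definition and rewrites only the one parameter, so no other Glue table attributes change.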
As for a workaround that avoids ending up in this situation in the first place, some potential options besides my pending upstream fix are to:
- Set `s3.delete-enabled` to `false` so the step that actually corrupts the table becomes a no-op. If you do this, though, you will orphan any files the system attempts to delete/expire, so make sure you don’t have that option set in whatever context you use to run `DeleteOrphanFiles`.
- Use a modified `S3FileIO` whose `deleteFile` no-ops for metadata files, then specify that modified version of `S3FileIO` for your `io-impl` parameter (a rough sketch follows at the end of this comment).

I have pulled together a rough prototype of this solution and should have a proper PR out within the week.
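For the second option, a rough sketch of what such a modified `S3FileIO` could look like; `MetadataPreservingS3FileIO` and its package are hypothetical names, and this assumes subclassing Iceberg's `S3FileIO` is acceptable in your environment:

```java
package com.example;

import org.apache.iceberg.aws.s3.S3FileIO;

// Hypothetical FileIO that refuses to delete Iceberg metadata JSON files, so a
// commit wrongly judged as failed cannot remove the file the catalog points at.
public class MetadataPreservingS3FileIO extends S3FileIO {
  @Override
  public void deleteFile(String path) {
    if (path.endsWith(".metadata.json")) {
      // Leave metadata files in place; they can be cleaned up later once the
      // table state is known to be safe.
      return;
    }
    super.deleteFile(path);
  }
}
```

You would then point the catalog's `io-impl` property at this class, for example `spark.sql.catalog.<catalog>.io-impl=com.example.MetadataPreservingS3FileIO` in a Spark context.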