iceberg: Duplicate file name in Iceberg's metadata
Apache Iceberg version
1.3.1
Query engine
Spark
Please describe the bug 🐞
While writing data to an Iceberg table with Spark Streaming 3.4.1 / Iceberg 1.3.1 / EMR 6.13, we observe multiple entries in the table's metadata for a single file (same path and name).
+-------+-----------------------------------------------------------------------------------------------------------------+-----------+-------+---------+------------+------------------+
|content|file_path |file_format|spec_id|partition|record_count|file_size_in_bytes|
+-------+-----------------------------------------------------------------------------------------------------------------+-----------+-------+---------+------------+------------------+
|0 |s3://<redacted>/table/data/time_hour=2023-10-12-16/00030-20515-7162e9b9-8d49-4d04-a828-6725e75da400-00001.parquet|PARQUET |0 |{471424} |1176385 |52215529 |
|0 |s3://<redacted>/table/data/time_hour=2023-10-12-16/00030-20515-7162e9b9-8d49-4d04-a828-6725e75da400-00001.parquet|PARQUET |0 |{471424} |1152053 |51648666 |
+-------+-----------------------------------------------------------------------------------------------------------------+-----------+-------+---------+------------+------------------+
This causes issues when reading data with Athena, but not when reading with Spark (or when opening the Parquet files directly with parquet-cli). The two occurrences of the file also belong to different snapshots:
+------+-------------------+---------------+--------------------+-----------------------------------------------------------------------------------------------------------------+
|status|snapshot_id |sequence_number|file_sequence_number|file_path |
+------+-------------------+---------------+--------------------+-----------------------------------------------------------------------------------------------------------------+
|1 |5798287735063119103|605 |605 |s3://<redacted>/table/data/time_hour=2023-10-12-16/00030-20515-7162e9b9-8d49-4d04-a828-6725e75da400-00001.parquet|
|1 |48372161143873894 |604 |604 |s3://<redacted>/table/data/time_hour=2023-10-12-16/00030-20515-7162e9b9-8d49-4d04-a828-6725e75da400-00001.parquet|
+------+-------------------+---------------+--------------------+-----------------------------------------------------------------------------------------------------------------+
This seems very similar to https://github.com/apache/iceberg/issues/8427 and https://github.com/apache/iceberg/issues/8609.
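For anyone wanting to check their own table for the same symptom, here is a minimal sketch of the duplicate check. The SQL string targets Iceberg's `files` metadata table; the table name `my_catalog.my_db.my_table` is a placeholder. The pure-Python helper `find_duplicate_paths` is a hypothetical name and does the same grouping on rows already fetched. The first two sample rows come from the output above; the third is made up for contrast.

```python
from collections import defaultdict

# Placeholder table name; the `files` metadata table is addressed by
# appending `.files` to the table identifier.
DUPLICATE_PATHS_SQL = """
SELECT file_path, COUNT(*) AS entries
FROM my_catalog.my_db.my_table.files
GROUP BY file_path
HAVING COUNT(*) > 1
"""


def find_duplicate_paths(rows):
    """Group (file_path, record_count) rows; keep paths listed more than once."""
    by_path = defaultdict(list)
    for file_path, record_count in rows:
        by_path[file_path].append(record_count)
    return {path: counts for path, counts in by_path.items() if len(counts) > 1}


PREFIX = "s3://<redacted>/table/data/time_hour=2023-10-12-16/"
DUPLICATED = PREFIX + "00030-20515-7162e9b9-8d49-4d04-a828-6725e75da400-00001.parquet"

rows = [
    (DUPLICATED, 1176385),  # from the report
    (DUPLICATED, 1152053),  # from the report
    (PREFIX + "hypothetical-other-file.parquet", 1000),  # made up, not in the report
]

duplicates = find_duplicate_paths(rows)
for path, counts in duplicates.items():
    print(f"{path}: {len(counts)} entries, record counts {counts}")
```

The differing record counts for the same path suggest that two distinct writes targeted one object key, rather than one file being registered twice.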
About this issue
- State: open
- Created 8 months ago
- Reactions: 4
- Comments: 15 (9 by maintainers)
OK, I have now looked at the history of these changes: https://github.com/apache/iceberg/pull/5214 was never merged, but it was followed by https://github.com/apache/iceberg/pull/6569/files, which actually applied the change and would have been released in 1.2.0. I think your suspicion is correct, @github-raphael-douyere.
The goal of including the query ID appears to be identifying which Spark job actually performed the write. Previously there would have been a new UUID per write, so files would not have stepped on each other.
Let me try to get a reproducible example (we would want one anyway to verify that whatever fix we make actually works). Ideally we can get the best of both worlds: I think some combination of the query ID + the hostname + the thread ID would be truly unique and enable better debugging (at the cost of a really long filename 😃).
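To make the idea concrete, here is a rough sketch of the naming scheme, not Iceberg's actual code. The layout `<partition>-<task>-<operation id>-<file count>` is inferred from the file name in the report, and the function names, hostnames, and thread IDs below are hypothetical. It shows how two writers that share the query ID and the other components end up with the same name, and how adding hostname and thread ID separates them.

```python
import socket
import threading

# Query ID taken from the duplicated file name in the report.
QUERY_ID = "7162e9b9-8d49-4d04-a828-6725e75da400"


def current_style_name(partition_id, task_id, operation_id, file_count, ext="parquet"):
    """Name layout inferred from the report (an assumption, not the real implementation)."""
    return f"{partition_id:05d}-{task_id}-{operation_id}-{file_count:05d}.{ext}"


def proposed_name(partition_id, task_id, operation_id, file_count,
                  hostname=None, thread_id=None, ext="parquet"):
    """Hypothetical variant that also embeds hostname and thread ID."""
    hostname = socket.gethostname() if hostname is None else hostname
    thread_id = threading.get_ident() if thread_id is None else thread_id
    return (f"{partition_id:05d}-{task_id}-{operation_id}-"
            f"{hostname}-{thread_id}-{file_count:05d}.{ext}")


# Two writers that reuse the same query ID, partition ID, task ID and counter.
first = current_style_name(30, 20515, QUERY_ID, 1)
second = current_style_name(30, 20515, QUERY_ID, 1)
print(first == second)  # the names collide

# Same inputs, but on different (made-up) hosts: the names no longer collide.
first_fixed = proposed_name(30, 20515, QUERY_ID, 1, hostname="executor-a", thread_id=11)
second_fixed = proposed_name(30, 20515, QUERY_ID, 1, hostname="executor-b", thread_id=11)
print(first_fixed != second_fixed)
```

The query ID stays in the name, so the write can still be traced back to its Spark job; the extra components only disambiguate writers that share it.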