iceberg: Duplicate file name in Iceberg's metadata

Apache Iceberg version

1.3.1

Query engine

Spark

Please describe the bug 🐞

While writing data to an Iceberg table using Spark Streaming 3.4.1 / Iceberg 1.3.1 / EMR 6.13, we observe multiple entries in the table’s metadata for a single file (same path and name).

+-------+-----------------------------------------------------------------------------------------------------------------+-----------+-------+---------+------------+------------------+
|content|file_path                                                                                                        |file_format|spec_id|partition|record_count|file_size_in_bytes|
+-------+-----------------------------------------------------------------------------------------------------------------+-----------+-------+---------+------------+------------------+
|0      |s3://<redacted>/table/data/time_hour=2023-10-12-16/00030-20515-7162e9b9-8d49-4d04-a828-6725e75da400-00001.parquet|PARQUET    |0      |{471424} |1176385     |52215529          |
|0      |s3://<redacted>/table/data/time_hour=2023-10-12-16/00030-20515-7162e9b9-8d49-4d04-a828-6725e75da400-00001.parquet|PARQUET    |0      |{471424} |1152053     |51648666          |
+-------+-----------------------------------------------------------------------------------------------------------------+-----------+-------+---------+------------+------------------+

This causes issues when reading data with Athena, but not when reading with Spark (or when opening the Parquet files directly with parquet-cli). We also see that the two occurrences of the file belong to different snapshots:

+------+-------------------+---------------+--------------------+-----------------------------------------------------------------------------------------------------------------+
|status|snapshot_id        |sequence_number|file_sequence_number|file_path                                                                                                        |
+------+-------------------+---------------+--------------------+-----------------------------------------------------------------------------------------------------------------+
|1     |5798287735063119103|605            |605                 |s3://<redacted>/table/data/time_hour=2023-10-12-16/00030-20515-7162e9b9-8d49-4d04-a828-6725e75da400-00001.parquet|
|1     |48372161143873894  |604            |604                 |s3://<redacted>/table/data/time_hour=2023-10-12-16/00030-20515-7162e9b9-8d49-4d04-a828-6725e75da400-00001.parquet|
+------+-------------------+---------------+--------------------+-----------------------------------------------------------------------------------------------------------------+
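A check like the one in the two tables above can be scripted. The following is a minimal sketch in Python, assuming the metadata rows have already been collected into dictionaries with `file_path` and `snapshot_id` keys (for example from the table's metadata tables). The function name `find_duplicate_paths` and the dictionary shape are hypothetical, chosen for illustration; they are not part of Iceberg.

```python
from collections import defaultdict


def find_duplicate_paths(entries):
    """Return {file_path: [snapshot_id, ...]} for paths listed more than once.

    `entries` is an iterable of dicts with 'file_path' and 'snapshot_id'
    keys. This shape is an assumption made for this sketch.
    """
    snapshots_by_path = defaultdict(list)
    for entry in entries:
        snapshots_by_path[entry["file_path"]].append(entry["snapshot_id"])
    # Keep only the paths that appear in more than one metadata entry.
    return {
        path: ids for path, ids in snapshots_by_path.items() if len(ids) > 1
    }


# The two rows reported above, with the bucket redacted as in the issue.
PATH = (
    "s3://<redacted>/table/data/time_hour=2023-10-12-16/"
    "00030-20515-7162e9b9-8d49-4d04-a828-6725e75da400-00001.parquet"
)
rows = [
    {"file_path": PATH, "snapshot_id": 5798287735063119103},
    {"file_path": PATH, "snapshot_id": 48372161143873894},
]

duplicates = find_duplicate_paths(rows)
for path, snapshot_ids in duplicates.items():
    print(path, snapshot_ids)
```

For the rows in this report, the result maps the single path to both snapshot IDs, which matches the observation that the same file was committed by two different snapshots.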

This seems very similar to https://github.com/apache/iceberg/issues/8427 and https://github.com/apache/iceberg/issues/8609.

About this issue

  • State: open
  • Created 8 months ago
  • Reactions: 4
  • Comments: 15 (9 by maintainers)

Most upvoted comments

OK, I actually looked at the history of these changes now. https://github.com/apache/iceberg/pull/5214 was never merged, but it was followed by https://github.com/apache/iceberg/pull/6569/files, which actually applied the change and would’ve been released in 1.2.0. I think your suspicion is correct, @github-raphael-douyere.

The goal of including the query ID looks to be identifying which Spark job actually performed the write. Previously there would’ve been a new UUID per write, and we would’ve avoided files stepping on each other.

Let me try to get a reproducible example (we would want one anyway, to verify that whatever fix we make actually works). Ideally we can get the best of both worlds: I think some combination of the query ID + the hostname + the thread ID would be truly unique and enable better debugging (at the cost of a really long filename 😃).
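To make the naming idea concrete, here is a rough sketch in Python. It is not Iceberg's implementation: the function names, the parameter names (`partition_id`, `task_id`, `file_count`), and the reading of the numeric parts of the file name are all assumptions, inferred only from the shape of the path reported in this issue. It shows that a name built from the query ID alone repeats when two writes share the same values, while adding the hostname and thread ID separates them.

```python
import socket
import threading


def name_with_query_id(partition_id, task_id, query_id, file_count):
    """Build a name shaped like the one in the report (layout is assumed)."""
    return f"{partition_id:05d}-{task_id}-{query_id}-{file_count:05d}.parquet"


def name_with_writer_identity(partition_id, task_id, query_id, file_count,
                              hostname=None, thread_id=None):
    """Hypothetical variant: query ID + hostname + thread ID, as suggested."""
    host = hostname if hostname is not None else socket.gethostname()
    tid = thread_id if thread_id is not None else threading.get_ident()
    return (
        f"{partition_id:05d}-{task_id}-{query_id}-{host}-{tid}"
        f"-{file_count:05d}.parquet"
    )


QUERY_ID = "7162e9b9-8d49-4d04-a828-6725e75da400"

# Two writes that share every input collide on the file name.
first = name_with_query_id(30, 20515, QUERY_ID, 1)
second = name_with_query_id(30, 20515, QUERY_ID, 1)
print(first == second)

# Adding host and thread identity keeps the query ID visible for debugging
# while separating writers that run on different hosts or threads.
a = name_with_writer_identity(30, 20515, QUERY_ID, 1,
                              hostname="host-a", thread_id=101)
b = name_with_writer_identity(30, 20515, QUERY_ID, 1,
                              hostname="host-b", thread_id=202)
print(a != b)
```

One caveat with this sketch: two writes from the same host and the same thread would still produce the same name, so whether this combination is unique enough depends on how the colliding writes actually arise, which is what the reproducible example is meant to establish.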