iceberg: Iceberg partition date/day equality problem when created on timestamp

spark.sql("""CREATE TABLE iceberg.glue_catalog.iceberg_table ( order_id bigint,
customer_id bigint, order_amount DECIMAL(10,2), category string, trans_ts Timestamp)
USING iceberg
location 's3://xxx/yyyy/iceberg_table'
PARTITIONED BY (bucket(5, order_id), date(trans_ts))
TBLPROPERTIES ('format-version' = '2') """)

spark.sql("""ALTER TABLE iceberg.glue_catalog.iceberg_table ADD PARTITION FIELD years(trans_ts)  """)
spark.sql("""ALTER TABLE iceberg.glue_catalog.iceberg_table ADD PARTITION FIELD months(trans_ts)  """)
spark.sql("""ALTER TABLE iceberg.glue_catalog.iceberg_table ADD PARTITION FIELD hours(trans_ts) """)
spark.sql("""ALTER TABLE iceberg.glue_catalog.iceberg_table ADD PARTITION FIELD category """)

spark.sql("""INSERT INTO iceberg.glue_catalog.iceberg_table
VALUES( 10001, 001, 06.17, 'soap', cast('2019-06-13 13:22:30' as timestamp) )
""")

Greater than works

trino> select * from iceberg.glue_catalog.iceberg_table where trans_ts > date'2019-06-12';
 order_id | customer_id | order_amount | category |            trans_ts
----------+-------------+--------------+----------+--------------------------------
    10001 |           1 |         6.17 | soap     | 2019-06-13 17:22:30.000000 UTC

BETWEEN works as well

trino> select * from iceberg.glue_catalog.iceberg_table where trans_ts BETWEEN date'2019-06-12' and date'2019-06-14' ;
 order_id | customer_id | order_amount | category |            trans_ts
----------+-------------+--------------+----------+--------------------------------
    10001 |           1 |         6.17 | soap     | 2019-06-13 17:22:30.000000 UTC

Equality doesn’t work

trino> select * from iceberg.glue_catalog.iceberg_table where trans_ts=date'2019-06-13';
 order_id | customer_id | order_amount | category | trans_ts
----------+-------------+--------------+----------+----------
(0 rows)
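If the intent of the equality predicate is "all rows on 2019-06-13", one workaround is to compare the date part of the timestamp rather than the raw value, e.g. `CAST(trans_ts AS DATE) = DATE '2019-06-13'` in Trino. The semantics, sketched in plain Python:

```python
from datetime import date, datetime

trans_ts = datetime(2019, 6, 13, 17, 22, 30)

# Comparing the date part matches any time on that day,
# unlike equality against the raw timestamp.
print(trans_ts.date() == date(2019, 6, 13))  # True
```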

About this issue

  • Original URL
  • State: open
  • Created a year ago
  • Reactions: 1
  • Comments: 17 (8 by maintainers)

Most upvoted comments

Is this a problem? I think it is correct, although not necessarily intuitive.

The filter is trans_ts=date'2019-06-13', but the only trans_ts value in the table is 2019-06-13 17:22:30.000000 UTC, which is NOT equal to 2019-06-13 00:00:00.000000.

I can see how this would be unexpected: one might assume the timestamp is converted to a date, but instead the date is converted to a timestamp, and then the equality check runs.

I think this behaves the same in both Trino and Spark, because they both cast values to the wider type, which in this case is timestamp rather than date.
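The widening described above can be reproduced outside either engine; a small Python sketch of the comparison semantics:

```python
from datetime import date, datetime

trans_ts = datetime(2019, 6, 13, 17, 22, 30)   # the stored value
d = date(2019, 6, 13)                          # the date literal

# The date literal is widened to a timestamp at midnight before comparing,
# mirroring trans_ts <op> date'2019-06-13' in Trino and Spark.
widened = datetime(d.year, d.month, d.day)     # 2019-06-13 00:00:00

print(trans_ts > widened)                            # True:  > matches
print(widened <= trans_ts <= datetime(2019, 6, 14))  # True:  BETWEEN matches
print(trans_ts == widened)                           # False: equality does not
```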

I think the issue is in Iceberg. I'm wrapping up a project, and then I'm planning to look into this. I suspect that when we evaluate the predicate against the file statistics, we incorrectly decide to skip the file.