iceberg: date/day equality predicate returns no rows when the partition is created on a timestamp column
spark.sql("""CREATE TABLE iceberg.glue_catalog.iceberg_table ( order_id bigint,
customer_id bigint, order_amount DECIMAL(10,2), category string, trans_ts Timestamp)
USING iceberg
location 's3://xxx/yyyy/iceberg_table'
PARTITIONED BY (bucket(5, order_id), date(trans_ts))
TBLPROPERTIES ('format-version' = '2') """)
spark.sql("""ALTER TABLE iceberg.glue_catalog.iceberg_table ADD PARTITION FIELD years(trans_ts) """)
spark.sql("""ALTER TABLE iceberg.glue_catalog.iceberg_table ADD PARTITION FIELD months(trans_ts) """)
spark.sql("""ALTER TABLE iceberg.glue_catalog.iceberg_table ADD PARTITION FIELD hours(trans_ts) """)
spark.sql("""ALTER TABLE iceberg.glue_catalog.iceberg_table ADD PARTITION FIELD category """)
spark.sql("""INSERT INTO iceberg.glue_catalog.iceberg_table
VALUES( 10001, 001, 06.17, 'soap', cast('2019-06-13 13:22:30' as timestamp) )
""")
Greater than works
trino> select * from iceberg.glue_catalog.iceberg_table where trans_ts > date'2019-06-12';
order_id | customer_id | order_amount | category | trans_ts
----------+-------------+--------------+----------+--------------------------------
10001 | 1 | 6.17 | soap | 2019-06-13 17:22:30.000000 UTC
Between works well
trino> select * from iceberg.glue_catalog.iceberg_table where trans_ts BETWEEN date'2019-06-12' and date'2019-06-14';
order_id | customer_id | order_amount | category | trans_ts
----------+-------------+--------------+----------+--------------------------------
10001 | 1 | 6.17 | soap | 2019-06-13 17:22:30.000000 UTC
Equality doesn’t work
trino> select * from iceberg.glue_catalog.iceberg_table where trans_ts = date'2019-06-13';
order_id | customer_id | order_amount | category | trans_ts
----------+-------------+--------------+----------+----------
(0 rows)
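A possible workaround for the intended day-level match is to cast the column down to a date instead of letting the literal be widened (a sketch against the same table; this should return the row, since cast(trans_ts as date) yields 2019-06-13):
trino> select * from iceberg.glue_catalog.iceberg_table where cast(trans_ts as date) = date'2019-06-13';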
About this issue
- Original URL
- State: open
- Created a year ago
- Reactions: 1
- Comments: 17 (8 by maintainers)
Is this a problem? I think it is correct, although not necessarily intuitive.
The filter is trans_ts = date'2019-06-13', but the only trans_ts in the table is 2019-06-13 17:22:30.000000 UTC, which is NOT equal to 2019-06-13 00:00:00.000000. I can see how this would be unexpected: you might expect the timestamp to be converted to a date, but instead the date is converted to a timestamp, and then the equality check runs.
I think this behaves the same in both Trino and Spark, because both cast values to the wider type, in this case timestamp rather than date.
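To illustrate the coercion direction, the failing filter should behave exactly like the following query, with the date literal widened to midnight (a sketch; same table as above):
trino> select * from iceberg.glue_catalog.iceberg_table where trans_ts = timestamp'2019-06-13 00:00:00';
This should also return 0 rows, because 2019-06-13 17:22:30 is not equal to 2019-06-13 00:00:00.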
I think the issue is in Iceberg. I’m wrapping up a project, and then I’m planning to look into this. I suspect that somewhere, when we evaluate the predicate against the file statistics, we incorrectly decide to skip the file.
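One way to check the file-skipping theory is to inspect the per-file column bounds that pruning evaluates, for example via the Trino Iceberg connector's $files metadata table (a sketch; the exact column set varies by version):
trino> select file_path, record_count, lower_bounds, upper_bounds from iceberg.glue_catalog."iceberg_table$files";
If the file holding the row shows bounds covering 2019-06-13 17:22:30, the miss would come from predicate evaluation rather than from missing statistics.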