datafusion: Inconsistent Signedness Of Legacy Parquet Timestamps Written By Spark
Describe the bug
DF reads the Parquet timestamp datatype as nanoseconds from the Parquet file, whereas DuckDB and Spark treat the timestamp datatype as seconds.
To Reproduce
Create a Parquet file with the timestamp value -62125747200 and read it back.
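A minimal sketch of how such a file might be produced with Spark SQL (an assumption, not taken from the issue; Spark's default spark.sql.parquet.outputTimestampType is the legacy INT96 physical type, and the table/column names simply follow the output shown below):

-- Hypothetical reproduction in Spark SQL; timestamp_seconds() interprets its
-- argument as seconds since the Unix epoch, so this is 0001-04-25 00:00:00 UTC.
-- Depending on the Spark version, writing a pre-1582 timestamp may also require
-- setting spark.sql.parquet.int96RebaseModeInWrite (e.g. to CORRECTED).
CREATE TABLE test USING parquet AS
SELECT timestamp_seconds(-62125747200) AS a;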
DuckDB and Spark read the value correctly:
0001-04-25 00:00:00
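For example, the read-back can be cross-checked in DuckDB, which can query a Parquet file directly by path (a sketch; 'test.parquet' stands in for wherever the file above was actually written):

-- DuckDB reports the expected value for column a: 0001-04-25 00:00:00
SELECT * FROM 'test.parquet';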
but DF reads the timestamp as nanoseconds and gives the wrong answer:
❯ select * from test;
+-------------------------------+
| a                             |
+-------------------------------+
| 1754-12-22T22:43:41.128654848 |
+-------------------------------+
Expected behavior
Behavior should be the same: DF should read the value the way DuckDB and Spark do.
Additional context
No response
About this issue
- State: closed
- Created 8 months ago
- Comments: 17 (17 by maintainers)
Thanks @edmondop, please hold off on this. I'm fixing some timestamp literal issues, and we will have a discussion on timestamp precision in DF, so this ticket will depend on the actions above.
I agree with the conclusion in the Snowflake document: we should follow the specification and be consistent with other compliant implementations.
As an aside, this type has been deprecated for almost a decade; why is Spark still using it…
Looks like the analysis shows DF has a bunch of issues with the timestamp type:
select cast(1 as timestamp);
should be 1970-01-01T00:00:01,
not 1970-01-01T00:00:00.000000001.
1::timestamp
should be consistent with to_timestamp(1), which works correctly now.
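Putting those three forms side by side in datafusion-cli makes the inconsistency easy to see (a sketch; the column aliases are added here for readability, and the expected results are the ones stated above):

-- Per the comment above, only to_timestamp(1) currently yields 1970-01-01T00:00:01;
-- both cast forms treat the integer literal as nanoseconds rather than seconds.
SELECT CAST(1 AS TIMESTAMP) AS cast_syntax,
       1::timestamp         AS shorthand_cast,
       to_timestamp(1)      AS function_call;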
Parquet has a proper Timestamp field, correct? https://learn.microsoft.com/en-us/common-data-model/sdk/parquet-to-cdm-datatype
And that datatype supports a precision field which specifies millis/micros/nanos? So I presume this error only happens when reading an INT field and casting? If so, I'd 100% be on board with copying the behavior of these other well-known databases. Folks who stored their data properly as Timestamps shouldn't be affected. It would be a breaking change for those who just stored it as an INT, but I think now is the time to make that change since datetime support is relatively new… it will only get harder to change going forward.
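One way to check which case a given file falls into is to inspect its Parquet schema, for example with DuckDB's parquet_schema() table function (a sketch; the path is the illustrative one from the reproduction above, and the exact columns returned can vary by DuckDB version):

-- A properly typed column shows an INT64 physical type with a TIMESTAMP logical type
-- (carrying an isAdjustedToUTC flag and a MILLIS/MICROS/NANOS unit); legacy Spark
-- output shows INT96 with no logical type, which is the ambiguous case.
SELECT name, type, converted_type, logical_type
FROM parquet_schema('test.parquet');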