datafusion: Inconsistent Signedness Of Legacy Parquet Timestamps Written By Spark

Describe the bug

DataFusion (DF) reads the Parquet timestamp datatype as nanoseconds, whereas DuckDB and Spark treat the same datatype as seconds.

To Reproduce

Create a Parquet file with the timestamp value -62125747200 and read it back.
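For context, a possible write-side repro, as a sketch assuming PySpark 3.x (the config names are standard Spark options; the legacy INT96 physical type is what this ticket is about, and the rebase setting is needed for pre-1582 dates, though the exact mode may vary by Spark version):

from pyspark.sql import SparkSession

# Write the timestamp with Spark's legacy INT96 physical type, then read
# the resulting "test" directory back with datafusion-cli.
spark = (
    SparkSession.builder
    .config("spark.sql.session.timeZone", "UTC")
    .config("spark.sql.parquet.outputTimestampType", "INT96")      # legacy Spark layout
    .config("spark.sql.parquet.int96RebaseModeInWrite", "LEGACY")  # allow pre-1582 dates
    .getOrCreate()
)
spark.sql("SELECT timestamp'0001-04-25 00:00:00' AS a").write.parquet("test")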

DuckDB and Spark read the value correctly:

0001-04-25 00:00:00

but DataFusion reads the timestamp as nanoseconds and gives the wrong answer:

❯ select * from test;
+-------------------------------+
| a                             |
+-------------------------------+
| 1754-12-22T22:43:41.128654848 |
+-------------------------------+
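For what it's worth, the 1754 date is exactly what falls out if the stored seconds value is rescaled to nanoseconds with wrapping 64-bit arithmetic. This is a back-of-the-envelope check of where the wrong answer could come from, not a confirmed code path:

import datetime

secs = -62125747200                                # 0001-04-25 00:00:00 in seconds
wrapped = (secs * 10**9 + 2**63) % 2**64 - 2**63   # simulate i64 overflow of secs -> nanos
print(datetime.datetime(1970, 1, 1) + datetime.timedelta(microseconds=wrapped // 1000))
# 1754-12-22 22:43:41.128654 -- matches DataFusion's answer (truncated to micros)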

Expected behavior

The behavior should be the same as DuckDB and Spark.

Additional context

No response

About this issue

  • State: closed
  • Created 8 months ago
  • Comments: 17 (17 by maintainers)

Most upvoted comments

Thanks @edmondop, please hold off on this. I'm fixing some timestamp literal issues and we will have a discussion about timestamp precision in DF, so this ticket will depend on the actions above.

I agree with the conclusion in the Snowflake document: we should follow the specification and be consistent with other compliant implementations.

As an aside, this type has been deprecated for almost a decade; why is Spark still using it…

Looks like the analysis shows DF has a bunch of issues with the timestamp type:

  • select cast(1 as timestamp); should be 1970-01-01T00:00:01, not 1970-01-01T00:00:00.000000001 (see the sketch after this list)
  • 1::timestamp should be consistent with to_timestamp(1), which works correctly now
  • reading the Int96 timestamp type is not consistent with major engines (this ticket)
  • writing timestamps from COPY doesn't generate all the necessary Parquet metadata
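The first two bullets can be checked directly from the DataFusion Python bindings; a sketch (on an affected version the two columns below disagree, with cast treating 1 as nanoseconds and to_timestamp treating it as seconds):

from datafusion import SessionContext

# On an affected version, cast(1 as timestamp) returns
# 1970-01-01T00:00:00.000000001, while to_timestamp(1) returns the
# expected 1970-01-01T00:00:01.
ctx = SessionContext()
ctx.sql("select cast(1 as timestamp) as a, to_timestamp(1) as b").show()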

Parquet has a proper Timestamp field, correct? https://learn.microsoft.com/en-us/common-data-model/sdk/parquet-to-cdm-datatype

And that datatype supports a precision field which specifies millis/micros/nanos?
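For reference, a small pyarrow sketch (my example, not from the thread) showing that a column written with an explicit unit carries a full TIMESTAMP logical-type annotation in the file footer, unlike legacy INT96:

import pyarrow as pa
import pyarrow.parquet as pq

# -62125747200 seconds expressed as microseconds still fits easily in an int64.
table = pa.table({"a": pa.array([-62125747200 * 1_000_000], type=pa.timestamp("us"))})
pq.write_table(table, "test_annotated.parquet")

# The footer schema reports something like:
#   optional int64 a (Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, ...))
print(pq.read_metadata("test_annotated.parquet").schema)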

So I presume this error only happens when reading an INT field and casting? If so, I'd be 100% on board with copying the behavior of these other well-known databases. Folks who stored their data properly as timestamps shouldn't be affected. It would be a breaking change for those who just stored them as an INT, but I think now is the time to make that change since datetime support is relatively new… it will only get harder to change going forward.