duckdb: Missing timestamp components for EXTRACT

The EXTRACT method on timestamps (and date, time, timestamp_tz) are missing fields:

TIME:

  • subsecond

DATE:

  • iso_year
  • us_year
  • day_of_year
  • monday_day_of_week
  • sunday_day_of_week
  • monday_week
  • sunday_week
  • iso_week
  • us_week
  • unix_time

TIMESTAMP_TZ:

  • timezone_offset

all the fields missing from time or date are also missing from: TIMESTAMP_SEC TIMESTAMP_MS TIMESTAMP_US TIMESTAMP_NS

Descriptions that provide some more context for these fields/components:

* SUBSECOND Return number of microseconds since the last full second of the given timestamp.

* ISO_YEAR Return the ISO 8601 week-numbering year. First week of an ISO year has the majority (4 or more) of its
  days in January.
* US_YEAR Return US epidemiological year. First week of US epidemiological year has the majority (4 or more) of
  its days in January. Last week of US epidemiological year has the year's last Wednesday in it.
  US epidemiological week starts on Sunday.
* DAY_OF_YEAR Return the day of year (1 to 366). January 1st maps to day number 1, February 1st to 32, etc.
* MONDAY_DAY_OF_WEEK Return the day of week from Monday (1) to Sunday (7).
* SUNDAY_DAY_OF_WEEK Return the day of week from Sunday (1) to Saturday (7).
* MONDAY_WEEK Return week of year number (1-53). First week starts on first Monday of January.
* SUNDAY_WEEK Return week of year number (1-53). First week starts on first Sunday of January.
* ISO_WEEK Return ISO week of year number (1-53). First ISO calendar week has the majority (4 or more) of its
  days in January. ISO week starts on Monday.
* US_WEEK Return ISO-like week of year number (1-53). First US calendar week has the majority (4 or more) of its days in January. US week starts on Sunday.
* UNIX_TIME Return number of seconds that have elapsed since 1970-01-01 00:00:00 UTC, ignoring leap seconds.

* TIMEZONE_OFFSET Return number of seconds of timezone offset to UTC.

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 26 (10 by maintainers)

Most upvoted comments

To me making this change at the Subtrait consumer/producer level makes the most sense. The DuckDB substrait consumer/producer knows DuckDB’s and Substrait’s semantics, and can add a +1 or -1 as required as part of the translation. Or is there a requirement that a function maps 1:1 at that level?

Which means we have to transform this to dayofweek(x)+1 to comply to the substrait spec. Because of this, the result will be different between the substrait compiled query, and our original query.

I don’t quite follow why the result would be different- the producer would add a +1, and the consumer would add a -1. The end result of the query will then be the same as long as the consumer/producer both make the correct translation, no?

This stuff is usually handled in databases by sessions settings, which add another wrinkle to the substrait design: The same query can produce different results based on the user’s session settings. One common example is the default ordering of NULLs, which a number of RDBMSes - inducing DuckDB - implement. Weekday settings are another.

No existing function yet has been described to depend on session settings. However, there was some discussion about it at some point (I think in the proposal for random which didn’t make it through) so I won’t rule it out entirely.

However, in this case, session settings should not be a wrinkle. Once we clarify in Substrait whether these indices are 0-based or 1-based then there should be one answer that does not depend on session settings.

To me making this change at the Subtrait consumer/producer level makes the most sense. The DuckDB substrait consumer/producer knows DuckDB’s and Substrait’s semantics, and can add a +1 or -1 as required as part of the translation. Or is there a requirement that a function maps 1:1 at that level?

If DuckDb is the only consumer that uses 1-based indexing, given that DuckDb can already consume and produce Substrait, I think it would be fine to handle it within DuckDb. However, if there are other producers / consumers, then I think there is value in expressing the option. There’s pros & cons to both approaches.

  • If you keep Substrait’s option set limited then it is harder for consumers to adopt Substrait because they have to map everything to a single alternative dialect.
  • If you make Substrait’s option set very flexible then it is harder for producers to create plans that successfully run against consumers because the likelihood of incompatible options is high.

However, I think the goal of Substrait (and this is certainly open for discussion) is to lean towards a flexible option set with the hopes that translators eventually emerge which centralize all of these +1/-1 translations in a single point.

I’ve opened https://github.com/substrait-io/substrait/pull/479 to control indexing.

1-indexed encoding was chosen because that’s ISO8601 encoding. If you can argue that most systems are 0-indexed, then spec could still be changed (it’s early enough it wouldn’t be too disrupting). Or you could propose to add 0-indexed variants which would not be a breaking change.

MS-SQL, DB2, Snowflake, and Clickhouse all return 1-based weekdays (although some of them have an option to use 0).

Postgres is 0-based for non-ISO weekdays, but 1-based for ISO (we follow them). SQLite is 0-based, but it has to use strftime(%w).

Oracle doesn’t support weekday extraction.

There is also the issue of what the start of the week is set to (we use ICU, and it is configurable, but we only use it for ISO week arithmetic). It is worth mentioning that ICU is 1-based

Hey yes, this is related to substrait integration, I couldn’t find the functionality, but it’s good to know that we support it, just under a different name, that’s no problem and that glue should be provided by the substrait extension