dask: dask.dataframe cannot handle timestamps-as-objects, even though Pandas can

What happened:

Comparing a column of timestamps stored with dtype object against a datetime.datetime() instance blows up.

What you expected to happen:

In Pandas, the comparison works.
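
For illustration, a minimal Pandas-only sketch of the comparison that works (the values here are my own, chosen to mirror the MRE below):

from datetime import datetime

import numpy as np
import pandas as pd

# An object-dtype column holding millisecond timestamps, one of which
# doesn't fit in the datetime64[ns] range:
s = pd.Series(
    [np.datetime64("0001-01-01", "ms"), np.datetime64("3000-05-03", "ms")],
    dtype=object,
)

# Pandas applies the comparison elementwise, so this works:
print(s > datetime(2500, 1, 1))  # False, True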

Minimal Complete Verifiable Example:

This happens when using a fairly new PyArrow feature that allows loading timestamps as Python objects in order to handle wider date ranges.
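
For context, here is a sketch of what that PyArrow option does in isolation, before Dask is involved (my illustration, not part of the original report):

from datetime import datetime

import numpy as np
import pyarrow as pa

# Millisecond-precision timestamps; year 3000 is outside the roughly
# 1677-2262 range representable by datetime64[ns]:
table = pa.table(
    {"ts": pa.array([datetime(2012, 5, 2), datetime(3000, 5, 3)],
                    type=pa.timestamp("ms"))}
)

# With timestamp_as_object=True, the values come back as Python datetime
# objects in an object-dtype column rather than overflowing datetime64[ns]:
df = table.to_pandas(timestamp_as_object=True)
assert df["ts"].dtype == np.dtype("O")

The full MRE follows: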

import pyarrow as pa
from pyarrow import parquet as pq
import numpy as np
from dask import dataframe as dd
import pandas as pd
from pandas.testing import assert_frame_equal
from datetime import datetime


def make_df_with_timestamps():
    # Some of the milliseconds timestamps deliberately don't fit in the range
    # that is possible with nanosecond timestamps.
    df = pd.DataFrame(
        {
            "dateTimeMs": [
                np.datetime64("0001-01-01 00:00", "ms"),
                np.datetime64("2012-05-02 12:35", "ms"),
                np.datetime64("2012-05-03 15:42", "ms"),
                np.datetime64("3000-05-03 15:42", "ms"),
            ],
            "dateTimeNs": [
                np.datetime64("1991-01-01 00:00", "ns"),
                np.datetime64("2012-05-02 12:35", "ns"),
                np.datetime64("2012-05-03 15:42", "ns"),
                np.datetime64("2050-05-03 15:42", "ns"),
            ],
        }
    )
    # Not part of what we're testing, just ensuring that the inputs are what we
    # expect.
    assert (df.dateTimeMs.dtype, df.dateTimeNs.dtype) == (
        # O == object, <M8[ns] == datetime64[ns]
        np.dtype("O"),
        np.dtype("<M8[ns]"),
    )
    return df


def test_timestamp_as_object_out_of_range(tmpdir):
    # Out-of-range timestamps can be converted to Arrow and reloaded into a
    # Dask DataFrame with no loss of information if the timestamp_as_object
    # option is True.
    pandas_df = make_df_with_timestamps()
    table = pa.Table.from_pandas(pandas_df)
    filename = tmpdir / "timestamps_from_pandas.parquet"
    pq.write_table(table, filename, version="2.0")

    dask_df = dd.read_parquet(
        filename, engine="pyarrow", arrow_to_pandas={"timestamp_as_object": True}
    )

    # Check roundtripping:
    print(pandas_df)
    pandas_df2 = dask_df.compute()
    print(pandas_df2)
    assert (pandas_df2.dateTimeMs.dtype, pandas_df2.dateTimeNs.dtype) == (
        # O == object, <M8[ns] == datetime64[ns]
        np.dtype("O"),
        np.dtype("<M8[ns]"),
    )

    assert_frame_equal(pandas_df, pandas_df2)

    # Check operations:
    print("Dask metadata", dask_df.dateTimeMs.dtype, dask_df.dateTimeNs.dtype)

    # Comparing the nanoseconds datetime column works fine:
    pandas_df[pandas_df.dateTimeNs > datetime(2050, 1, 1, 0, 0, 0)]
    dask_df[dask_df.dateTimeNs > datetime(2050, 1, 1, 0, 0, 0)].compute()

    # Comparing the milliseconds datetime column blows up:
    pandas_df = pandas_df[pandas_df.dateTimeMs > datetime(2500, 1, 1, 0, 0, 0)]
    print("Filtered to years>2500", pandas_df)
    # This next line blows up:
    dask_df = dask_df[dask_df.dateTimeMs > datetime(2500, 1, 1, 0, 0, 0)]
    assert_frame_equal(pandas_df, dask_df.compute())

This blows up with TypeError("'>' not supported between instances of 'str' and 'datetime.datetime'") on the line doing the Dask comparison (the second-to-last line of the example).
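
The error message suggests the comparison is being run against strings somewhere, presumably in the metadata Dask generates for object-dtype columns. A minimal sketch of that failure mode (my guess at the mechanism, not verified against Dask internals):

from datetime import datetime

import pandas as pd

# Dask cannot know what an object-dtype column actually holds, so its
# sample/meta data for such a column is (presumably) built from
# placeholder strings:
meta_like = pd.Series(["a", "b"], dtype=object)

# Comparing placeholder strings to a datetime raises exactly the error above:
meta_like > datetime(2500, 1, 1)
# TypeError: '>' not supported between instances of 'str' and 'datetime.datetime'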

Anything else we need to know?:

I will attempt to provide a PR fixing this.

Environment:

This is Dask from git as of Feb 15, 2021, but the bug was originally reported to me by a user running an older, released version. Python 3.8 on Linux.

Package         Version              Location
--------------- -------------------- -------------------------
attrs           20.3.0
dask            2021.2.0+8.g02b466d2 /home/itamarst/Devel/dask
fsspec          0.8.5
iniconfig       1.1.1
locket          0.2.1
numpy           1.20.1
packaging       20.9
pandas          1.2.2
partd           1.1.0
pip             21.0.1
pluggy          0.13.1
py              1.10.0
pyarrow         3.0.0
pyparsing       2.4.7
pytest          6.2.2
python-dateutil 2.8.1
pytz            2021.1
PyYAML          5.4.1
setuptools      49.1.3
six             1.15.0
toml            0.10.2
toolz           0.11.1

Most upvoted comments

I am not planning on working on it; it turned out the users requesting it were trying to use this as a workaround, and solving their underlying issue was a better approach.

Interestingly, if I go back to the Arrow variant and do two things, the original MRE passes:

  1. Load with an additional meta=pandas_df.head() argument to read_parquet().
  2. Hardcode (as a temporary workaround) the scalar for dtype("O") to np.datetime64("3000-01-01", "ms") (np.datetime64("1970-01-01", "ms") somehow gets converted back to nanoseconds and doesn't work?!).

So this suggests there are two different places meta is coming from (maybe _meta vs. _meta_nonempty?).
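
A sketch of the first half of that workaround, continuing from the MRE above (taking the comment at face value; I haven't verified that read_parquet accepts a meta override in released Dask, so treat the meta= keyword as the commenter's description rather than established API):

# Supply the metadata explicitly instead of letting Dask infer placeholder
# values for the object-dtype column (meta= is hypothetical here, per the
# comment above):
dask_df = dd.read_parquet(
    filename,
    engine="pyarrow",
    arrow_to_pandas={"timestamp_as_object": True},
    meta=pandas_df.head(),
)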

Seems like a reasonable stopping point; I'll continue later in the week.

Sometime this week, hopefully.