pandas: BUG: DataFrame.groupby drops timedelta column in v1.3.0

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd
df = pd.DataFrame(
    {
        "duration": [pd.Timedelta(i, unit="days") for i in range(1, 6)],
        "value": [1, 0, 1, 2, 1],
    }
)

df.groupby("value").sum()

results in:

Empty DataFrame
Columns: []
Index: [0, 1, 2]

The same behaviour occurs with other aggregations, e.g. sum, mean and median.

Note the following gives close to the expected result:

df.groupby("value")["duration"].sum()
value
0   2 days
1   9 days
2   4 days
Name: duration, dtype: timedelta64[ns]
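
As a possible stopgap, the Series result above can be wrapped back into a one-column DataFrame. This is only a sketch of a workaround built on the column selection shown above, not a fix for the underlying bug; the name `workaround` is just for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "duration": [pd.Timedelta(i, unit="days") for i in range(1, 6)],
        "value": [1, 0, 1, 2, 1],
    }
)

# Select the timedelta column explicitly, aggregate it as a Series,
# then convert the result back into a DataFrame indexed by the group keys.
workaround = df.groupby("value")["duration"].sum().to_frame()
print(workaround)
```

This keeps the "duration" column because the aggregation runs on the selected Series rather than on the whole frame.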

Also, this problem may be unique to Timedeltas, as the following similar example with an integer column works as expected:

df = pd.DataFrame(
    {
        "duration": list(range(1, 6)),
        "value": [1, 0, 1, 2, 1],
    }
)

df.groupby("value").sum()

gives:

       duration
value
0             2
1             9
2             4

Problem description

In v1.3.0 (and on master, commit dad3e7), a timedelta column goes missing from the result of a groupby aggregation on a DataFrame. This behaviour does not occur in pandas 1.2.5.

Expected Output

The expected output is given by pandas 1.2.5:

      duration
value
0       2 days
1       9 days
2       4 days

It is a DataFrame with a single column, “duration”, indexed by the groupby keys.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : f00ed8f47020034e752baf0250483053340971b0
python : 3.7.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.3.0
numpy : 1.21.0
pytz : 2021.1
dateutil : 2.8.1
pip : 19.2.3
setuptools : 41.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 33 (33 by maintainers)

Most upvoted comments

+1 on defaulting numeric_only to False. To have code

df[["a", "b"]].groupby("a").sum()

not result in summing b is surprising.
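
To make the point about numeric_only concrete, here is a minimal sketch that passes the argument explicitly instead of relying on the default. It reuses the frame from the report; the name `explicit` is only for illustration. Whether this restores the column on 1.3.0 specifically is an assumption and is not confirmed in this thread; on versions where numeric_only=False is honoured for this aggregation, the timedelta column is summed along with everything else.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "duration": [pd.Timedelta(i, unit="days") for i in range(1, 6)],
        "value": [1, 0, 1, 2, 1],
    }
)

# Ask for all columns to be aggregated, not only the numeric ones,
# so the result does not depend on the version's default for numeric_only.
explicit = df.groupby("value").sum(numeric_only=False)
print(explicit)
```

Spelling out the argument also documents the intent for anyone reading the code later, regardless of which default a given release uses.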

I’m +1 on removing the warning for 1.3.x and only having a FutureWarning in 1.4 for the pending behavior change. @jbrockmendel - any thoughts here?

I’m fine with that.

I’m reviewing the cython -> python aggregation fallback behavior and specifically how 6b94e2431536eac4d943e55064d26deaac29c92b affects our test case. Once I understand the specific error sequence, and details like how numeric_only is shepherded, I should have a good idea of where to pursue a clean fix. If it’s just incidental that the cython code can’t process these timedelta aggregations, then that strikes me as the place to make a change, rather than restoring the catch-all fallback behavior.