pandas: BUG: DataFrame.groupby drops timedelta column in v1.3.0
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
import pandas as pd

df = pd.DataFrame(
    {
        "duration": [pd.Timedelta(i, unit="days") for i in range(1, 6)],
        "value": [1, 0, 1, 2, 1],
    }
)
df.groupby("value").sum()
results in:
Empty DataFrame
Columns: []
Index: [0, 1, 2]
The same behaviour occurs with sum, mean, median, etc.
Note that the following gives close to the expected result:
df.groupby("value")["duration"].sum()
value
0   2 days
1   9 days
2   4 days
Name: duration, dtype: timedelta64[ns]
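Building on the observation above that the per-column aggregation keeps the timedelta values, a possible workaround is to aggregate the selected column and convert the resulting Series back into a DataFrame. This is only a sketch; the variable name `workaround` is illustrative and not part of the original report.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "duration": [pd.Timedelta(i, unit="days") for i in range(1, 6)],
        "value": [1, 0, 1, 2, 1],
    }
)

# Aggregate through the single-column (SeriesGroupBy) path, which the
# report shows still returns the timedelta sums, then turn the Series
# back into a one-column DataFrame indexed by the groupby keys.
workaround = df.groupby("value")["duration"].sum().to_frame()
print(workaround)
```

The result has the same shape as the expected output: one "duration" column indexed by "value".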
Also, this problem may be unique to Timedeltas, as the following similar example with an integer column works as expected:
df = pd.DataFrame(
    {
        "duration": [i for i in range(1, 6)],
        "value": [1, 0, 1, 2, 1],
    }
)
df.groupby("value").sum()
gives:
       duration
value
0             2
1             9
2             4
Problem description
In v1.3.0 (and master, commit dad3e7), a timedelta column goes missing when a groupby aggregation is performed on a DataFrame. This behaviour does not occur in pandas 1.2.5.
Expected Output
The expected output is given by pandas 1.2.5:
      duration
value
0       2 days
1       9 days
2       4 days
It is a DataFrame with one column, "duration", indexed by the groupby keys.
Output of pd.show_versions()
INSTALLED VERSIONS
commit : f00ed8f47020034e752baf0250483053340971b0
python : 3.7.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None
pandas : 1.3.0
numpy : 1.21.0
pytz : 2021.1
dateutil : 2.8.1
pip : 19.2.3
setuptools : 41.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None
About this issue
- State: closed
- Created 3 years ago
- Comments: 33 (33 by maintainers)
Commits related to this issue
- code sample for #42395 — committed to simonjayhawkins/pandas by simonjayhawkins 3 years ago
+1 on defaulting numeric_only to False. To have code not result in summing b is surprising.
I’m fine with that.
I’m reviewing the cython -> python aggregation fallback behavior and specifically how 6b94e2431536eac4d943e55064d26deaac29c92b affects our test case. Once I understand the specific error sequence, and details like how numeric_only is shepherded, I should have a good idea of where to pursue a clean fix. If it’s just incidental that the cython code can’t process these timedelta aggregations, then that strikes me as the place to make a change, rather than restoring the catch-all fallback behavior.
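For anyone following the numeric_only discussion, here is a minimal sketch of passing the flag explicitly instead of relying on the default. It assumes that `numeric_only=False` tells the groupby aggregation to keep non-numeric columns such as timedeltas. That is how recent pandas releases behave; it has not been verified against 1.3.0 here, so treat it as an assumption rather than a confirmed workaround for the affected version.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "duration": [pd.Timedelta(i, unit="days") for i in range(1, 6)],
        "value": [1, 0, 1, 2, 1],
    }
)

# Assumption: an explicit numeric_only=False keeps the timedelta column
# in the aggregated frame (not verified on pandas 1.3.0).
result = df.groupby("value").sum(numeric_only=False)
print(result)
print(result.dtypes)
```

If the default were changed to False as proposed, the plain `df.groupby("value").sum()` call from the report would be expected to give the same frame.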