pandas: BUG: groupby transform doesn't respect Series index anymore
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'bar', 'foo', 'foo'],
    'C': [2.1, 1.9, 3.6, 4.0, 1.9, 7.8, 2.8]
}).convert_dtypes()
print(df)
A C
0 foo 2.1
1 bar 1.9
2 foo 3.6
3 bar 4.0
4 bar 1.9
5 foo 7.8
6 foo 2.8
# sort C within groups defined by A
df.groupby('A')['C'].transform(pd.Series.sort_values)
C
0 2.1
1 1.9
2 3.6
3 4.0
4 1.9
5 7.8
6 2.8
Issue Description
Suppose I have the following DataFrame
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'bar', 'foo', 'foo'],
    'C': [2.1, 1.9, 3.6, 4.0, 1.9, 7.8, 2.8]
}).convert_dtypes()
print(df)
A C
0 foo 2.1
1 bar 1.9
2 foo 3.6
3 bar 4.0
4 bar 1.9
5 foo 7.8
6 foo 2.8
I used to be able to sort C’s values within the groups defined by A like this:
df.groupby('A')['C'].transform(pd.Series.sort_values)
C
0 2.1
1 1.9
2 2.8
3 1.9
4 4.0
5 3.6
6 7.8
(This was sometime around pandas 1.0.0.)
However, pandas 1.4.0 produces the following:
df.groupby('A')['C'].transform(pd.Series.sort_values)
C
0 2.1
1 1.9
2 3.6
3 4.0
4 1.9
5 7.8
6 2.8
It’s as if the sort_values() function isn’t even being applied.
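One possible explanation, offered as an assumption rather than something confirmed in this report: if transform re-aligns each group's result to the group's original index, a function that only reorders rows gets undone, because sort_values keeps every value attached to its label. The sketch below shows that alignment effect on a standalone Series with made-up data; it does not call groupby at all.

```python
import pandas as pd

# Illustrative data only: one "group" with a non-monotonic index.
s = pd.Series([3.6, 2.1, 7.8], index=[2, 0, 5])

# sort_values reorders the rows, but each value keeps its index label.
sorted_s = s.sort_values()

# Re-aligning to the original index puts every value back where it started.
realigned = sorted_s.reindex(s.index)

print(sorted_s.index.tolist())  # labels are now in sorted-value order
print(realigned.equals(s))      # the sort has been undone by alignment
```

If this is what happens internally, the output would look exactly like the unsorted column above even though sort_values did run.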
Note that df.groupby('A')[['C']].transform(pd.Series.sort_values) works properly.
df.groupby('A')[['C']].transform(pd.Series.sort_values)
C
0 2.1
1 1.9
2 2.8
3 1.9
4 4.0
5 3.6
6 7.8
Expected Behavior
I would expect df.groupby('A')['C'].transform(pd.Series.sort_values) to actually sort C’s values within the groups defined by A (as it once did). My expected output in this example would be a Series like this:
C
0 2.1
1 1.9
2 2.8
3 1.9
4 4.0
5 3.6
6 7.8
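For anyone who needs this output regardless of how transform treats the returned index, here is a minimal sketch of a workaround. The helper name sort_within_group is hypothetical, and the example drops convert_dtypes() to keep plain float64 values, which is a simplification of the report's setup. The idea is to sort the values and then attach them to the group's original index labels in order, so positional and label-based handling give the same answer.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "bar", "foo", "foo"],
    "C": [2.1, 1.9, 3.6, 4.0, 1.9, 7.8, 2.8],
})


def sort_within_group(s: pd.Series) -> pd.Series:
    """Hypothetical helper: sorted values placed on the group's original labels."""
    # to_numpy() discards the labels carried by sort_values, so the sorted
    # values are assigned to the original index positions in order.
    return pd.Series(s.sort_values().to_numpy(), index=s.index)


result = df.groupby("A")["C"].transform(sort_within_group)
print(result)
```

Because the returned Series already carries the group's own index, any re-alignment that transform performs should leave the sorted order intact.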
Installed Versions
INSTALLED VERSIONS
commit : bb1f651536508cdfef8550f93ace7849b00046ee
python : 3.10.1.final.0
python-bits : 64
OS : Darwin
OS-release : 21.2.0
Version : Darwin Kernel Version 21.2.0: Sun Nov 28 20:28:54 PST 2021; root:xnu-8019.61.5~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8
pandas : 1.4.0
numpy : 1.22.1
pytz : 2021.3
dateutil : 2.8.2
pip : 21.1.2
setuptools : 57.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
About this issue
- State: closed
- Created 2 years ago
- Comments: 15 (11 by maintainers)
Commits related to this issue
- code sample for #45648 — committed to simonjayhawkins/pandas by simonjayhawkins 2 years ago
Yes.
Clarifying my original comment: “In general I expect .transform to return an object indexed the same as the original.”
When I think of a “transform”, the prototypical examples that come to mind are shift, diff, cumsum. But the sort_values example you give is a fine example where that intuition breaks down.
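To make the "prototypical" case concrete, here is a small sketch with made-up integer data showing that shift, diff, and cumsum on a grouped column each return a result indexed like the original. Every output row corresponds to the input row with the same label, which is the intuition that sort_values breaks.

```python
import pandas as pd

# Illustrative data only, not taken from the report.
df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"], "C": [1, 10, 2, 20]})
grouped = df.groupby("A")["C"]

shifted = grouped.shift()   # previous value within each group
diffed = grouped.diff()     # difference from the previous value in the group
summed = grouped.cumsum()   # running total within each group

# Each result keeps the original row labels and order.
for name, out in [("shift", shifted), ("diff", diffed), ("cumsum", summed)]:
    print(name, out.index.equals(df.index), out.tolist())
```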
AFAICT this is a Hard Problem ™ that @rhshadrach is on top of. Which is good, because I don’t have any bright ideas.