pandas: BUG: groupby transform doesn't respect Series index anymore

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'bar', 'foo', 'foo'],
    'C': [2.1, 1.9, 3.6, 4.0, 1.9, 7.8, 2.8]
}).convert_dtypes()

print(df)
     A    C
0  foo  2.1
1  bar  1.9
2  foo  3.6
3  bar  4.0
4  bar  1.9
5  foo  7.8
6  foo  2.8

# sort C within groups defined by A
df.groupby('A')['C'].transform(pd.Series.sort_values)
     C
0  2.1
1  1.9
2  3.6
3  4.0
4  1.9
5  7.8
6  2.8

Issue Description

Suppose I have the following DataFrame:

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'bar', 'foo', 'foo'],
    'C': [2.1, 1.9, 3.6, 4.0, 1.9, 7.8, 2.8]
}).convert_dtypes()

print(df)
     A    C
0  foo  2.1
1  bar  1.9
2  foo  3.6
3  bar  4.0
4  bar  1.9
5  foo  7.8
6  foo  2.8

I used to be able to sort C’s values within the groups defined by A like this:

df.groupby('A')['C'].transform(pd.Series.sort_values)
     C
0  2.1
1  1.9
2  2.8
3  1.9
4  4.0
5  3.6
6  7.8

(This was sometime around Pandas 1.0.0)

However, pandas 1.4.0 produces the following:

df.groupby('A')['C'].transform(pd.Series.sort_values)
     C
0  2.1
1  1.9
2  3.6
3  4.0
4  1.9
5  7.8
6  2.8

It’s as if the sort_values() function isn’t even being applied.

Note that df.groupby('A')[['C']].transform(pd.Series.sort_values) works properly.

df.groupby('A')[['C']].transform(pd.Series.sort_values)
     C
0  2.1
1  1.9
2  2.8
3  1.9
4  4.0
5  3.6
6  7.8
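One way to picture why the sort can appear to vanish: sort_values moves each value together with its index label, so any later step that aligns on labels puts the values back where they started. The sketch below shows only that alignment effect, on a single group, using the Series constructor. Whether transform does exactly this internally in 1.4.0 is an assumption, not something established in this report. It uses plain float64 data (no convert_dtypes()) to keep the example minimal.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "bar", "foo", "foo"],
    "C": [2.1, 1.9, 3.6, 4.0, 1.9, 7.8, 2.8],
})

# Take one group and sort it: the values move, but each keeps its label.
foo = df.loc[df["A"] == "foo", "C"]
foo_sorted = foo.sort_values()

# Building a Series from another Series plus an explicit index reindexes
# by label, which puts every value back in its original slot.
realigned = pd.Series(foo_sorted, index=foo.index)

# Passing bare values instead assigns them by position, so the sorted
# order survives.
positional = pd.Series(foo_sorted.to_numpy(), index=foo.index)

print(realigned.tolist())
print(positional.tolist())
```

If label alignment is what happens per group, the first result explains the "nothing was sorted" output and the second matches the older behavior.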

Expected Behavior

I would expect df.groupby('A')['C'].transform(pd.Series.sort_values) to actually sort C’s values within the groups defined by A (as it once did). My expected output in this example would be a Series holding the values 2.1, 1.9, 2.8, 1.9, 4.0, 3.6, 7.8 at index 0 through 6.
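A possible workaround, offered as a sketch rather than a confirmed fix: return bare values from the callable so that there are no index labels to align on. The helper name sort_positionally is hypothetical, and the example assumes plain float64 data (no convert_dtypes()).

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "bar", "foo", "foo"],
    "C": [2.1, 1.9, 3.6, 4.0, 1.9, 7.8, 2.8],
})

def sort_positionally(group):
    # Hypothetical helper: sort the group's values, then drop the index
    # by converting to a NumPy array so placement is purely positional.
    return group.sort_values().to_numpy()

result = df.groupby("A")["C"].transform(sort_positionally)
print(result)
```

The result keeps the original index 0 through 6 while the values are sorted within each group's own positions.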

Installed Versions

INSTALLED VERSIONS

commit : bb1f651536508cdfef8550f93ace7849b00046ee
python : 3.10.1.final.0
python-bits : 64
OS : Darwin
OS-release : 21.2.0
Version : Darwin Kernel Version 21.2.0: Sun Nov 28 20:28:54 PST 2021; root:xnu-8019.61.5~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8
pandas : 1.4.0
numpy : 1.22.1
pytz : 2021.3
dateutil : 2.8.2
pip : 21.1.2
setuptools : 57.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

About this issue

State: closed
Created 2 years ago
Comments: 15 (11 by maintainers)

Commits related to this issue

code sample for #45648 — committed to simonjayhawkins/pandas by simonjayhawkins 2 years ago

Most upvoted comments

Yes.

rhshadrach on Jan 30, 2022

Can you clarify what you mean by “.transform should return an object indexed the same as the original”? For example, given this DataFrame […]

Clarifying my original comment “In general I expect .transform to return an object indexed the same as the original”

When I think of a “transform”, the prototypical examples that come to mind are shift, diff, cumsum. But the sort_values example you give is a fine example where that intuition breaks down.
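The contrast drawn here can be made concrete with a small sketch. The data and group keys below are made up for illustration: cumsum hands back a result whose index matches the input, while sort_values on a single group returns that group's labels in a different order.

```python
import pandas as pd

# Illustrative data only; not taken from the report.
s = pd.Series([3.6, 1.9, 2.1, 4.0], index=[0, 1, 2, 3])
keys = ["foo", "bar", "foo", "bar"]

# A prototypical transform: the output is indexed exactly like the input.
cum = s.groupby(keys).transform("cumsum")
print(cum.index.tolist())

# sort_values on one group keeps the same labels but permutes them,
# so "indexed the same as the original" no longer has one obvious meaning.
foo = s.loc[[0, 2]]
print(foo.sort_values().index.tolist())
```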

AFAICT this is a Hard Problem ™ that @rhshadrach is on top of. Which is good, because I don’t have any bright ideas.

jbrockmendel on Feb 19, 2022