pandas: "Function does not reduce" for multiindex, but works fine for single index.
When I use pd.Series.tolist as a reducer with a single-column groupby, it works. When I do the same with a multi-column groupby (which produces a MultiIndex), it does not.
It seems the “fast” cython groupby function, which has no quarrel with reducing into lists, throws an exception if the index is “complex”, which seems to mean a MultiIndex. When that exception is caught, the groupby function falls back to the “pure_python” groupby, which throws a new exception if the reducing function returns a list.
Is this a bug or is there some logic to this which is not apparent to me?
Reproduce:
import numpy as np
import pandas as pd

# Build an 11-row frame of random values, one row at a time.
s1 = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
df = pd.DataFrame([s1], columns=['a', 'b', 'c', 'd', 'e'])
for i in range(0, 10):
    s1 = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
    df2 = pd.DataFrame([s1], columns=['a', 'b', 'c', 'd', 'e'])
    df = pd.concat([df, df2])
df['gk'] = 'foo'
df['gk2'] = 'bar'
# This works.
df.groupby(['gk']).agg(pd.Series.tolist)
# This does not.
df.groupby(['gk', 'gk2']).agg(pd.Series.tolist)
About this issue
- State: closed
- Created 11 years ago
- Comments: 17 (9 by maintainers)
How does using tolist save you any data? It's the same data, just in a list, and comparisons are then hard.

I think you can use one of these:
df.groupby(keys).apply(lambda x: x._get_numeric_data().abs().sum())
or another function that effectively hashes a row together.

df.groupby(['gk','gk2']).agg(lambda x: tuple(x.tolist()))
will do what you want with the multi-indexes (or single index); as a tuple it is inferred as a reduction.

Yes, that’s the output I want. Assuming I sort by [‘e’,‘a’] first, this is probably faster, too. It is still very convenient to be able to create list values.
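For reference, here is a minimal, self-contained sketch of the tuple workaround suggested above. The frame contents are made-up illustration data, not the data from the original report, and whether a plain list reducer raises depends on the pandas version in use.

```python
import pandas as pd

# Small frame with two grouping keys; the values are arbitrary illustration data.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [4.0, 5.0, 6.0],
    "gk": ["foo", "foo", "foo"],
    "gk2": ["bar", "bar", "bar"],
})

# Collapse each group's column into one tuple instead of a list.
result = df.groupby(["gk", "gk2"]).agg(lambda x: tuple(x.tolist()))

# Read one cell back by position; the cell itself is a tuple of the group's values.
cell = result["a"].iloc[0]
print(type(cell).__name__, cell)
```

If list values are really needed downstream, one option is to convert afterwards, for example with result["a"].map(list), once the aggregation itself has succeeded.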
Looking at the history of this restriction, it looks like it was accidentally introduced by a transcription error in f3c0a081e2cfc8e073f8461cac5c242d0e4219d0 - at the time, it was an assertion, and it went from
where, as far as I can tell, dummy is uninitialized, to

The original assertion was added without comment in 71e9046c52246535d4db1f350e82c3a84d748f88, in response to #612 “Pure python multi-key groupby can’t handle non-numeric results”. This reveals another oddity: groupby().agg(pd.Series.tolist) works fine for single-key groupings; it only fails for multi-key groupings.