pandas: "Function does not reduce" for multiindex, but works fine for single index.

When I use pd.Series.tolist as a reducer with a single column groupby, it works. When I do the same with multiindex, it does not.

It seems the “fast” cython groupby function, which has no quarrel with reducing into lists, throws an exception if the index is “complex”, which seems to mean multiindex. When that exception is caught, the groupby function falls back to the “pure_python” groupby, which throws a new exception if the reducing function returns a list.

Is this a bug or is there some logic to this which is not apparent to me?

Reproduce:

import pandas as pd
from numpy.random import randn  # randn is used below but was never imported
s1 = pd.Series(randn(5), index=['a', 'b', 'c', 'd', 'e'])
df = pd.DataFrame([s1], columns=['a', 'b', 'c', 'd', 'e'])
for i in range(0,10):
    s1 = pd.Series(randn(5), index=['a', 'b', 'c', 'd', 'e'])
    df2 = pd.DataFrame([s1], columns=['a', 'b', 'c', 'd', 'e'])
    df = pd.concat([df, df2])
df['gk'] = 'foo'
df['gk2'] = 'bar'

# This works.
df.groupby(['gk']).agg(pd.Series.tolist)

# This does not.
df.groupby(['gk', 'gk2']).agg(pd.Series.tolist)

About this issue

  • State: closed
  • Created 11 years ago
  • Comments: 17 (9 by maintainers)

Most upvoted comments

How does using tolist save you any data? It's the same data, just in a list, and comparisons are then hard.

I think you can use one of these:

  • df.groupby(keys).apply(lambda x: x._get_numeric_data().abs().sum()) or another function that effectively hashes a row together
  • df.groupby(['gk','gk2']).agg(lambda x: tuple(x.tolist())) will do what you want with the multi-indexes (or single index); as a tuple it is inferred as a reduction
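The tuple workaround from the second bullet can be sketched as follows. This is a minimal example on a small made-up frame (the values and the extra 'baz' and 'qux' keys are assumptions for illustration, not data from the issue), and it has not been checked against the pandas version the issue was filed on.

```python
import pandas as pd

# Small illustrative frame; the values are made up for this sketch.
df = pd.DataFrame({
    "gk": ["foo", "foo", "foo", "baz"],
    "gk2": ["bar", "bar", "qux", "bar"],
    "a": [1.0, 2.0, 3.0, 4.0],
})

# Collect each group's values into a tuple instead of a list, so the
# result of the aggregating function is treated as a single value.
result = df.groupby(["gk", "gk2"]).agg(lambda x: tuple(x.tolist()))
print(result)
```

Each cell of the result holds one tuple per group and column. If list values are needed afterwards, the tuples can be converted in a separate step.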

Yes, that’s the output I want. Assuming I sort by ['e', 'a'] first, this is probably faster, too. It is still very convenient to be able to create list values.

Looking at the history of this restriction, it looks like it was accidentally introduced by a transcription error in f3c0a081e2cfc8e073f8461cac5c242d0e4219d0 - at the time, it was an assertion, and it went from

assert(not (isinstance(res, list) and len(res) == len(self.dummy)))

where, as far as I can tell, dummy is uninitialized, to

assert(not isinstance(res, list))

The original assertion was added without comment in 71e9046c52246535d4db1f350e82c3a84d748f88, in response to #612 “Pure python multi-key groupby can’t handle non-numeric results”. Which reveals another oddity: groupby().agg(pd.Series.tolist) works fine for single-key groupings; it only fails for multi-key groupings.

>>> eav.groupby(['attributeName']).agg(pd.Series.tolist)
                      recordId  \
attributeName                    
author         [1, 1, 1, 1, 1]   
title                      [1]   

                                                           value  
attributeName                                                     
author         [Foto N. Afrati, Vinayak Borkar, Michael Carey...  
title              [Map-Reduce Extensions and Recursive Queries]  
>>> eav.groupby(['recordId', 'attributeName']).agg(pd.Series.tolist)
Traceback (most recent call last):
  File "/Users/nik/anaconda/envs/python3/lib/python3.5/site-packages/pandas/core/groupby.py", line 1863, in agg_series
    return self._aggregate_series_fast(obj, func)
  File "/Users/nik/anaconda/envs/python3/lib/python3.5/site-packages/pandas/core/groupby.py", line 1868, in _aggregate_series_fast
    func = self._is_builtin_func(func)
AttributeError: 'BaseGrouper' object has no attribute '_is_builtin_func'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/nik/anaconda/envs/python3/lib/python3.5/site-packages/pandas/core/groupby.py", line 3597, in aggregate
    return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
  File "/Users/nik/anaconda/envs/python3/lib/python3.5/site-packages/pandas/core/groupby.py", line 3122, in aggregate
    return self._python_agg_general(arg, *args, **kwargs)
  File "/Users/nik/anaconda/envs/python3/lib/python3.5/site-packages/pandas/core/groupby.py", line 777, in _python_agg_general
    result, counts = self.grouper.agg_series(obj, f)
  File "/Users/nik/anaconda/envs/python3/lib/python3.5/site-packages/pandas/core/groupby.py", line 1865, in agg_series
    return self._aggregate_series_pure_python(obj, func)
  File "/Users/nik/anaconda/envs/python3/lib/python3.5/site-packages/pandas/core/groupby.py", line 1899, in _aggregate_series_pure_python
    raise ValueError('Function does not reduce')
ValueError: Function does not reduce
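To make the difference between the two assertions quoted earlier concrete, here is a toy model of each check. It is only a sketch: the function names are hypothetical, and dummy_len is a stand-in for len(self.dummy), whose actual value inside pandas is not established in this thread.

```python
def original_check(res, dummy_len):
    # Models: assert(not (isinstance(res, list) and len(res) == len(self.dummy)))
    # dummy_len is a hypothetical stand-in for len(self.dummy).
    if isinstance(res, list) and len(res) == dummy_len:
        raise AssertionError("list result with the same length as dummy")
    return res


def later_check(res):
    # Models: assert(not isinstance(res, list))
    if isinstance(res, list):
        raise AssertionError("any list result is rejected")
    return res


def rejects(check, *args):
    # Helper: report whether a check raises for the given arguments.
    try:
        check(*args)
    except AssertionError:
        return True
    return False


values = [1, 2, 3]
print(rejects(original_check, values, 5))  # length differs from dummy_len
print(rejects(original_check, values, 3))  # length equals dummy_len
print(rejects(later_check, values))        # list of any length
```

Under this model the earlier form only rejects a list whose length happens to match dummy, while the later form rejects every list, which is consistent with pd.Series.tolist failing on the pure-Python path.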