pandas: Time-based .rolling() fails with .groupby()

Starting with this example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.arange(40)})

I can easily compute the rolling mean by identifier:

In [20]: df.groupby('A').rolling(4).B.mean()
Out[20]:
A
1  0      NaN
   1      NaN
   2      NaN
   3      1.5
   4      2.5
   5      3.5
   6      4.5
   7      5.5
   8      6.5
   9      7.5
         ...
2  30    28.5
   31    29.5
3  32     NaN
   33     NaN
   34     NaN
   35    33.5
   36    34.5
   37    35.5
   38    36.5
   39    37.5
Name: B, dtype: float64

Now I want to add a timestamp column:

dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.concatenate((dates, dates)),
                   'C': np.arange(40)})

The timestamps are ordered within each identifier, but pandas complains:

In [25]: df.groupby('A').rolling('4s', on='B').C.mean()
...
ValueError: B must be monotonic
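The failure can be checked directly (a quick demonstration, not part of the original report): `B` is monotonic within each group, but the validation apparently inspects the full column, which is not monotonic because the timestamps restart at each new identifier.

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.concatenate((dates, dates)),
                   'C': np.arange(40)})

# B is monotonic inside every group...
print(df.groupby('A').B.apply(lambda s: s.is_monotonic_increasing).all())  # True
# ...but not across the whole column, since timestamps restart at row 20.
print(df.B.is_monotonic_increasing)  # False
```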

Re-sorting leads to a different error:

In [26]: df.sort_values('B', inplace=True)

In [27]: df.groupby('A').rolling('4s', on='B').C.mean()
...
ValueError: invalid on specified as B, must be a column (if DataFrame) or None

But we know that these column names are valid:

In [28]: df.rolling('4s', on='B').C.mean()
Out[28]:
0      0.000000
20    10.000000
1      7.000000
21    10.500000
2      8.800000
22    11.000000
3      9.857143
23    11.500000
4     10.857143
24    12.500000
        ...
35    24.714286
15    23.500000
36    25.714286
16    24.500000
37    26.714286
17    25.500000
38    27.714286
18    26.500000
19    25.857143
39    27.500000
Name: C, dtype: float64

It seems like a bug that time-based .rolling() does not work with .groupby().
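Until this is fixed, one workaround (a sketch using `GroupBy.apply`, not anything from the original report) is to run the time-based window inside each group, where `B` is monotonic:

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.concatenate((dates, dates)),
                   'C': np.arange(40)})

# Each group is passed to a plain DataFrame.rolling, so the monotonicity
# check only ever sees one group's timestamps at a time.
result = df.groupby('A').apply(lambda g: g.rolling('4s', on='B').C.mean())
```

The result is a Series with a MultiIndex of (identifier, original row label), matching the shape of the non-time-based `groupby(...).rolling(4)` output above.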

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 19 (7 by maintainers)

Most upvoted comments

I stumbled on this yesterday as I was trying to solve the same problem.

I found a workaround, it’s definitely not efficient, but it works.

import pandas as pd
import numpy as np

%load_ext watermark
%watermark -v -m -p pandas,numpy
CPython 3.5.1
IPython 4.2.0

pandas 0.19.2
numpy 1.11.0

compiler   : MSC v.1900 64 bit (AMD64)
system     : Windows
release    : 7
machine    : AMD64
processor  : Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
CPU cores  : 8
interpreter: 64bit

# load up the example dataframe
dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.concatenate((dates, dates)),
                   'C': np.arange(40)})

# sort on the datetime column
df.sort_values('B', inplace=True)

#group it
grouped = df.groupby('A')

# tmp array to hold frames
frames = []

for group in grouped.groups:
    # copy so adding the column doesn't trigger SettingWithCopyWarning
    frame = grouped.get_group(group).copy()
    frame['avg'] = frame.rolling('4s', on='B').C.mean()
    frames.append(frame)

pd.concat(frames)
Out[18]: 
    A                   B   C   avg
0   1 2016-01-01 09:30:00   0   0.0
1   1 2016-01-01 09:30:01   1   0.5
2   1 2016-01-01 09:30:02   2   1.0
3   1 2016-01-01 09:30:03   3   1.5
4   1 2016-01-01 09:30:04   4   2.5
5   1 2016-01-01 09:30:05   5   3.5
6   1 2016-01-01 09:30:06   6   4.5
7   1 2016-01-01 09:30:07   7   5.5
8   1 2016-01-01 09:30:08   8   6.5
9   1 2016-01-01 09:30:09   9   7.5
10  1 2016-01-01 09:30:10  10   8.5
11  1 2016-01-01 09:30:11  11   9.5
12  1 2016-01-01 09:30:12  12  10.5
13  1 2016-01-01 09:30:13  13  11.5
14  1 2016-01-01 09:30:14  14  12.5
15  1 2016-01-01 09:30:15  15  13.5
16  1 2016-01-01 09:30:16  16  14.5
17  1 2016-01-01 09:30:17  17  15.5
18  1 2016-01-01 09:30:18  18  16.5
19  1 2016-01-01 09:30:19  19  17.5
20  2 2016-01-01 09:30:00  20  20.0
21  2 2016-01-01 09:30:01  21  20.5
22  2 2016-01-01 09:30:02  22  21.0
23  2 2016-01-01 09:30:03  23  21.5
24  2 2016-01-01 09:30:04  24  22.5
25  2 2016-01-01 09:30:05  25  23.5
26  2 2016-01-01 09:30:06  26  24.5
27  2 2016-01-01 09:30:07  27  25.5
28  2 2016-01-01 09:30:08  28  26.5
29  2 2016-01-01 09:30:09  29  27.5
30  2 2016-01-01 09:30:10  30  28.5
31  2 2016-01-01 09:30:11  31  29.5
32  3 2016-01-01 09:30:12  32  32.0
33  3 2016-01-01 09:30:13  33  32.5
34  3 2016-01-01 09:30:14  34  33.0
35  3 2016-01-01 09:30:15  35  33.5
36  3 2016-01-01 09:30:16  36  34.5
37  3 2016-01-01 09:30:17  37  35.5
38  3 2016-01-01 09:30:18  38  36.5
39  3 2016-01-01 09:30:19  39  37.5

Hope this helps anyone in the meantime before a bug fix is provided. I haven’t contributed to pandas yet, but having used it so much, maybe it’s about time 😃

pandas 0.21.0

This works OK as above:

df.groupby('A').rolling('4s', on='B').C.mean()

But this doesn't:

df.groupby('A').rolling('4s', on='B', closed='left').C.mean()

which gives the error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/window.py", line 176, in __getattr__
    return self[attr]
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/base.py", line 255, in __getitem__
    return self._gotitem(key, ndim=2)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/window.py", line 1322, in _gotitem
    return super(RollingGroupby, self)._gotitem(key, ndim, subset=subset)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/base.py", line 678, in _gotitem
    **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/window.py", line 697, in __init__
    super(GroupByMixin, self).__init__(obj, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/window.py", line 86, in __init__
    self.validate()
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/window.py", line 1124, in validate
    raise ValueError("closed only implemented for datetimelike "
ValueError: closed only implemented for datetimelike and offset based windows
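As a stopgap (a sketch, not a confirmed fix), the `closed='left'` window can be computed per group with `GroupBy.apply`, since each group is then handled by a plain `DataFrame.rolling`, where `closed` is supported:

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.concatenate((dates, dates)),
                   'C': np.arange(40)})

# closed='left' makes the window [t-4s, t), excluding the current row,
# so the first row of each group has an empty window (NaN).
left = df.groupby('A').apply(
    lambda g: g.rolling('4s', on='B', closed='left').C.mean())
```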