pandas: Time-based .rolling() fails with .groupby()
Starting with this example:
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
'B': np.arange(40)})
I can easily compute the rolling mean by identifier:
In [20]: df.groupby('A').rolling(4).B.mean()
Out[20]:
A
1 0 NaN
1 NaN
2 NaN
3 1.5
4 2.5
5 3.5
6 4.5
7 5.5
8 6.5
9 7.5
...
2 30 28.5
31 29.5
3 32 NaN
33 NaN
34 NaN
35 33.5
36 34.5
37 35.5
38 36.5
39 37.5
Name: B, dtype: float64
Now I want to add a timestamp column:
dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
'B': np.concatenate((dates, dates)),
'C': np.arange(40)})
The timestamps are ordered within each identifier, but pandas complains:
In [25]: df.groupby('A').rolling('4s', on='B').C.mean()
...
ValueError: B must be monotonic
Re-sorting leads to a different error:
In [26]: df.sort_values('B', inplace=True)
In [27]: df.groupby('A').rolling('4s', on='B').C.mean()
...
ValueError: invalid on specified as B, must be a column (if DataFrame) or None
But we know that these column names are valid:
n [28]: df.rolling('4s', on='B').C.mean()
Out[28]:
0 0.000000
20 10.000000
1 7.000000
21 10.500000
2 8.800000
22 11.000000
3 9.857143
23 11.500000
4 10.857143
24 12.500000
...
35 24.714286
15 23.500000
36 25.714286
16 24.500000
37 26.714286
17 25.500000
38 27.714286
18 26.500000
19 25.857143
39 27.500000
Name: C, dtype: float64
It seems like a bug that time-based .rolling() does not work with .groupby().
About this issue
- Original URL
- State: closed
- Created 8 years ago
- Comments: 19 (7 by maintainers)
Commits related to this issue
- new procedure: windows related issue: #25 Note: there is a bug using groupby with rolling on specific column for now, so we are not using the `on` parameter in rolling. https://github.com/pandas-dev... — committed to semio/ddf_utils by semio 8 years ago
- BUG: groupby-rolling with a timedelta closes #13966 xref to #15130, closed by #15175 — committed to jreback/pandas by jreback 7 years ago
- BUG: groupby-rolling with a timedelta closes #13966 xref to #15130, closed by #15175 — committed to jreback/pandas by jreback 7 years ago
- BUG: groupby-rolling with a timedelta (#16091) closes #13966 xref to #15130, closed by #15175 — committed to pandas-dev/pandas by jreback 7 years ago
- BUG: groupby-rolling with a timedelta (#16091) closes #13966 xref to #15130, closed by #15175 — committed to pcluo/pandas by jreback 7 years ago
I stumbled on this yesterday as I was trying to solve the same problem.
I found a workaround, it’s definitely not efficient, but it works.
Hope this helps anyone in the meantime before a bug fix is provided. I haven’t contributed to pandas yet, but having used it so much, maybe it’s about time 😃
pandas 0.21.0 This works ok as above: df.groupby(‘A’).rolling(‘4s’, on=‘B’).C.mean() But this doesn’t: df.groupby(‘A’).rolling(‘4s’, on=‘B’,closed=‘left’).C.mean() Gives error:
Traceback (most recent call last): File “<stdin>”, line 1, in <module> File “/usr/local/lib/python2.7/dist-packages/pandas/core/window.py”, line 176, in getattr return self[attr] File “/usr/local/lib/python2.7/dist-packages/pandas/core/base.py”, line 255, in getitem return self._gotitem(key, ndim=2) File “/usr/local/lib/python2.7/dist-packages/pandas/core/window.py”, line 1322, in _gotitem return super(RollingGroupby, self)._gotitem(key, ndim, subset=subset) File “/usr/local/lib/python2.7/dist-packages/pandas/core/base.py”, line 678, in _gotitem **kwargs) File “/usr/local/lib/python2.7/dist-packages/pandas/core/window.py”, line 697, in init super(GroupByMixin, self).init(obj, *args, **kwargs) File “/usr/local/lib/python2.7/dist-packages/pandas/core/window.py”, line 86, in init self.validate() File “/usr/local/lib/python2.7/dist-packages/pandas/core/window.py”, line 1124, in validate raise ValueError("closed only implemented for datetimelike " ValueError: closed only implemented for datetimelike and offset based windows