pandas: BUG: resample inconsistency with closed left/right
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
import numpy as np
import pandas as pd
t = pd.date_range("2011-01-01T00:00:00", "2011-02-01T00:00:00", freq="10T")
y = np.ones(len(t))
df = pd.DataFrame(data=y, index=t, columns=['y'])
df
# explicitly closed-left (note: the documented default for '1M' is closed='right')
resampled_df = df.resample('1M', closed='left', label='right').sum()
resampled_df
# bin edges are last day of month @ start of day
# this is why the whole day of 2011-01-31 has been included in the second bin
resampled_df.index[0]
# now with closed-right
# I am expecting 1 data point (precisely on the bin edge @ 2011-01-31 00:00:00) to shift into the first bin
resampled_df = df.resample('1M', closed='right', label='right').sum()
resampled_df
# now the first bin includes the whole day of 2011-01-31
# despite the fact that the bin edge is 2011-01-31 00:00:00
resampled_df.index[0]
executed…
>>> import numpy as np
>>> import pandas as pd
>>>
>>> t = pd.date_range("2011-01-01T00:00:00", "2011-02-01T00:00:00", freq="10T")
>>> y = np.ones(len(t))
>>> df = pd.DataFrame(data=y, index=t, columns=['y'])
>>>
>>> df
y
2011-01-01 00:00:00 1.0
2011-01-01 00:10:00 1.0
2011-01-01 00:20:00 1.0
2011-01-01 00:30:00 1.0
2011-01-01 00:40:00 1.0
... ...
2011-01-31 23:20:00 1.0
2011-01-31 23:30:00 1.0
2011-01-31 23:40:00 1.0
2011-01-31 23:50:00 1.0
2011-02-01 00:00:00 1.0
[4465 rows x 1 columns]
>>>
>>> # explicitly closed-left (note: the documented default for '1M' is closed='right')
>>> resampled_df = df.resample('1M', closed='left', label='right').sum()
>>>
>>> resampled_df
y
2011-01-31 4320.0
2011-02-28 145.0
>>>
>>> # bin edges are last day of month @ start of day
>>> # this is why the whole day of 2011-01-31 has been included in the second bin
>>> resampled_df.index[0]
Timestamp('2011-01-31 00:00:00', freq='M')
>>>
>>> # now with closed-right
>>> # I am expecting 1 data point (precisely on the bin edge @ 2011-01-31 00:00:00) to shift into the first bin
>>> resampled_df = df.resample('1M', closed='right', label='right').sum()
>>>
>>> resampled_df
y
2011-01-31 4464.0
2011-02-28 1.0
>>>
>>> # now the first bin includes the whole day of 2011-01-31
>>> # despite the fact that the bin edge is 2011-01-31 00:00:00
>>> resampled_df.index[0]
Timestamp('2011-01-31 00:00:00', freq='M')
Problem description
Switching between closed left/right should affect at most 1 data point in a timeseries, but in the example above it affects an entire day of data.
We were also surprised that ‘1M’ (calendar month end) is defined as last day of month @ start of day.
Expected Output
see comments in example above
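To make the expectation concrete, here is a minimal sketch that counts rows directly with boolean masks instead of going through resample. It assumes the label 2011-01-31 00:00:00 is read literally as the bin edge. The names edge, month_end, literal_left, literal_right and whole_month are illustrative only, and freq="10min" stands in for "10T" because newer pandas versions deprecate the "T" alias.

```python
import numpy as np
import pandas as pd

# Same data as in the report: one point every 10 minutes, both endpoints included.
t = pd.date_range("2011-01-01 00:00:00", "2011-02-01 00:00:00", freq="10min")
df = pd.DataFrame({"y": np.ones(len(t))}, index=t)

# The label reported by resampled_df.index[0], read literally as a bin edge.
edge = pd.Timestamp("2011-01-31 00:00:00")
# The first instant after the calendar month of January.
month_end = pd.Timestamp("2011-02-01 00:00:00")

# Literal edge, left-closed bins: [..., edge) and [edge, ...)
literal_left = (int((df.index < edge).sum()), int((df.index >= edge).sum()))
# Literal edge, right-closed bins: (..., edge] and (edge, ...]
literal_right = (int((df.index <= edge).sum()), int((df.index > edge).sum()))
# Whole calendar month: everything before 2011-02-01 00:00:00, then the rest.
whole_month = (int((df.index < month_end).sum()), int((df.index >= month_end).sum()))

print(literal_left)   # (4320, 145)
print(literal_right)  # (4321, 144)
print(whole_month)    # (4464, 1)
```

The reported closed='left' output (4320 / 145) matches the literal-edge reading, while the reported closed='right' output (4464 / 1) matches the whole-month reading rather than the 4321 / 144 we expected.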
Output of pd.show_versions()
>>> pd.show_versions()
INSTALLED VERSIONS
------------------
commit : 2a7d3326dee660824a8433ffd01065f8ac37f7d6
python : 3.8.1.final.0
python-bits : 64
OS : Linux
OS-release : 4.19.76-linuxkit
Version : #1 SMP Tue May 26 11:42:35 UTC 2020
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.1.2
numpy : 1.19.2
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.1.0
Cython : None
pytest : 5.3.5
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.3.6
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : None
tables : None
tabulate : None
xarray : 0.16.1
xlrd : None
xlwt : None
numba : None
About this issue
- State: open
- Created 4 years ago
- Reactions: 5
- Comments: 19 (1 by maintainers)
I agree with @phofl.
Using the example above, the 1 month bin range would be
Since label does not impact the results but merely the display, we can stick to label='right' for analysis.
If closed='left', then for the 2 cases:
If closed='right', then for the 2 cases:
When we switch between left and right in your example, there will only be 1 data point in the second bin with closed='right'. This is consistent with the documentation and with how I would think of a date range: when the range ends at 2011-01-31, I would intuitively expect every data point for that day, up until midnight, to be included.
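A rough way to check the whole-calendar-month reading described above is to group the rows by calendar month explicitly and compare that with resample. This is only a sketch: pd.offsets.MonthEnd() is passed in place of the '1M' string and freq="10min" in place of "10T", on the assumption that they behave the same way as the aliases used in the report. The variable names by_period and resampled are illustrative.

```python
import numpy as np
import pandas as pd

# Same data as in the report: one point every 10 minutes, both endpoints included.
t = pd.date_range("2011-01-01 00:00:00", "2011-02-01 00:00:00", freq="10min")
df = pd.DataFrame({"y": np.ones(len(t))}, index=t)

# Group by calendar month directly: every timestamp in January 2011,
# including all of 2011-01-31, lands in the first group.
by_period = df.groupby(df.index.to_period("M"))["y"].sum()
print(by_period)

# Month-end resample with closed='right', label='right', as in the report.
resampled = df.resample(pd.offsets.MonthEnd(), closed="right", label="right")["y"].sum()
print(resampled)
```

If the two printed series carry the same values, closed='right' is grouping by whole calendar month, which is the interpretation argued for in this comment.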