pandas: BUG: resample inconsistency with closed left/right

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.



Code Sample, a copy-pastable example

import numpy as np
import pandas as pd

t = pd.date_range("2011-01-01T00:00:00", "2011-02-01T00:00:00", freq="10T")
y = np.ones(len(t))
df = pd.DataFrame(data=y, index=t, columns=['y'])

df

# closed-left (note: pandas documents closed='right' as the default for '1M')
resampled_df = df.resample('1M', closed='left', label='right').sum()

resampled_df

# bin edges are last day of month @ start of day
# this is why the whole day of 2011-01-31 has been included in the second bin
resampled_df.index[0]

# now with closed-right
# I am expecting 1 data point (precisely on the bin edge @ 2011-01-31 00:00:00) to shift into the first bin
resampled_df = df.resample('1M', closed='right', label='right').sum()

resampled_df

# now the first bin includes the whole day of 2011-01-31
# despite the fact that the bin edge is 2011-01-31 00:00:00
resampled_df.index[0]

Executed:

>>> import numpy as np
>>> import pandas as pd
>>> 
>>> t = pd.date_range("2011-01-01T00:00:00", "2011-02-01T00:00:00", freq="10T")
>>> y = np.ones(len(t))
>>> df = pd.DataFrame(data=y, index=t, columns=['y'])
>>> 
>>> df
                       y
2011-01-01 00:00:00  1.0
2011-01-01 00:10:00  1.0
2011-01-01 00:20:00  1.0
2011-01-01 00:30:00  1.0
2011-01-01 00:40:00  1.0
...                  ...
2011-01-31 23:20:00  1.0
2011-01-31 23:30:00  1.0
2011-01-31 23:40:00  1.0
2011-01-31 23:50:00  1.0
2011-02-01 00:00:00  1.0

[4465 rows x 1 columns]
>>> 
>>> # closed-left (note: pandas documents closed='right' as the default for '1M')
>>> resampled_df = df.resample('1M', closed='left', label='right').sum()
>>> 
>>> resampled_df
                 y
2011-01-31  4320.0
2011-02-28   145.0
>>> 
>>> # bin edges are last day of month @ start of day
>>> # this is why the whole day of 2011-01-31 has been included in the second bin
>>> resampled_df.index[0]
Timestamp('2011-01-31 00:00:00', freq='M')
>>> 
>>> # now with closed-right
>>> # I am expecting 1 data point (precisely on the bin edge @ 2011-01-31 00:00:00) to shift into the first bin
>>> resampled_df = df.resample('1M', closed='right', label='right').sum()
>>> 
>>> resampled_df
                 y
2011-01-31  4464.0
2011-02-28     1.0
>>> 
>>> # now the first bin includes the whole day of 2011-01-31
>>> # despite the fact that the bin edge is 2011-01-31 00:00:00
>>> resampled_df.index[0]
Timestamp('2011-01-31 00:00:00', freq='M')

Problem description

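Switching between closed='left' and closed='right' should move at most one data point per bin edge (the point lying exactly on the edge), but in the example above it moves an entire day of data: the first bin grows from 4320 to 4464 points, a difference of 144.

We were also surprised that '1M' (calendar month end) is defined as the last day of the month at the start of the day (00:00:00), as resampled_df.index[0] shows.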
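
The size of the shift can be measured directly. The sketch below is a hedged variant of the example above, not the reporter's exact code: it assumes `pd.offsets.MonthEnd()` is equivalent to the `'1M'` alias and uses `"10min"` in place of `"10T"`, since both string aliases are deprecated in newer pandas releases. The name `moved` is introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Same data as the report; "10min" replaces the "10T" alias.
t = pd.date_range("2011-01-01T00:00:00", "2011-02-01T00:00:00", freq="10min")
df = pd.DataFrame({"y": np.ones(len(t))}, index=t)

# ASSUMPTION: an offset object stands in for the '1M' string alias.
month_end = pd.offsets.MonthEnd()

left = df.resample(month_end, closed="left", label="right")["y"].sum()
right = df.resample(month_end, closed="right", label="right")["y"].sum()

# Number of points that moved into the first bin when switching sides.
moved = int(right.iloc[0] - left.iloc[0])
print(left, right, moved, sep="\n")
```

On the pandas 1.1.2 output shown above, this difference would be 4464 - 4320 = 144; other versions may behave differently, so no expected output is asserted here.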

Expected Output

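See the comments in the example above: with closed='right', only the single data point at 2011-01-31 00:00:00 should shift into the first bin, giving 4321 and 144 rather than 4464 and 1.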

Output of pd.show_versions()

>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : 2a7d3326dee660824a8433ffd01065f8ac37f7d6
python           : 3.8.1.final.0
python-bits      : 64
OS               : Linux
OS-release       : 4.19.76-linuxkit
Version          : #1 SMP Tue May 26 11:42:35 UTC 2020
machine          : x86_64
processor        : 
byteorder        : little
LC_ALL           : None
LANG             : C.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.1.2
numpy            : 1.19.2
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 20.0.2
setuptools       : 45.1.0
Cython           : None
pytest           : 5.3.5
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : 1.3.6
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : 2.8.4 (dt dec pq3 ext lo64)
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : 1.5.2
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : 0.16.1
xlrd             : None
xlwt             : None
numba            : None

About this issue

  • State: open
  • Created 4 years ago
  • Reactions: 5
  • Comments: 19 (1 by maintainers)

Most upvoted comments

I agree with @phofl

Using the example above, the 1-month bin ranges would be:

  1. 2010-12-31 to 2011-01-31
  2. 2011-01-31 to 2011-02-28

Since label does not impact the results but merely the display, we can stick to label='right' for analysis.

If closed='left', then for the 2 cases:

  1. 2011-01-31 bin: 2010-12-31 <= date < 2011-01-31
  2. 2011-02-28 bin: 2011-01-31 <= date < 2011-02-28

If closed='right', then for the 2 cases:

  1. 2011-01-31 bin: 2010-12-31 < date <= 2011-01-31
  2. 2011-02-28 bin: 2011-01-31 < date <= 2011-02-28
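
The interval rules above can be checked without pandas. This is a minimal sketch using only the standard library; it assumes the bin edges are read literally as midnight at the start of each listed day, and `count_per_bin` is a hypothetical helper, not a pandas function.

```python
from datetime import datetime, timedelta

# Same timestamps as the report: every 10 minutes, both endpoints included.
start = datetime(2011, 1, 1)
stop = datetime(2011, 2, 1)
step = timedelta(minutes=10)
stamps = [start + i * step for i in range((stop - start) // step + 1)]

# ASSUMPTION: edges are taken literally, i.e. 00:00:00 on each listed day.
edges = [datetime(2010, 12, 31), datetime(2011, 1, 31), datetime(2011, 2, 28)]


def count_per_bin(stamps, edges, closed):
    """Count stamps in each (lo, hi) bin; `closed` picks the inclusive side."""
    counts = []
    for lo, hi in zip(edges, edges[1:]):
        if closed == "left":
            counts.append(sum(lo <= t < hi for t in stamps))
        else:
            counts.append(sum(lo < t <= hi for t in stamps))
    return counts


left_counts = count_per_bin(stamps, edges, "left")
right_counts = count_per_bin(stamps, edges, "right")
print(left_counts)   # [4320, 145]
print(right_counts)  # [4321, 144]
```

Under this literal reading, exactly one point (2011-01-31 00:00:00) changes bins. The closed='left' counts match the pandas output in the report, while the closed='right' counts do not.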

When we switch from closed='left' to closed='right' in your example, the second bin (2011-02-28) ends up with only 1 data point. This is consistent with the documentation and with how I would think of a date range: when the range ends at 2011-01-31, I would intuitively expect it to include every data point for that day, up until midnight.
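
The "up until midnight" intuition can be sketched the same way. The code below assumes, purely for illustration, that with closed='right' each edge is pushed to the last representable moment of its day (microsecond resolution here). That is an assumption made to match the reported numbers, not a statement about how pandas implements it.

```python
from datetime import datetime, timedelta

# Same timestamps as the report: 4465 stamps, 10 minutes apart.
start = datetime(2011, 1, 1)
step = timedelta(minutes=10)
stamps = [start + i * step for i in range(4465)]  # last one is 2011-02-01 00:00

# Month-end labels as they appear in the report's output (midnight).
labels = [datetime(2010, 12, 31), datetime(2011, 1, 31), datetime(2011, 2, 28)]

# ASSUMPTION: for closed='right', shift every edge to the end of its day.
end_of_day = timedelta(days=1) - timedelta(microseconds=1)
edges = [d + end_of_day for d in labels]

adjusted_right_counts = [
    sum(lo < t <= hi for t in stamps) for lo, hi in zip(edges, edges[1:])
]
print(adjusted_right_counts)  # [4464, 1]
```

These counts match the closed='right' output in the report, which suggests the two settings use different effective edges: start of day for closed='left', end of day for closed='right'. That would explain why a whole day of data moves rather than a single point.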