pandas: Resampling uses inconsistent labeling for sub-daily and super-daily frequencies
Resample appears to use an inconsistent label convention depending on whether the target frequency is sub-daily/daily or super-daily:
- For sub-daily/daily frequencies, label='left' places each label at the timestamp corresponding to the start of its frequency bin, and label='right' places it at that timestamp plus the frequency (the timestamp that exactly divides adjacent bins).
- For super-daily frequencies, both labels appear to be shifted one day to the left, so the timestamps no longer cleanly divide the bins. Moreover, the default label switches from 'left' to 'right'! My guess is that the default was changed here because users were confused by label='left' no longer falling inside the expected interval. (I guess I could check git blame for the details.)
I found this behavior quite surprising and confusing. Is it intentional? I would like to rationalize this if possible, because this strikes me as very poor design. The behavior also couples in a weird way with the closed argument (see the linked issues).
From my perspective (as someone who uses monthly and yearly data), the sub-daily/daily behavior makes sense and the super-daily behavior is a bug: there’s no particular reason why it makes sense to use 1 day as an offset for frequencies with super-daily resolution.
Here’s my test script:
import numpy as np
import pandas as pd

# 'M', 'Q' and 'A' are the month-, quarter- and year-end aliases discussed in this issue.
for orig_freq, target_freq in [('20s', '1min'), ('20min', '1H'), ('10H', '1D'),
                               ('3D', '10D'), ('10D', '1M'), ('1M', 'Q'), ('3M', 'A')]:
    print('%s -> %s:' % (orig_freq, target_freq))
    ind = pd.date_range('2000-01-01', freq=orig_freq, periods=10)
    s = pd.Series(np.arange(10), ind)
    # Originally written as resample(..., how='first'); .first() is the equivalent spelling.
    print('default', s.resample(target_freq).first().index[0])
    print('left', s.resample(target_freq, label='left').first().index[0])
    print('right', s.resample(target_freq, label='right').first().index[0])
20s -> 1min:
default 2000-01-01 00:00:00
left 2000-01-01 00:00:00
right 2000-01-01 00:01:00
20min -> 1H:
default 2000-01-01 00:00:00
left 2000-01-01 00:00:00
right 2000-01-01 01:00:00
10H -> 1D:
default 2000-01-01 00:00:00
left 2000-01-01 00:00:00
right 2000-01-02 00:00:00
3D -> 10D:
default 2000-01-01 00:00:00
left 2000-01-01 00:00:00
right 2000-01-11 00:00:00
10D -> 1M:
default 2000-01-31 00:00:00
left 1999-12-31 00:00:00
right 2000-01-31 00:00:00
1M -> Q:
default 2000-03-31 00:00:00
left 1999-12-31 00:00:00
right 2000-03-31 00:00:00
3M -> A:
default 2000-12-31 00:00:00
left 1999-12-31 00:00:00
right 2000-12-31 00:00:00
About this issue
- State: open
- Created 9 years ago
- Comments: 16 (16 by maintainers)
OK, after digging more deeply into it… I discover that M corresponds to the offset “month end”, which apparently means the start of the last day of the month. To get “month start”, I need to use MS (or likewise, QS or AS).
This is… deeply non-intuitive.
I wish there was a way to change this without breaking a bunch of existing code.
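To make the month-end versus month-start distinction concrete, here is a minimal sketch reusing the 10D series from the test script. It passes offset objects (pd.offsets.MonthEnd and pd.offsets.MonthBegin) to resample rather than string aliases, on the assumption that these correspond to the "month end" and MS ("month start") offsets mentioned above. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Ten observations spaced 10 days apart, starting 2000-01-01 (same as the test script).
ind = pd.date_range("2000-01-01", freq="10D", periods=10)
s = pd.Series(np.arange(10), index=ind)

month_end = pd.offsets.MonthEnd()      # the "month end" offset
month_start = pd.offsets.MonthBegin()  # the "month start" offset (alias MS)

# First label of the resampled index for each offset and label choice.
end_left = s.resample(month_end, label="left").first().index[0]
end_right = s.resample(month_end, label="right").first().index[0]
start_left = s.resample(month_start, label="left").first().index[0]
start_right = s.resample(month_start, label="right").first().index[0]

print("month end,   left :", end_left)
print("month end,   right:", end_right)
print("month start, left :", start_left)
print("month start, right:", start_right)
```

With the month-start offset the labels fall on the first of the month, so 'left' and 'right' divide the bins the same way the sub-daily cases do. With the month-end offset they fall on the last day of the previous and current month, which is the one-day shift reported in the issue.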
I suppose we could add ME, etc., for month end, but changing the offset M from month-end to month-start seems like a non-starter. Ugh. So I guess we’re left with a documentation issue (https://github.com/pydata/pandas/issues/5023), unless we want to add a hack for resample.
From today’s call: seeing as long-term the idea is to decouple Period from Offsets, both p + pd.offsets.MonthEnd(3) and p + pd.offsets.MonthStart(3) would raise, and so it would probably be best to keep 'M' for Period.
I would like to work on this issue. Considering what @MarcoGorelli suggested:
I’ll start by emitting warnings when freq='M' is passed.
FWIW in finance the APIs typically use ME for “month end” and M for “monthly” when dealing with scheduling. The contexts may differ, but M is never used for month end. I would support the change even if it's a long way to go.
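The "monthly" versus "month end" reading can be illustrated with a small sketch. It assumes that a Period with freq='M' represents the whole month as a span, while pd.offsets.MonthEnd moves a timestamp to a single point, the last day of the month. The variable names are illustrative only.

```python
import pandas as pd

# A monthly Period is a span covering all of January 2000.
p = pd.Period("2000-01", freq="M")
span_start = p.start_time              # first instant of the month
span_last_day = p.end_time.normalize() # midnight on the last day of the month
next_period = p + 1                    # the following month, February 2000

# The month-end offset instead rolls a timestamp forward to one specific day.
month_end_stamp = pd.Timestamp("2000-01-15") + pd.offsets.MonthEnd()

print(span_start, span_last_day, next_period, month_end_stamp)
```

Here 'M' on a Period behaves like "monthly", which is consistent with the suggestion above to keep 'M' for Period even if the timestamp offset alias is renamed.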