pandas: Resampling uses inconsistent labeling for sub-daily and super-daily frequencies

xref #2665 xref #5440

Resample appears to use an inconsistent label convention depending on whether the target frequency is sub-daily/daily or super-daily:

  • For sub-daily/daily frequencies, label='left' places each label at the timestamp at the start of its frequency bin, and label='right' places it at that timestamp plus the frequency (the timestamp exactly dividing adjacent bins).
  • For super-daily frequencies, both labels appear to be shifted one day to the left, so the timestamps no longer cleanly divide the bins. Moreover, the default label shifts from 'left' to 'right'! My guess is that the default was changed here because users were confused by label='left' no longer falling inside the expected interval. (I guess I could check git blame for the details.)

I found this behavior quite surprising and confusing. Is it intentional? I would like to rationalize this if possible, because this strikes me as very poor design. The behavior also couples in a weird way with the closed argument (see the linked issues).

From my perspective (as someone who uses monthly and yearly data), the sub-daily/daily behavior makes sense and the super-daily behavior is a bug: there’s no particular reason why it makes sense to use 1 day as an offset for frequencies with super-daily resolution.

CC @Cd48 @kdebrab


Here’s my test script:

import numpy as np
import pandas as pd

for orig_freq, target_freq in [('20s', '1min'), ('20min', '1H'), ('10H', '1D'),
                               ('3D', '10D'), ('10D', '1M'), ('1M', 'Q'), ('3M', 'A')]:
    print('%s -> %s:' % (orig_freq, target_freq))
    ind = pd.date_range('2000-01-01', freq=orig_freq, periods=10)
    s = pd.Series(np.arange(10), ind)
    # .first() replaces the how='first' keyword, which later pandas versions removed
    print('default', s.resample(target_freq).first().index[0])
    print('left', s.resample(target_freq, label='left').first().index[0])
    print('right', s.resample(target_freq, label='right').first().index[0])

Output:
20s -> 1min:
default 2000-01-01 00:00:00
left 2000-01-01 00:00:00
right 2000-01-01 00:01:00
20min -> 1H:
default 2000-01-01 00:00:00
left 2000-01-01 00:00:00
right 2000-01-01 01:00:00
10H -> 1D:
default 2000-01-01 00:00:00
left 2000-01-01 00:00:00
right 2000-01-02 00:00:00
3D -> 10D:
default 2000-01-01 00:00:00
left 2000-01-01 00:00:00
right 2000-01-11 00:00:00
10D -> 1M:
default 2000-01-31 00:00:00
left 1999-12-31 00:00:00
right 2000-01-31 00:00:00
1M -> Q:
default 2000-03-31 00:00:00
left 1999-12-31 00:00:00
right 2000-03-31 00:00:00
3M -> A:
default 2000-12-31 00:00:00
left 1999-12-31 00:00:00
right 2000-12-31 00:00:00

About this issue

  • State: open
  • Created 9 years ago
  • Comments: 16 (16 by maintainers)

Most upvoted comments

OK, after digging more deeply into it… I discovered that M corresponds to the offset “month end”, which apparently means the start of the last day of the month. To get “month start”, I need to use MS (or likewise, QS or AS).

This is… deeply non-intuitive.

I wish there was a way to change this without breaking a bunch of existing code.
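
The month-end versus month-start distinction described above can be checked with a short sketch. It passes the offset objects `pd.offsets.MonthEnd()` and `pd.offsets.MonthBegin()` to `resample` instead of the string aliases, on the assumption that this avoids depending on which alias spellings a given pandas version accepts. The series mirrors the `10D` case from the test script.

```python
import numpy as np
import pandas as pd

# Same data as the '10D -> 1M' case in the test script above.
ind = pd.date_range('2000-01-01', freq='10D', periods=10)
s = pd.Series(np.arange(10), index=ind)

# Month-end bins: by default closed and labelled on the right edge,
# so the first label is the last day of January.
month_end = s.resample(pd.offsets.MonthEnd()).first()

# Month-start bins: by default closed and labelled on the left edge,
# so the first label is the first day of January.
month_start = s.resample(pd.offsets.MonthBegin()).first()

print(month_end.index[0])    # 2000-01-31 00:00:00
print(month_start.index[0])  # 2000-01-01 00:00:00
```

With month-start bins the labels divide the bins cleanly, which matches the sub-daily behavior the issue describes as expected.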

I suppose we could add ME, etc., for month end, but changing the offset M from month-end to month-start seems like a non-starter. Ugh. So I guess we’re left with a documentation issue (https://github.com/pydata/pandas/issues/5023), unless we want to add a hack for resample.

From today’s call: seeing as long-term the idea is to decouple Period from Offsets, both p + pd.offsets.MonthEnd(3) and p + pd.offsets.MonthStart(3) would raise, and so it would probably be best to keep 'M' for Period.
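
As a rough illustration of why 'M' reads naturally for Period: a monthly Period spans the whole month and is not anchored to either end, so the start/end ambiguity does not arise. The sketch below only uses integer arithmetic on the Period and does not exercise the offset additions mentioned in the comment above.

```python
import pandas as pd

# A monthly Period covers the entire month rather than a single timestamp.
p = pd.Period('2000-01', freq='M')

print(p.start_time)  # first instant of January 2000
print(p.end_time)    # last instant of January 2000

# Integer arithmetic moves in whole months, independent of start/end.
later = p + 3
print(later)         # 2000-04
```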

I would like to work on this issue. Considering what @MarcoGorelli suggested:

Concretely, I’d suggest:

* pandas 2.1: passing `freq='M'` warns users that `'M'` is deprecated and to use `'MS'` (month start) or `'ME'` (month end) instead

* pandas 3.0: passing `freq='M'` errors, advising users to use `'MS'` (month start) or `'ME'` (month end) instead

I’ll start with the warning when `freq='M'` is passed.
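
A minimal sketch of what the two-stage plan above could look like. This is not pandas' actual implementation: `check_freq_alias`, `_DEPRECATED_ALIASES`, and the `error` flag are hypothetical names introduced only for illustration.

```python
import warnings

# Hypothetical table: deprecated alias -> (start replacement, end replacement).
_DEPRECATED_ALIASES = {'M': ('MS', 'ME')}


def check_freq_alias(freq, error=False):
    """Warn about a deprecated alias, or raise if error=True (hypothetical helper)."""
    if freq in _DEPRECATED_ALIASES:
        start, end = _DEPRECATED_ALIASES[freq]
        msg = ("'%s' is deprecated; use '%s' (start) or '%s' (end) instead"
               % (freq, start, end))
        if error:
            # Second stage of the plan: passing the old alias is an error.
            raise ValueError(msg)
        # First stage of the plan: the old alias still works but warns.
        warnings.warn(msg, FutureWarning, stacklevel=2)
    return freq


check_freq_alias('MS')  # no warning, returns 'MS'
```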

FWIW in finance the APIs typically use ME for “month end” and M for “monthly” when dealing with scheduling. They might have different contexts but M is never used for month end. I would support the change even if it’s a long way to go.